Sunday Query: use SPARQL and Python to fix typographical errors on Wikidata
My turn to make a #SundayQuery! As Harmonia Amanda just said in her own article, I was about explain how to make a Python script to fix the results of her query… But I thought I should start with another script, similar but shorter and easier to understand. The script for Harmonia is here, though.
On Thursday, I published an article about medieval battles, and since then, I did start to fix battle items on Wikidata. One of the most repetitive fixes is the capitalization of the French labels: as they have been imported from Wikipedia, the labels have an unnecessary capital first letter (“Bataille de Saint-Pouilleux en Binouze” instead of “bataille de Saint-Pouilleux en Binouze”)
The query
So first, we need to find all the items that have this typo:
SELECT ?item ?label WHERE { ?item (wdt:P31/(wdt:P279*)) wd:Q178561; rdfs:label ?label. FILTER((LANG(?label)) = "fr") FILTER(STRSTARTS(?label, "Bataille ")) }
Some basic explanations :
?item wdt:P31/wdt:P279* wd:Q178561 .
looks for items that are battles or subclasses of battles, just to be sure I’m not making changes to some book called “Bataille de Perpète-les-Olivettes”…- On the next line, I query the labels for the items
?item rdfs:label ?label .
and filter to keep only those in FrenchFILTER(LANG(?label) = “fr”) .
. As I need to use the label inside the query and not merely for display (as Harmonia Amanda just explained in her article), I cannot use thewikibase:label
, and so I use the semantic web standardrdfs:label
. - The last line is a
FILTER
, that keeps only those of the results that matches the function inside it. Here,STRSTARTS
checks if?label
begins with “Bataille “ .
As of the time I write this, running the query returns 3521 results. Far too much to fix it by hand, and I know no tool that already exists and would fix that for me. So, I guess it’s Python time!
The Python script
I love Python. I absolutely love Python. The language is great to put up a useful app within minutes, easily readable (It’s basically English, in fact), not cluttered with gorram series of brackets or semicolons, and generally has great libraries for the things I do the most: scraping webpages, parsing and sorting data, checking ISBNs [1] and making websites. Oh and making SPARQL queries of course[2] .
Preliminary thoughts
If you don’t know Python, this article is not the right place to learn it, but there are numerous resources available online[3] . Just make sure they are up-to-date and for Python 3. The rest of this articles assumes that you have a basic understanding of Python (indentation, variables, strings, lists, dictionaries, imports and “for” loops.), and that Python 3 and pip are installed on your system.
Why Python 3? Because we’ll handle strings that come from Wikidata and are thus encoded in UTF-8, and Python 2 makes you jump through some loops to use it. Plus, we are in 2016, for Belenos’ sake.
Why pip? because we need a non-standard library to make SPARQL queries, called SPARQLwrapper, and the easiest way to install it is to use this command:
pip install sparqlwrapper
Now, let’s start scripting!
For a start, let’s just query the full list of the sieges[4] :
#!/usr/bin/env python3 from SPARQLWrapper import SPARQLWrapper, JSON endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql" sparql = SPARQLWrapper(endpoint) sparql.setQuery(""" SELECT ?item ?label WHERE {{ ?item wdt:P31/wdt:P279* wd:Q178561 . ?item rdfs:label ?label . FILTER(LANG(?label) = "fr") . FILTER(STRSTARTS(?label, "Siège ")) . }} """) # Link to query: http://tinyurl.com/z8bd26h sparql.setReturnFormat(JSON) results = sparql.query().convert() print(results)
That’s quite a bunch of lines, but what does this script do? As we’ll see, most of this will be included in every script that uses a SPARQL query.
- First, we import two things from the SPARQLWrapper module: the SPARQLWrapper object itself and a “JSON” that it will use later (don’t worry, you won’t have to manipulate json files yourself.)
- Next, we create a “endpoint” variable, which contains the full URL to the SPARQL endpoint of Wikidata[5] .
- Next, we create a SPARQLWrapper object that will use this endpoint to make queries, and put it in a variable simply called “sparql”.
- We apply the setQuery function to this variable, which is where we put the query we used earlier. Notice that we need to replace
{
and}
by{{
and}}
:{
and}
are reserved characters in Python strings. sparql.setReturnFormat(JSON)
tells the script that what the endpoint will return is formated in JSON.results = sparql.query().convert()
actually makes the query to the server and converts the response to a Python dictionary called “results”.- And for now, we just want to print the result on screen, just to see what we get.
Let’s open a terminal and launch the script:
$ python3 fix-battle-labels.py {'head': {'vars': ['item', 'label']}, 'results': {'bindings': [{'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q815196'}, 'label': {'type': 'literal', 'value': 'Siège de Pskov', 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q815207'}, 'label': {'type': 'literal', 'value': 'Siège de Silistra', 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q815233'}, 'label': {'type': 'literal', 'value': 'Siège de Tyr', 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q608163'}, 'label': {'type': 'literal', 'value': 'Siège de Cracovie', 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q1098377'}, 'label': {'type': 'literal', 'value': 'Siège de Narbonne', 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q2065069'}, 'label': {'type': 'literal', 'value': 'Siège de Hloukhiv', 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q4087405'}, 'label': {'type': 'literal', 'value': "Siège d'Avaricum", 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q2284279'}, 'label': {'type': 'literal', 'value': 'Siège de Fort Pulaski', 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q4337397'}, 'label': {'type': 'literal', 'value': 'Siège de Liakhavitchy', 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q4337448'}, 'label': {'type': 'literal', 'value': 'Siège de Smolensk', 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q701067'}, 'label': {'type': 'literal', 'value': 'Siège de Rhodes', 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q7510162'}, 'label': {'type': 'literal', 'value': 'Siège de Cracovie', 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q23013145'}, 'label': {'type': 'literal', 'value': 'Siège de Péronne', 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q10428014'}, 'label': {'type': 'literal', 'value': 'Siège de Pskov', 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q3090571'}, 'label': {'type': 'literal', 'value': 'Siège du Hōjūjidono', 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q3485893'}, 'label': {'type': 'literal', 'value': 'Siège de Fukuryūji', 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q4118683'}, 'label': {'type': 'literal', 'value': "Siège d'Algésiras", 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q5036985'}, 'label': {'type': 'literal', 'value': 'Siège de Berwick', 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q17627724'}, 'label': {'type': 'literal', 'value': "Siège d'Ilovaïsk", 'xml:lang': 'fr'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q815112'}, 'label': {'type': 'literal', 'value': "Siège d'Antioche", 'xml:lang': 'fr'}}]}}
That’s a bunch of things, but we can see that it contains a dictionary with two entries:
- “
head
”, which contains the name of the two variables returned by the query, - and “
results
”, which itself contains another dictionary with a “bindings
” key, associated with a list of the actual results, each of them being a Python dictionary. Pfew…
Let’s examine one of the results:
{'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q17627724'}, 'label': {'type': 'literal', 'value': "Siège d'Ilovaïsk", 'xml:lang': 'fr'}}
It is a dictionary that contains two keys (label
and item
), each of them having for value another dictionary that has a “value
” key associated with, this time, the actual value we want to get. Yay, finally!
Parsing the results
Let’s parse the “bindings” list with a Python “for” loop, so that we can extract the value:
for result in results["results"]["bindings"]: qid = result['item']['value'].split('/')[-1] label = result['label']['value'] print(qid, label)
Let me explain the qid = result[‘item’][‘value’].split(‘/’)[-1]
line: as the item name is stored as a full url (“https://www.wikidata.org/entity/Q815196” and not just “Q815196”), we need to separate each part of it that is between a ‘/’ character. For this, we use the “split()” function of Python, which transforms the string to a Python list containing this:
['https:', '', 'www.wikidata.org', 'entity', 'Q17627724']
We only want the last item in the list. In Python, that means the item with the index -1, hence the [-1]
at the end of the line. We then store this in the qid variable.
Let’s launch the script:
$ python3 fix-battle-labels.py Q815196 Siège de Pskov Q815207 Siège de Silistra Q815233 Siège de Tyr Q608163 Siège de Cracovie Q1098377 Siège de Narbonne Q2065069 Siège de Hloukhiv Q4087405 Siège d'Avaricum Q2284279 Siège de Fort Pulaski Q4337397 Siège de Liakhavitchy Q4337448 Siège de Smolensk Q701067 Siège de Rhodes Q7510162 Siège de Cracovie Q23013145 Siège de Péronne Q10428014 Siège de Pskov Q3090571 Siège du Hōjūjidono Q3485893 Siège de Fukuryūji Q4118683 Siège d'Algésiras Q5036985 Siège de Berwick Q17627724 Siège d'Ilovaïsk Q815112 Siège d'Antioche
Fixing the issue
We are nearly there! Now what we need is to replace this first proud capital “S” initial by a modest “s”:
label = label[:1].lower() + label[1:]
What is happening here? a Python string works like a list, so we take the part of the string between the beginning of the “label” string and the position after the first character (“label[:1]
”) and force it to lower case (“.lower()
”). We then concatenate it with the rest of the string (position 1 to the end or “label[1:]
”) and assign all this back to the “label
” variable.
Last thing, print it in a format that is suitable for QuickStatements:
out = "{}\tLfr\t{}".format(qid, label) print(out)
That first line seems barbaric? it’s in fact pretty straightforward: “{}\tLfr\t{}
” is a string that contains a first placeholder for a variable (“{}
”), then a tabulation (“\t
”), then the QS keyword for the French label (“Lfr
”), then another tabulation and finally the second placeholder for a variable. Then, we use the “format()
” function to replace the placeholders with the content of the “qid
” and “label
” variables. The final script should look like this:
#!/usr/bin/env python3 from SPARQLWrapper import SPARQLWrapper, JSON endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql" sparql = SPARQLWrapper(endpoint) sparql.setQuery(""" SELECT ?item ?label WHERE {{ ?item wdt:P31/wdt:P279* wd:Q178561 . ?item rdfs:label ?label . FILTER(LANG(?label) = "fr") . FILTER(STRSTARTS(?label, "Siège ")) . }} """) # Link to query: http://tinyurl.com/z8bd26h sparql.setReturnFormat(JSON) results = sparql.query().convert() for result in results["results"]["bindings"]: qid = result['item']['value'].split('/')[-1] label = result['label']['value'] label = label[:1].lower() + label[1:] out = "{}\tLfr\t{}".format(qid, label) print(out)
Let’s run it:
$ python3 fix-battle-labels.py Q815196 Lfr siège de Pskov Q815207 Lfr siège de Silistra Q815233 Lfr siège de Tyr Q2065069 Lfr siège de Hloukhiv Q2284279 Lfr siège de Fort Pulaski Q1098377 Lfr siège de Narbonne Q608163 Lfr siège de Cracovie Q4087405 Lfr siège d'Avaricum Q4337397 Lfr siège de Liakhavitchy Q4337448 Lfr siège de Smolensk Q701067 Lfr siège de Rhodes Q10428014 Lfr siège de Pskov Q17627724 Lfr siège d'Ilovaïsk Q23013145 Lfr siège de Péronne Q815112 Lfr siège d'Antioche Q3090571 Lfr siège du Hōjūjidono Q3485893 Lfr siège de Fukuryūji Q4118683 Lfr siège d'Algésiras Q5036985 Lfr siège de Berwick
Yay! All we have to do now is to copy and paste the result to QuickStatements and we are done.
Header image:
Photograph of typefaces by Andreas Praefcke (public domain)