Categories
Python (en) Wikidata (en) Wikimedia projects

Sunday Query : use SPARQL and Python to fix typographical errors on Wikidata

My turn to make a #SundayQuery! As Harmonia Amanda just said in her own article, I was about explain how to make a Python script to fix the results of her query… But I thought I should start with another script, similar but shorter and easier to understand. The script for Harmonia is here, though.

On Thursday, I published an article about medieval battles, and since then, I did start to fix battle items on Wikidata. One of the most repetitive fixes is the capitalization of the French labels: as they have been imported from Wikipedia, the labels have an unnecessary capital first letter (“Bataille de Saint-Pouilleux en Binouze” instead of “bataille de Saint-Pouilleux en Binouze”)

The query

So first, we need to find all the items that have this typo:

SELECT ?item ?label WHERE {
  ?item wdt:P31/wdt:P279* wd:Q178561 .
  ?item rdfs:label ?label . FILTER(LANG(?label) = "fr") .
  FILTER(STRSTARTS(?label, "Bataille ")) .
}

http://tinyurl.com/jljf6xr

Some basic explanations :

  • ?item wdt:P31/wdt:P279* wd:Q178561 .  looks for items that are battles or subclasses of battles, just to be sure I’m not making changes to some book called “Bataille de Perpète-les-Olivettes”…
  • On the next line, I query the labels for the items ?item rdfs:label ?label .  and filter to keep only those in French FILTER(LANG(?label) = “fr”) . . As I need to use the label inside the query and not merely for display (as Harmonia Amanda just explained in her article), I cannot use the wikibase:label, and so I use the semantic web standard rdfs:label.
  • The last line is a FILTER , that keeps only those of the results that matches the function inside it. Here, STRSTARTS  checks if ?label  begins with “Bataille “ .

As of the time I write this, running the query returns 3521 results. Far too much to fix it by hand, and I know no tool that already exists and would fix that for me. So, I guess it’s Python time!

The Python script

I love Python. I absolutely love Python. The language is great to put up a useful app within minutes, easily readable (It’s basically English, in fact), not cluttered with gorram series of brackets or semicolons, and generally has great libraries for the things I do the most: scraping webpages, parsing and sorting data, checking ISBNs ((I hope I’ll be able to write something about it sometime soon.)) and making websites. Oh and making SPARQL queries of course ((Plus, the examples in the official documentation are Firefly-based. Yes sir, Captain Tightpants.)).

Two snake charmers with a python and a couple of cobras.
Not to mention that the name of the language has a “snake charmer” side 😉

Preliminary thoughts

If you don’t know Python, this article is not the right place to learn it, but there are numerous resources available online ((For example, https://www.codecademy.com/learn/python or https://docs.python.org/3.5/tutorial/.)). Just make sure they are up-to-date and for Python 3. The rest of this articles assumes that you have a basic understanding of Python (indentation, variables, strings, lists, dictionaries, imports and “for” loops.), and that Python 3 and pip are installed on your system.

Why Python 3? Because we’ll handle strings that come from Wikidata and are thus encoded in UTF-8, and Python 2 makes you jump through some loops to use it. Plus, we are in 2016, for Belenos’ sake.

Why pip? because we need a non-standard library to make SPARQL queries, called SPARQLwrapper, and the easiest way to install it is to use this command:

pip install sparqlwrapper

Now, let’s start scripting!

For a start, let’s just query the full list of the sieges ((I’ve fixed the battles in the meantime 😉 )):

#!/usr/bin/env python3

from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"

sparql = SPARQLWrapper(endpoint)
sparql.setQuery("""
SELECT ?item ?label WHERE {{
  ?item wdt:P31/wdt:P279* wd:Q178561 .
  ?item rdfs:label ?label . FILTER(LANG(?label) = "fr") .
  FILTER(STRSTARTS(?label, "Siège ")) .
}}
""")  # Link to query: http://tinyurl.com/z8bd26h

sparql.setReturnFormat(JSON)

results = sparql.query().convert()

print(results)

That’s quite a bunch of lines, but what does this script do? As we’ll see, most of this will be included in every script that uses a SPARQL query.

  • First, we import two things from the SPARQLWrapper module: the SPARQLWrapper object itself and a “JSON” that it will use later (don’t worry, you won’t have to manipulate json files yourself.)
  • Next, we create a “endpoint” variable, which contains the full URL to the SPARQL endpoint of Wikidata ((And not the web access to the endpoint, which is just “https://query.wikidata.org/”)).
  • Next, we create a SPARQLWrapper object that will use this endpoint to make queries, and put it in a variable simply called “sparql”.
  • We apply the setQuery function to this variable, which is where we put the query we used earlier. Notice that we need to replace { and } by {{ and }} : { and } are reserved characters in Python strings.
  • sparql.setReturnFormat(JSON)  tells the script that what the endpoint will return is formated in  json.
  • results = sparql.query().convert() actually makes the query to the server and converts the response to a Python dictionary called “results”.
  • And for now, we just want to print the result on screen, just to see what we get.

Let’s open a terminal and launch the script:

$ python3 fix-battle-labels.py 
{'head': {'vars': ['item', 'label']}, 'results': {'bindings': [{'label': {'value': 'Siège de Pskov', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q815196', 'type': 'uri'}}, {'label': {'value': 'Siège de Silistra', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q815207', 'type': 'uri'}}, {'label': {'value': 'Siège de Tyr', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q815233', 'type': 'uri'}}, {'label': {'value': 'Siège de Cracovie', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q608163', 'type': 'uri'}}, {'label': {'value': 'Siège de Narbonne', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q1098377', 'type': 'uri'}}, {'label': {'value': 'Siège de Hloukhiv', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q2065069', 'type': 'uri'}}, {'label': {'value': "Siège d'Avaricum", 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q4087405', 'type': 'uri'}}, {'label': {'value': 'Siège de Fort Pulaski', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q2284279', 'type': 'uri'}}, {'label': {'value': 'Siège de Liakhavitchy', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q4337397', 'type': 'uri'}}, {'label': {'value': 'Siège de Smolensk', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q4337448', 'type': 'uri'}}, {'label': {'value': 'Siège de Rhodes', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q701067', 'type': 'uri'}}, {'label': {'value': 'Siège de Cracovie', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q7510162', 'type': 'uri'}}, {'label': {'value': 'Siège de Péronne', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q23013145', 'type': 'uri'}}, {'label': {'value': 'Siège de Pskov', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q10428014', 'type': 'uri'}}, {'label': {'value': 'Siège du Hōjūjidono', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q3090571', 'type': 'uri'}}, {'label': {'value': 'Siège de Fukuryūji', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q3485893', 'type': 'uri'}}, {'label': {'value': "Siège d'Algésiras", 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q4118683', 'type': 'uri'}}, {'label': {'value': 'Siège de Berwick', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q5036985', 'type': 'uri'}}, {'label': {'value': "Siège d'Ilovaïsk", 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q17627724', 'type': 'uri'}}, {'label': {'value': "Siège d'Antioche", 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q815112', 'type': 'uri'}}]}}

That’s a bunch of things, but we can see that it contains a dictionary with two entries:

  • “head”, which contains the name of the two variables returned by the query,
  • and “results”, which itself contains another dictionary with a “bindings” key, associated with a list of the actual results, each of them being a Python dictionary. Pfew…

Let’s examine one of the results:

{'label': {'value': 'Siège de Pskov', 'type': 'literal', 'xml:lang': 'fr'}, 'item': {'value': 'http://www.wikidata.org/entity/Q815196', 'type': 'uri'}}

It is a dictionary that contains two keys (label and item), each of them having for value another dictionary that has a “value” key associated with, this time, the actual value we want to get. Yay, finally!

Parsing the results

Let’s parse the “bindings” list with a Python “for” loop, so that we can extract the value:

for result in results["results"]["bindings"]:
    qid = result['item']['value'].split('/')[-1]
    label = result['label']['value']

    print(qid, label)

Let me explain the  qid = result[‘item’][‘value’].split(‘/’)[-1]  line: as the item name is stored as a full url (“https://www.wikidata.org/entity/Q815196” and not just “Q815196”), we need to separate each part of it that is between a ‘/’ character. For this, we use the “split()” function of Python, which transforms the string to a Python list containing this:

['https:', '', 'www.wikidata.org', 'entity', 'Q815196']

We only want the last item in the list. In Python, that means the item with the index -1, hence the [-1] at the end of the line. We then store this in the qid variable.

Let’s launch the script:

$ python3 fix-battle-labels.py 
Q815196 Siège de Pskov
Q815207 Siège de Silistra
Q815233 Siège de Tyr
Q608163 Siège de Cracovie
Q1098377 Siège de Narbonne
Q2065069 Siège de Hloukhiv
Q4087405 Siège d'Avaricum
Q2284279 Siège de Fort Pulaski
Q4337397 Siège de Liakhavitchy
Q4337448 Siège de Smolensk
Q701067 Siège de Rhodes
Q7510162 Siège de Cracovie
Q23013145 Siège de Péronne
Q10428014 Siège de Pskov
Q3090571 Siège du Hōjūjidono
Q3485893 Siège de Fukuryūji
Q4118683 Siège d'Algésiras
Q5036985 Siège de Berwick
Q17627724 Siège d'Ilovaïsk
Q815112 Siège d'Antioche

Fixing the issue

We are nearly there! Now what we need is to replace this first proud capital “S” initial by a modest “s”:

label = label[:1].lower() + label[1:]

What is happening here? a Python string works like a list, so we take the part of the string between the beginning of the “label” string and the position after the first character (“label[:1]”) and force it to lower case (“.lower()”). We then concatenate it with the rest of the string (position 1 to the end or “label[1:]”) and assign all this back to the “label” variable.

Last thing, print it in a format that is suitable for QuickStatements:

out = "{}\tLfr\t{}".format(qid, label)
print(out)

That first line seems barbaric? it’s in fact pretty straightforward: “{}\tLfr\t{}” is a string that contains a first placeholder for a variable (“{}”), then a tabulation (“\t”), then the QS keyword for the French label (“Lfr”), then another tabulation and finally the second placeholder for a variable. Then, we use the “format()” function to replace the placeholders with the content of the “qid” and “label” variables. The final script should look like this:

#!/usr/bin/env python3

from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"

sparql = SPARQLWrapper(endpoint)
sparql.setQuery("""
SELECT ?item ?label WHERE {{
  ?item wdt:P31/wdt:P279* wd:Q178561 .
  ?item rdfs:label ?label . FILTER(LANG(?label) = "fr") .
  FILTER(STRSTARTS(?label, "Siège ")) .
}}
""")  # Link to query: http://tinyurl.com/z8bd26h

sparql.setReturnFormat(JSON)

results = sparql.query().convert()

for result in results["results"]["bindings"]:
    qid = result['item']['value'].split('/')[-1]
    label = result['label']['value']

    label = label[:1].lower() + label[1:]

    out = "{}\tLfr\t{}".format(qid, label)
    print(out)

Let’s run it:

$ python3 fix-battle-labels.py 
Q815196	Lfr	siège de Pskov
Q815207	Lfr	siège de Silistra
Q815233	Lfr	siège de Tyr
Q2065069	Lfr	siège de Hloukhiv
Q2284279	Lfr	siège de Fort Pulaski
Q1098377	Lfr	siège de Narbonne
Q608163	Lfr	siège de Cracovie
Q4087405	Lfr	siège d'Avaricum
Q4337397	Lfr	siège de Liakhavitchy
Q4337448	Lfr	siège de Smolensk
Q701067	Lfr	siège de Rhodes
Q10428014	Lfr	siège de Pskov
Q17627724	Lfr	siège d'Ilovaïsk
Q23013145	Lfr	siège de Péronne
Q815112	Lfr	siège d'Antioche
Q3090571	Lfr	siège du Hōjūjidono
Q3485893	Lfr	siège de Fukuryūji
Q4118683	Lfr	siège d'Algésiras
Q5036985	Lfr	siège de Berwick

Yay! All we have to do now is to copy and paste the result to QuickStatements and we are done.

Title picture: Photograph of typefaces by Andreas Praefcke (public domain)

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

3 replies on “Sunday Query : use SPARQL and Python to fix typographical errors on Wikidata”

@GaAs : c’est corrigé, les prochains billets en anglais apparaîtront sur le Planet anglophone. Jusqu’ici les billets en anglais étaient très rares mais maintenant qu’ils sont réguliers, ça valait le coup de mettre en place l’internationalisation du blog.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.