Sunday Query : use SPARQL and Python to fix typographical errors on Wikidata

My turn to make a #SundayQuery! As Harmonia Amanda just said in her own article, I was about explain how to make a Python script to fix the results of her query… But I thought I should start with another script, similar but shorter and easier to understand. The script for Harmonia is here, though.

On Thursday, I published an article about medieval battles, and since then, I did start to fix battle items on Wikidata. One of the most repetitive fixes is the capitalization of the French labels: as they have been imported from Wikipedia, the labels have an unnecessary capital first letter (“Bataille de Saint-Pouilleux en Binouze” instead of “bataille de Saint-Pouilleux en Binouze”)

The query

So first, we need to find all the items that have this typo:


Some basic explanations :

  • ?item wdt:P31/wdt:P279* wd:Q178561 .  looks for items that are battles or subclasses of battles, just to be sure I’m not making changes to some book called “Bataille de Perpète-les-Olivettes”…
  • On the next line, I query the labels for the items  ?item rdfs:label ?label .  and filter to keep only those in French FILTER(LANG(?label) = "fr") . . As I need to use the label inside the query and not merely for display (as Harmonia Amanda just explained in her article), I cannot use the wikibase:label, and so I use the semantic web standard rdfs:label.
  • The last line is a FILTER , that keeps only those of the results that matches the function inside it. Here, STRSTARTS  checks if ?label  begins with "Bataille " .

As of the time I write this, running the query returns 3521 results. Far too much to fix it by hand, and I know no tool that already exists and would fix that for me. So, I guess it’s Python time!

The Python script

I love Python. I absolutely love Python. The language is great to put up a useful app within minutes, easily readable (It’s basically English, in fact), not cluttered with gorram series of brackets or semicolons, and generally has great libraries for the things I do the most: scraping webpages, parsing and sorting data, checking ISBNs [1]I hope I’ll be able to write something about it sometime soon. and making websites. Oh and making SPARQL queries of course [2]Plus, the examples in the official documentation are Firefly-based. Yes sir, Captain Tightpants..

Two snake charmers with a python and a couple of cobras.
Not to mention that the name of the language has a “snake charmer” side 😉

Preliminary thoughts

If you don’t know Python, this article is not the right place to learn it, but there are numerous resources available online [3]For example, https://www.codecademy.com/learn/python or https://docs.python.org/3.5/tutorial/.. Just make sure they are up-to-date and for Python 3. The rest of this articles assumes that you have a basic understanding of Python (indentation, variables, strings, lists, dictionaries, imports and “for” loops.), and that Python 3 and pip are installed on your system.

Why Python 3? Because we’ll handle strings that come from Wikidata and are thus encoded in UTF-8, and Python 2 makes you jump through some loops to use it. Plus, we are in 2016, for Belenos’ sake.

Why pip? because we need a non-standard library to make SPARQL queries, called SPARQLwrapper, and the easiest way to install it is to use this command:

Now, let’s start scripting!

For a start, let’s just query the full list of the sieges [4]I’ve fixed the battles in the meantime 😉 :

That’s quite a bunch of lines, but what does this script do? As we’ll see, most of this will be included in every script that uses a SPARQL query.

  • First, we import two things from the SPARQLWrapper module: the SPARQLWrapper object itself and a “JSON” that it will use later (don’t worry, you won’t have to manipulate json files yourself.)
  • Next, we create a “endpoint” variable, which contains the full URL to the SPARQL endpoint of Wikidata [5]And not the web access to the endpoint, which is just “https://query.wikidata.org/”.
  • Next, we create a SPARQLWrapper object that will use this endpoint to make queries, and put it in a variable simply called “sparql”.
  • We apply the setQuery function to this variable, which is where we put the query we used earlier. Notice that we need to replace { and } by {{ and }} : { and } are reserved characters in Python strings.
  • sparql.setReturnFormat(JSON)  tells the script that what the endpoint will return is formated in  json.
  • results = sparql.query().convert() actually makes the query to the server and converts the response to a Python dictionary called “results”.
  • And for now, we just want to print the result on screen, just to see what we get.

Let’s open a terminal and launch the script:

That’s a bunch of things, but we can see that it contains a dictionary with two entries:

  • “head”, which contains the name of the two variables returned by the query,
  • and “results”, which itself contains another dictionary with a “bindings” key, associated with a list of the actual results, each of them being a Python dictionary. Pfew…

Let’s examine one of the results:

It is a dictionary that contains two keys (label and item), each of them having for value another dictionary that has a “value” key associated with, this time, the actual value we want to get. Yay, finally!

Parsing the results

Let’s parse the “bindings” list with a Python “for” loop, so that we can extract the value:

Let me explain the  qid = result['item']['value'].split('/')[-1]  line: as the item name is stored as a full url (“https://www.wikidata.org/entity/Q815196” and not just “Q815196”), we need to separate each part of it that is between a ‘/’ character. For this, we use the “split()” function of Python, which transforms the string to a Python list containing this:

We only want the last item in the list. In Python, that means the item with the index -1, hence the [-1] at the end of the line. We then store this in the qid variable.

Let’s launch the script:

Fixing the issue

We are nearly there! Now what we need is to replace this first proud capital “S” initial by a modest “s”:

What is happening here? a Python string works like a list, so we take the part of the string between the beginning of the “label” string and the position after the first character (“label[:1]”) and force it to lower case (“.lower()”). We then concatenate it with the rest of the string (position 1 to the end or “label[1:]”) and assign all this back to the “label” variable.

Last thing, print it in a format that is suitable for QuickStatements:

That first line seems barbaric? it’s in fact pretty straightforward: "{}\tLfr\t{}" is a string that contains a first placeholder for a variable (“{}”), then a tabulation (“\t”), then the QS keyword for the French label (“Lfr”), then another tabulation and finally the second placeholder for a variable. Then, we use the “format()” function to replace the placeholders with the content of the “qid” and “label” variables. The final script should look like this:

Let’s run it:

Yay! All we have to do now is to copy and paste the result to QuickStatements and we are done.

Title picture: Photograph of typefaces by Andreas Praefcke (public domain)















Notes   [ + ]

1. I hope I’ll be able to write something about it sometime soon.
2. Plus, the examples in the official documentation are Firefly-based. Yes sir, Captain Tightpants.
3. For example, https://www.codecademy.com/learn/python or https://docs.python.org/3.5/tutorial/.
4. I’ve fixed the battles in the meantime 😉
5. And not the web access to the endpoint, which is just “https://query.wikidata.org/”

3 thoughts on “Sunday Query : use SPARQL and Python to fix typographical errors on Wikidata”

  1. @GaAs : c’est corrigé, les prochains billets en anglais apparaîtront sur le Planet anglophone. Jusqu’ici les billets en anglais étaient très rares mais maintenant qu’ils sont réguliers, ça valait le coup de mettre en place l’internationalisation du blog.

Leave a Reply

Your email address will not be published. Required fields are marked *