Category Archives: Python

Sunday Query : use SPARQL and Python to fix typographical errors on Wikidata

My turn to make a #SundayQuery! As Harmonia Amanda just said in her own article, I was about explain how to make a Python script to fix the results of her query… But I thought I should start with another script, similar but shorter and easier to understand. The script for Harmonia is here, though.

On Thursday, I published an article about medieval battles, and since then, I did start to fix battle items on Wikidata. One of the most repetitive fixes is the capitalization of the French labels: as they have been imported from Wikipedia, the labels have an unnecessary capital first letter (“Bataille de Saint-Pouilleux en Binouze” instead of “bataille de Saint-Pouilleux en Binouze”)

The query

So first, we need to find all the items that have this typo:

http://tinyurl.com/jljf6xr

Some basic explanations :

  • ?item wdt:P31/wdt:P279* wd:Q178561 .  looks for items that are battles or subclasses of battles, just to be sure I’m not making changes to some book called “Bataille de Perpète-les-Olivettes”…
  • On the next line, I query the labels for the items  ?item rdfs:label ?label .  and filter to keep only those in French FILTER(LANG(?label) = "fr") . . As I need to use the label inside the query and not merely for display (as Harmonia Amanda just explained in her article), I cannot use the wikibase:label, and so I use the semantic web standard rdfs:label.
  • The last line is a FILTER , that keeps only those of the results that matches the function inside it. Here, STRSTARTS  checks if ?label  begins with "Bataille " .

As of the time I write this, running the query returns 3521 results. Far too much to fix it by hand, and I know no tool that already exists and would fix that for me. So, I guess it’s Python time!

The Python script

I love Python. I absolutely love Python. The language is great to put up a useful app within minutes, easily readable (It’s basically English, in fact), not cluttered with gorram series of brackets or semicolons, and generally has great libraries for the things I do the most: scraping webpages, parsing and sorting data, checking ISBNs [1]I hope I’ll be able to write something about it sometime soon. and making websites. Oh and making SPARQL queries of course [2]Plus, the examples in the official documentation are Firefly-based. Yes sir, Captain Tightpants..

Two snake charmers with a python and a couple of cobras.
Not to mention that the name of the language has a “snake charmer” side 😉

Preliminary thoughts

If you don’t know Python, this article is not the right place to learn it, but there are numerous resources available online [3]For example, https://www.codecademy.com/learn/python or https://docs.python.org/3.5/tutorial/.. Just make sure they are up-to-date and for Python 3. The rest of this articles assumes that you have a basic understanding of Python (indentation, variables, strings, lists, dictionaries, imports and “for” loops.), and that Python 3 and pip are installed on your system.

Why Python 3? Because we’ll handle strings that come from Wikidata and are thus encoded in UTF-8, and Python 2 makes you jump through some loops to use it. Plus, we are in 2016, for Belenos’ sake.

Why pip? because we need a non-standard library to make SPARQL queries, called SPARQLwrapper, and the easiest way to install it is to use this command:

Now, let’s start scripting!

For a start, let’s just query the full list of the sieges [4]I’ve fixed the battles in the meantime 😉 :

That’s quite a bunch of lines, but what does this script do? As we’ll see, most of this will be included in every script that uses a SPARQL query.

  • First, we import two things from the SPARQLWrapper module: the SPARQLWrapper object itself and a “JSON” that it will use later (don’t worry, you won’t have to manipulate json files yourself.)
  • Next, we create a “endpoint” variable, which contains the full URL to the SPARQL endpoint of Wikidata [5]And not the web access to the endpoint, which is just “https://query.wikidata.org/”.
  • Next, we create a SPARQLWrapper object that will use this endpoint to make queries, and put it in a variable simply called “sparql”.
  • We apply the setQuery function to this variable, which is where we put the query we used earlier. Notice that we need to replace { and } by {{ and }} : { and } are reserved characters in Python strings.
  • sparql.setReturnFormat(JSON)  tells the script that what the endpoint will return is formated in  json.
  • results = sparql.query().convert() actually makes the query to the server and converts the response to a Python dictionary called “results”.
  • And for now, we just want to print the result on screen, just to see what we get.

Let’s open a terminal and launch the script:

That’s a bunch of things, but we can see that it contains a dictionary with two entries:

  • “head”, which contains the name of the two variables returned by the query,
  • and “results”, which itself contains another dictionary with a “bindings” key, associated with a list of the actual results, each of them being a Python dictionary. Pfew…

Let’s examine one of the results:

It is a dictionary that contains two keys (label and item), each of them having for value another dictionary that has a “value” key associated with, this time, the actual value we want to get. Yay, finally!

Parsing the results

Let’s parse the “bindings” list with a Python “for” loop, so that we can extract the value:

Let me explain the  qid = result['item']['value'].split('/')[-1]  line: as the item name is stored as a full url (“https://www.wikidata.org/entity/Q815196” and not just “Q815196”), we need to separate each part of it that is between a ‘/’ character. For this, we use the “split()” function of Python, which transforms the string to a Python list containing this:

We only want the last item in the list. In Python, that means the item with the index -1, hence the [-1] at the end of the line. We then store this in the qid variable.

Let’s launch the script:

Fixing the issue

We are nearly there! Now what we need is to replace this first proud capital “S” initial by a modest “s”:

What is happening here? a Python string works like a list, so we take the part of the string between the beginning of the “label” string and the position after the first character (“label[:1]”) and force it to lower case (“.lower()”). We then concatenate it with the rest of the string (position 1 to the end or “label[1:]”) and assign all this back to the “label” variable.

Last thing, print it in a format that is suitable for QuickStatements:

That first line seems barbaric? it’s in fact pretty straightforward: "{}\tLfr\t{}" is a string that contains a first placeholder for a variable (“{}”), then a tabulation (“\t”), then the QS keyword for the French label (“Lfr”), then another tabulation and finally the second placeholder for a variable. Then, we use the “format()” function to replace the placeholders with the content of the “qid” and “label” variables. The final script should look like this:

Let’s run it:

Yay! All we have to do now is to copy and paste the result to QuickStatements and we are done.

Title picture: Photograph of typefaces by Andreas Praefcke (public domain)

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Enregistrer

Notes   [ + ]

1. I hope I’ll be able to write something about it sometime soon.
2. Plus, the examples in the official documentation are Firefly-based. Yes sir, Captain Tightpants.
3. For example, https://www.codecademy.com/learn/python or https://docs.python.org/3.5/tutorial/.
4. I’ve fixed the battles in the meantime 😉
5. And not the web access to the endpoint, which is just “https://query.wikidata.org/”