Category Archives: Wikidata

Happy Birthday, Wikidata!

Wikidata turns four today, and I thought it would be a good time to look back on my experience with the project.[1]

Wikidata is a project I was enthusiastic about long before it went live. I had read the name here and there on the “bistro” (village pump) of the French Wikipedia since I started editing Wikipedia in 2005, but the first time it got my attention, IIRC, was in that discussion in November 2006.[2] A common repository for data, working the same way as Commons, seemed to me the biggest thing Wikipedia was missing! Another useful thing discussed at that time was that it could entirely replace the flawed category system Wikipedia used at the time (oh wait… still uses). That functionality is one I’m still impatient to see happen.

Time passed and I became more involved in the community, to the point I jumped on the Wikitrain to Gdańsk in 2010 to attend my first Wikimania.

[Image: dinner on the Wikitrain, Poznań]
*Record scratch* Yup, that’s me. You’re probably wondering how I ended up in this situation.

 

There, one of the talks I remember the most[3] was about a centralized system for sitelinks. That talk made no mention of storing data other than sitelinks, and maybe some system to have readable labels in different languages, but those two things became “Phase I” of Wikidata development. If I check my edit log, when I made my first edit, on the 17th of December, 2012, Wikidata was a seven-week-old baby project that only managed these two things: Wikipedia’s sitelinks and labels. So that’s what I did: I created a new entry, gave it a couple of labels and moved the sitelinks there from Wikipedia.

[Screenshot]
Ta-da!

And that was about it at the time. As there was no possibility to add more useful statements at that point, I left the mass import of sitelinks to bots and their masters. After that, and for a long time, I made a few edits here and there, but no big projects. At some point, I wanted to do more and asked myself what to do… Back in 2006 or 2007, I remade the list of Emperors of Japan on the French Wikipedia, so I thought it would be a good idea to have the basic data about them on Wikidata as well (names, dates of birth/reign/death, places of birth/death, etc.).

I used very few tools at that time, so it was long and tedious (and if it weren’t for Magnus’ missing props.js and wikidata useful.js,[4] it would have been a lot harder), but I managed to see the end of it.

[Screenshot]

My next burst of activity was sometime later, when I decided to create every single celestial body from the Serenity/Firefly universe.[5] Doing it with Wikidata’s interface only would have been pure madness, so I decided to use another of Magnus’s tools: QuickStatements. If you’ve ever used it, you know that a textarea in which to paste a blob of tab-separated values isn’t the most practical interface ever, so that time I decided to create a tool of my own: a CSV to QuickStatements converter that lets you use a more familiar interface.

After that, among other things, I decided to import all xkcd episodes, and also helped Harmonia Amanda with her work on sled dog races and drama schools. This required me to dig further into Python and write scrapers and data processing scripts.

I know you all are xkcd fans.

In the meantime, I also completed the Semantic Web MOOC from Inria, in which I learnt a lot about a lot of things, and in particular, SPARQL. And that’s when things went crazy. When the SPARQL endpoint for Wikidata went live, I wrote an article here about SPARQL. People started to ask me questions about it, or to write queries for them, so that article became the first of a series which continues to this day with the #SundayQuery.

If I sum up, in these four years:

  • I started to write my own tools and offer them to the community
  • I created scrapers and converters in Python
  • I learnt a lot about the semantic web
  • I became a SPARQL ninja.

All that because of Wikidata. And I hope this project will continue to make me do this kind of crazy thing.

Now, if you’ll excuse me, I think there is some birthday cake waiting for me.

Header image: Wikidata Birthday Cake by Jason Krüger (CC-BY-SA 4.0)


Notes

1. Plus, to be fair, Auregann asked me to do it.
2. By the way, while searching for this discussion, the oldest mention of Wikidata I found on the French Wikipedia dates from August 2004!
3. Apart from the ones with alcohol involved.
4. Which I personalized for my needs.
5. Yeah, in case you didn’t know, I’m a Browncoat.

Sunday Query: use SPARQL and Python to fix typographical errors on Wikidata

My turn to make a #SundayQuery! As Harmonia Amanda just said in her own article, I was about to explain how to write a Python script to fix the results of her query… But I thought I should start with another script, similar but shorter and easier to understand. The script for Harmonia is here, though.

On Thursday, I published an article about medieval battles, and since then, I have started fixing battle items on Wikidata. One of the most repetitive fixes is the capitalization of the French labels: as they have been imported from Wikipedia, the labels have an unnecessary capital first letter (“Bataille de Saint-Pouilleux en Binouze” instead of “bataille de Saint-Pouilleux en Binouze”).

The query

So first, we need to find all the items that have this typo:

http://tinyurl.com/jljf6xr

Some basic explanations:

  • ?item wdt:P31/wdt:P279* wd:Q178561 . looks for items that are instances of battle, or of any subclass of battle, just to be sure I’m not making changes to some book called “Bataille de Perpète-les-Olivettes”…
  • On the next line, I query the labels of the items with ?item rdfs:label ?label . and filter to keep only those in French with FILTER(LANG(?label) = "fr") . As I need to use the label inside the query and not merely for display (as Harmonia Amanda just explained in her article), I cannot use the wikibase:label service, and so I use the semantic web standard rdfs:label.
  • The last line is a FILTER that keeps only the results that match the function inside it. Here, STRSTARTS checks whether ?label begins with "Bataille " .
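Put together, the three pieces explained above give a query along these lines. This is only a sketch, stored in a Python string so it can be reused by a script: the name BATTLE_QUERY is hypothetical, and the SELECT line is inferred from the two variables (?item and ?label) used in the explanations. The wd:, wdt: and rdfs: prefixes are predefined by the Wikidata query service, so they are not declared here.

```python
# Sketch of the query described above, kept in a plain Python string.
# BATTLE_QUERY is a hypothetical name; the SELECT clause is inferred from
# the two variables (?item and ?label) used in the explanations.
BATTLE_QUERY = """
SELECT ?item ?label WHERE {
  ?item wdt:P31/wdt:P279* wd:Q178561 .
  ?item rdfs:label ?label . FILTER(LANG(?label) = "fr") .
  FILTER(STRSTARTS(?label, "Bataille ")) .
}
"""

print(BATTLE_QUERY)
```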

As I write this, running the query returns 3521 results. Far too many to fix by hand, and I know of no existing tool that would fix that for me. So, I guess it’s Python time!

The Python script

I love Python. I absolutely love Python. The language is great for putting together a useful app within minutes, easily readable (it’s basically English, in fact), not cluttered with gorram series of brackets or semicolons, and generally has great libraries for the things I do the most: scraping webpages, parsing and sorting data, checking ISBNs [1] and making websites. Oh, and making SPARQL queries of course [2].

Two snake charmers with a python and a couple of cobras.
Not to mention that the name of the language has a “snake charmer” side 😉

Preliminary thoughts

If you don’t know Python, this article is not the right place to learn it, but there are numerous resources available online [3]. Just make sure they are up to date and for Python 3. The rest of this article assumes that you have a basic understanding of Python (indentation, variables, strings, lists, dictionaries, imports and “for” loops), and that Python 3 and pip are installed on your system.

Why Python 3? Because we’ll handle strings that come from Wikidata and are thus encoded in UTF-8, and Python 2 makes you jump through some hoops to use them. Plus, we are in 2016, for Belenos’ sake.

Why pip? Because we need a non-standard library to make SPARQL queries, called SPARQLWrapper, and the easiest way to install it is to use this command: pip install sparqlwrapper

Now, let’s start scripting!

For a start, let’s just query the full list of the sieges [4]:

That’s quite a bunch of lines, but what does this script do? As we’ll see, most of this will be included in every script that uses a SPARQL query.

  • First, we import two things from the SPARQLWrapper module: the SPARQLWrapper object itself and a “JSON” that it will use later (don’t worry, you won’t have to manipulate JSON files yourself).
  • Next, we create an “endpoint” variable, which contains the full URL to the SPARQL endpoint of Wikidata [5].
  • Next, we create a SPARQLWrapper object that will use this endpoint to make queries, and put it in a variable simply called “sparql”.
  • We apply the setQuery function to this variable, which is where we put the query we used earlier. Notice that { and } have to be doubled to {{ and }} whenever the query string goes through format(), where single braces mark placeholders; in a plain string they can stay single.
  • sparql.setReturnFormat(JSON) tells the script that what the endpoint will return is formatted in JSON.
  • results = sparql.query().convert() actually makes the query to the server and converts the response to a Python dictionary called “results”.
  • And for now, we just want to print the result on screen, just to see what we get.
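The steps above can be sketched as follows. Treat it as a minimal sketch under a few assumptions rather than the exact script: sieges are assumed to be found through wd:Q188055, their French labels are assumed to start with "Siège ", and the function name run_query and its parameters are hypothetical. The query is a plain string here, so its braces stay single.

```python
# Minimal sketch of the script described above. Assumptions: sieges are
# found through wd:Q188055 and their French labels start with "Siège ".
endpoint = "https://query.wikidata.org/sparql"

query = """
SELECT ?item ?label WHERE {
  ?item wdt:P31/wdt:P279* wd:Q188055 .
  ?item rdfs:label ?label . FILTER(LANG(?label) = "fr") .
  FILTER(STRSTARTS(?label, "Siège ")) .
}
"""


def run_query(endpoint_url, query_text):
    """Send the query to the endpoint and return the response as a dict."""
    # Imported here so the rest of the file loads even when the
    # third-party package is not installed yet.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery(query_text)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()
```

Calling print(run_query(endpoint, query)) performs the actual request, so it needs network access and the SPARQLWrapper package.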

Let’s open a terminal and launch the script:

That’s a bunch of things, but we can see that it contains a dictionary with two entries:

  • “head”, which contains the names of the two variables returned by the query,
  • and “results”, which itself contains another dictionary with a “bindings” key, associated with a list of the actual results, each of them being a Python dictionary. Phew…

Let’s examine one of the results:

It is a dictionary that contains two keys (label and item), each of them having as its value another dictionary that has a “value” key associated with, this time, the actual value we want to get. Yay, finally!
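To make that nesting easier to picture, here is a hand-written sample with the same shape as the response, and the way to reach the values in it. The label is made up for illustration and is not copied from a real query result; the "type" and "xml:lang" entries are the ones the standard SPARQL JSON format adds next to each value.

```python
# Hand-written sample with the same shape as the endpoint's JSON response.
# The label is illustrative, not copied from a real query result.
results = {
    "head": {"vars": ["item", "label"]},
    "results": {
        "bindings": [
            {
                "item": {
                    "type": "uri",
                    "value": "https://www.wikidata.org/entity/Q815196",
                },
                "label": {
                    "type": "literal",
                    "xml:lang": "fr",
                    "value": "Siège de Saint-Pouilleux en Binouze",
                },
            }
        ]
    },
}

first = results["results"]["bindings"][0]
print(results["head"]["vars"])  # names of the variables returned
print(first["item"]["value"])   # full URL of the item
print(first["label"]["value"])  # the French label itself
```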

Parsing the results

Let’s parse the “bindings” list with a Python “for” loop, so that we can extract the value:

Let me explain the qid = result['item']['value'].split('/')[-1] line: as the item name is stored as a full URL (“https://www.wikidata.org/entity/Q815196” and not just “Q815196”), we need to cut it at each ‘/’ character. For this, we use the “split()” function of Python, which transforms the string to a Python list containing this:

We only want the last item in the list. In Python, that means the item with the index -1, hence the [-1] at the end of the line. We then store this in the qid variable.
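A small sketch of that loop, run on a hand-written stand-in for the “bindings” list (the label is made up for illustration), shows both the intermediate list produced by split() and the final value of qid:

```python
# Hand-written stand-in for results["results"]["bindings"].
bindings = [
    {
        "item": {"value": "https://www.wikidata.org/entity/Q815196"},
        "label": {"value": "Siège de Saint-Pouilleux en Binouze"},
    },
]

for result in bindings:
    # Intermediate step, shown only to make the split visible.
    parts = result["item"]["value"].split("/")
    print(parts)  # ['https:', '', 'www.wikidata.org', 'entity', 'Q815196']

    # Same thing in one line, as in the script: keep the last element.
    qid = result["item"]["value"].split("/")[-1]
    label = result["label"]["value"]
    print(qid, label)
```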

Let’s launch the script:

Fixing the issue

We are nearly there! Now what we need is to replace this proud capital “S” with a modest “s”:

What is happening here? A Python string can be sliced like a list, so we take the part of the string between the beginning of the “label” string and the position after the first character (“label[:1]”) and force it to lower case (“.lower()”). We then concatenate it with the rest of the string (position 1 to the end, or “label[1:]”) and assign all this back to the “label” variable.
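As a quick illustration of that slicing, on a made-up label:

```python
label = "Siège de Saint-Pouilleux en Binouze"  # made-up label

# First character forced to lower case, rest of the string left untouched.
label = label[:1].lower() + label[1:]
print(label)  # siège de Saint-Pouilleux en Binouze
```

Slicing with [:1] rather than indexing with [0] also means an empty label does not raise an error; it simply stays empty.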

Last thing, print it in a format that is suitable for QuickStatements:

Does that first line seem barbaric? It’s in fact pretty straightforward: "{}\tLfr\t{}" is a string that contains a first placeholder for a variable (“{}”), then a tab character (“\t”), then the QS keyword for the French label (“Lfr”), then another tab and finally the second placeholder for a variable. Then, we use the “format()” function to replace the placeholders with the content of the “qid” and “label” variables. The final script should look like this:
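Here is a sketch of how the parsing, fixing and formatting steps can fit together, run on a hand-written sample instead of a live query result. The function name to_quickstatements and the sample label are assumptions for illustration; in the real script, the list would come from results["results"]["bindings"].

```python
def to_quickstatements(bindings):
    """Turn SPARQL result bindings into QuickStatements label lines."""
    lines = []
    for result in bindings:
        # Keep only the Q-number at the end of the item URL.
        qid = result["item"]["value"].split("/")[-1]
        label = result["label"]["value"]
        # Lower-case the first character, leave the rest untouched.
        label = label[:1].lower() + label[1:]
        # qid, a tab, the French-label keyword, a tab, the fixed label.
        lines.append("{}\tLfr\t{}".format(qid, label))
    return lines


# Hand-written sample standing in for results["results"]["bindings"].
sample = [
    {
        "item": {"value": "https://www.wikidata.org/entity/Q815196"},
        "label": {"value": "Siège de Saint-Pouilleux en Binouze"},
    },
]

for line in to_quickstatements(sample):
    print(line)
```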

Let’s run it:

Yay! All we have to do now is to copy and paste the result to QuickStatements and we are done.

Title picture: Photograph of typefaces by Andreas Praefcke (public domain)


Notes

1. I hope I’ll be able to write something about it sometime soon.
2. Plus, the examples in the official documentation are Firefly-based. Yes sir, Captain Tightpants.
3. For example, https://www.codecademy.com/learn/python or https://docs.python.org/3.5/tutorial/.
4. I’ve fixed the battles in the meantime 😉
5. And not the web access to the endpoint, which is just “https://query.wikidata.org/”