
Happy Birthday, Wikidata!

Wikidata turns four today, and I thought it would be a good time to look back on my experience with the project.((Plus, to be fair, Auregann asked me to do it.))

Wikidata is a project I was enthusiastic about long before it went live. I had read the name here and there on the “bistro” (village pump) of the French Wikipedia since I started editing Wikipedia in 2005, but the first time it really caught my attention, if I recall correctly, was a discussion in November 2006.((By the way, while searching for this discussion, the oldest mention of Wikidata I found on the French Wikipedia dates from August 2004!)) A common repository for data, working the same way as Commons, seemed to me to be the biggest thing missing from Wikipedia! Another useful idea discussed at the time was that it could entirely replace the flawed category system Wikipedia used back then (oh wait… still uses). That is a feature I’m still impatient to see happen.

Time passed and I became more involved in the community, to the point that I jumped on the Wikitrain to Gdańsk in 2010 to attend my first Wikimania.

*Record scratch* Yup, that’s me. You’re probably wondering how I ended up in this situation.


There, one of the talks I remember the most((Apart from the ones with alcohol involved.)) was about a centralized system for sitelinks. That talk made no mention of storing any data other than sitelinks, plus maybe some system to have readable labels in different languages, but those two things became “Phase I” of Wikidata development. If I check my edit log, my first edit was on the 17th of December, 2012, when Wikidata was a seven-week-old baby project that only handled those two things: Wikipedia’s sitelinks and labels. So that’s what I did: I created a new entry, gave it a couple of labels and moved the sitelinks over from Wikipedia.


And that was about it at the time. As there was no way to add more useful statements at that point, I left the mass import of sitelinks to bots and their masters. After that, and for a long time, I made a few edits here and there, but no big projects. At some point, I wanted to do more and asked myself what to do… Back in 2006 or 2007, I had redone the list of Emperors of Japan on the French Wikipedia, so I thought it would be a good idea to have the basic data about them on Wikidata as well (names, dates of birth/reign/death, places of birth/death, etc.)

I used very few tools at that time, so it was long and tedious (if it weren’t for Magnus’ missing props.js and wikidata useful.js,((Which I personalized for my needs.)) it would have been a lot harder), but I managed to see it through.


My next burst of activity came some time later, when I decided to create every single celestial body from the Serenity/Firefly universe.((Yeah, in case you didn’t know, I’m a Browncoat.)) Doing it with Wikidata’s interface alone would have been pure madness, so I decided to use another of Magnus’s tools: QuickStatements. If you’ve ever used it, you know that a textarea into which to paste a blob of tab-separated values isn’t the most practical interface ever, so this time I decided to create a tool of my own: a CSV to QuickStatements converter that allows a more familiar workflow.

After that, among other things, I decided to import all xkcd episodes, and also helped Harmonia Amanda with her work on sled dog races and drama schools. This required me to dig further into Python and write scrapers and data processing scripts.

I know you all are xkcd fans.

In the meantime, I also completed the Semantic Web MOOC from Inria, in which I learnt a lot about a lot of things, and in particular about SPARQL. And that’s when things went crazy. When the SPARQL endpoint for Wikidata went live, I wrote an article here about SPARQL. People started to ask me questions about it, or asked me to write queries for them, so that article became the first of a series that continues to this day with the #SundayQuery.

If I sum up, in these four years:

  • I started to write my own tools and offer them to the community
  • I created scrapers and converters in Python
  • I learnt a lot about the semantic web
  • I became a SPARQL ninja.

All that because of Wikidata. And I hope this project will continue to make me do this kind of crazy thing.

Now, if you’ll excuse me, I think there is some birthday cake waiting for me.

Header image: Wikidata Birthday Cake by Jason Krüger (CC-BY-SA 4.0)










Sunday Query : use SPARQL and Python to fix typographical errors on Wikidata

My turn to make a #SundayQuery! As Harmonia Amanda just said in her own article, I was about to explain how to make a Python script to fix the results of her query… But I thought I should start with another script, similar but shorter and easier to understand. The script for Harmonia is here, though.

On Thursday, I published an article about medieval battles, and I have since started to fix battle items on Wikidata. One of the most repetitive fixes is the capitalization of the French labels: as they were imported from Wikipedia, the labels have an unnecessary capital first letter (“Bataille de Saint-Pouilleux en Binouze” instead of “bataille de Saint-Pouilleux en Binouze”).

The query

So first, we need to find all the items that have this typo:
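Based on the explanations that follow, the query looks something like this (a sketch; Q178561 is “battle”):

```sparql
SELECT ?item ?label WHERE {
  ?item wdt:P31/wdt:P279* wd:Q178561 .
  ?item rdfs:label ?label .
  FILTER(LANG(?label) = "fr")
  FILTER(STRSTARTS(?label, "Bataille "))
}
```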

Some basic explanations:

  • ?item wdt:P31/wdt:P279* wd:Q178561 . looks for items that are battles or subclasses of battles, just to be sure I’m not making changes to some book called “Bataille de Perpète-les-Olivettes”…
  • On the next line, I query the labels of the items with ?item rdfs:label ?label . and filter to keep only those in French with FILTER(LANG(?label) = "fr") . As I need to use the label inside the query and not merely for display (as Harmonia Amanda just explained in her article), I cannot use the wikibase:label service, so I use the semantic web standard rdfs:label.
  • The last line is a FILTER, which keeps only the results that match the function inside it. Here, STRSTARTS checks whether ?label begins with "Bataille ".

As of the time of writing, running the query returns 3521 results. Far too many to fix by hand, and I know of no existing tool that would fix that for me. So, I guess it’s Python time!

The Python script

I love Python. I absolutely love Python. The language is great for putting together a useful app within minutes, easily readable (it’s basically English, in fact), not cluttered with gorram series of brackets or semicolons, and generally has great libraries for the things I do the most: scraping webpages, parsing and sorting data, checking ISBNs((I hope I’ll be able to write something about it sometime soon.)) and making websites. Oh, and making SPARQL queries, of course.((Plus, the examples in the official documentation are Firefly-based. Yes sir, Captain Tightpants.))

Two snake charmers with a python and a couple of cobras.
Not to mention that the name of the language has a “snake charmer” side 😉

Preliminary thoughts

If you don’t know Python, this article is not the right place to learn it, but there are numerous resources available online.((Just make sure they are up-to-date and for Python 3.)) The rest of this article assumes that you have a basic understanding of Python (indentation, variables, strings, lists, dictionaries, imports and “for” loops), and that Python 3 and pip are installed on your system.

Why Python 3? Because we’ll handle strings that come from Wikidata and are thus encoded in UTF-8, and Python 2 makes you jump through hoops to use it. Plus, it’s 2016, for Belenos’ sake.

Why pip? Because we need a non-standard library to make SPARQL queries, called SPARQLWrapper, and the easiest way to install it is to use this command:
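On most systems, that command is (assuming pip targets your Python 3 install):

```shell
pip3 install sparqlwrapper
```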

Now, let’s start scripting!

For a start, let’s just query the full list of sieges((I’ve fixed the battles in the meantime 😉 )):
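A sketch of that first script (assuming Q188055 as the “siege” class and the public Wikidata endpoint; it needs network access to run):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Full URL of the Wikidata SPARQL endpoint
endpoint = "https://query.wikidata.org/sparql"

sparql = SPARQLWrapper(endpoint)
sparql.setQuery("""
SELECT ?item ?label WHERE {
  ?item wdt:P31/wdt:P279* wd:Q188055 .
  ?item rdfs:label ?label .
  FILTER(LANG(?label) = "fr")
  FILTER(STRSTARTS(?label, "Siège "))
}
""")
sparql.setReturnFormat(JSON)

# Run the query and convert the JSON response to a Python dictionary
results = sparql.query().convert()
print(results)
```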

That’s quite a bunch of lines, but what does this script do? As we’ll see, most of this will be included in every script that uses a SPARQL query.

  • First, we import two things from the SPARQLWrapper module: the SPARQLWrapper object itself and a “JSON” object that it will use later (don’t worry, you won’t have to manipulate JSON files yourself).
  • Next, we create an “endpoint” variable, which contains the full URL of the SPARQL endpoint of Wikidata ((And not the web access to the endpoint, which is just “”)).
  • Next, we create a SPARQLWrapper object that will use this endpoint to make queries, and put it in a variable simply called “sparql”.
  • We apply the setQuery function to this variable, which is where we put the query we used earlier. Notice that if you build the query with format(), you need to replace { and } with {{ and }}: they are reserved characters in Python format strings.
  • sparql.setReturnFormat(JSON) tells the script that what the endpoint returns is formatted in JSON.
  • results = sparql.query().convert() actually makes the query to the server and converts the response to a Python dictionary called “results”.
  • And for now, we just want to print the result on screen, just to see what we get.

Let’s open a terminal and launch the script:

That’s a bunch of things, but we can see that it contains a dictionary with two entries:

  • “head”, which contains the name of the two variables returned by the query,
  • and “results”, which itself contains another dictionary with a “bindings” key, associated with a list of the actual results, each of them being a Python dictionary. Phew…

Let’s examine one of the results:
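One binding looks roughly like this (illustrative values, not an actual result):

```python
# One element of results['results']['bindings'], roughly
result = {
    'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q815196'},
    'label': {'type': 'literal', 'xml:lang': 'fr', 'value': 'Siège de Calais'},
}
print(result['label']['value'])
```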

It is a dictionary that contains two keys (label and item), each of which has as its value another dictionary with a “value” key associated with, this time, the actual value we want. Yay, finally!

Parsing the results

Let’s parse the “bindings” list with a Python “for” loop, so that we can extract the value:
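A sketch of that loop, shown here on a miniature “results” structure so it is self-contained (the real script reuses the dictionary returned by the endpoint):

```python
# A miniature 'results' structure like the one returned by the endpoint
results = {'results': {'bindings': [
    {'item': {'value': 'http://www.wikidata.org/entity/Q815196'},
     'label': {'value': 'Siège de Calais'}},
]}}

for result in results['results']['bindings']:
    qid = result['item']['value'].split('/')[-1]   # full URL -> bare Q-id
    label = result['label']['value']
    print(qid, label)
```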

Let me explain the qid = result['item']['value'].split('/')[-1] line: as the item name is stored as a full URL (“” and not just “Q815196”), we need to split it at each ‘/’ character. For this, we use Python’s split() function, which transforms the string into a Python list containing this:
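For the example item Q815196, whose entity URL is http://www.wikidata.org/entity/Q815196, the split gives:

```python
['http:', '', 'www.wikidata.org', 'entity', 'Q815196']
```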

We only want the last item in the list. In Python, that means the item with the index -1, hence the [-1] at the end of the line. We then store this in the qid variable.

Let’s launch the script:

Fixing the issue

We are nearly there! Now we need to replace that proud capital “S” at the beginning with a modest “s”:
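For example, with one of our siege labels (a made-up one, in the spirit of the article):

```python
label = "Siège de Saint-Pouilleux en Binouze"

# first character lowered, rest of the string untouched
label = label[:1].lower() + label[1:]
print(label)  # siège de Saint-Pouilleux en Binouze
```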

What is happening here? A Python string can be sliced like a list, so we take the part of the string between the beginning of the “label” string and the position after the first character (“label[:1]”) and force it to lower case (“.lower()”). We then concatenate it with the rest of the string (position 1 to the end, or “label[1:]”) and assign all this back to the “label” variable.

Last thing, print it in a format that is suitable for QuickStatements:
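For instance, with hypothetical values:

```python
qid = "Q815196"    # hypothetical item id
label = "siège de Saint-Pouilleux en Binouze"

# one QuickStatements line: Qid <tab> Lfr <tab> new label
print("{}\tLfr\t{}".format(qid, label))
```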

Does that format string seem barbaric? It’s in fact pretty straightforward: "{}\tLfr\t{}" is a string that contains a first placeholder for a variable (“{}”), then a tab character (“\t”), then the QuickStatements keyword for the French label (“Lfr”), then another tab and finally a second placeholder. Then, we use the format() function to replace the placeholders with the contents of the “qid” and “label” variables. The final script should look like this:
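Put together, a sketch of the complete script (same assumptions as before: the SPARQLWrapper package, the live Wikidata endpoint, and Q188055 as the “siege” class):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = "https://query.wikidata.org/sparql"

sparql = SPARQLWrapper(endpoint)
sparql.setQuery("""
SELECT ?item ?label WHERE {
  ?item wdt:P31/wdt:P279* wd:Q188055 .
  ?item rdfs:label ?label .
  FILTER(LANG(?label) = "fr")
  FILTER(STRSTARTS(?label, "Siège "))
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for result in results['results']['bindings']:
    qid = result['item']['value'].split('/')[-1]   # full URL -> bare Q-id
    label = result['label']['value']
    label = label[:1].lower() + label[1:]          # lower-case the first letter
    print("{}\tLfr\t{}".format(qid, label))        # QuickStatements output
```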

Let’s run it:

Yay! All we have to do now is to copy and paste the result to QuickStatements and we are done.

Title picture: Photograph of typefaces by Andreas Praefcke (public domain)
















Sunday Query: all surnames marked as disambiguation pages, with an English Wikipedia link and with “(surname)” in their English label

It’s Sunday again! Time for the queries! Last week I showed you the basics of SPARQL; this week I wanted to show you how we can use SPARQL to do maintenance work. I assume you now understand the use of PREFIX, SELECT and WHERE.

I have been a member of the WikiProject:Names for years. When I’m not working on Broadway and the Royal Academy of Dramatic Art archives,((Yes, I really want you to read this one.)) I am one of the people who ensure that “given name:Christopher (Iowa)” is transformed back to “given name:Christopher (given name)”. Over the last few weeks I’ve corrected thousands of wrong uses of the given name/family name properties, and for this work I used dozens of SPARQL queries. I thought it could be interesting to show how I used SPARQL to create a list of strictly identical errors that I could then treat automatically.

What are we searching for?

If you read the constraint violations reports, you’ll see that the most frequent error for the property “family name” (P734) is the use of a disambiguation page as a value instead of a family name. We can write a query like this:
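A sketch of such a query (Q4167410 is “Wikimedia disambiguation page”; the label service is only there to make the results readable):

```sparql
SELECT ?person ?personLabel ?value ?valueLabel WHERE {
  ?person wdt:P734 ?value .
  ?value wdt:P31 wd:Q4167410 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```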

link to the query. The results are in the thousands. Sigh.

But then we find something more interesting: there are entities which are both a disambiguation page and a family name. What?! That’s ontologically wrong. Using the wrong value as a family name is human error; but an entity can’t be both a specific type of Wikimedia page and a family name. It’s like saying a person could also be a book. Ontologically absurd. So all items with both P31 values need to be corrected. How many are there?
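A sketch of that second query (Q101352 is “family name”):

```sparql
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q4167410 ;
        wdt:P31 wd:Q101352 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```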

link to the query.
Several thousands again. Actually, there are more entities which are both a disambiguation page and a family name than there are people using disambiguation pages as family names. This means there are family name/disambiguation page items in the database which aren’t used. They’re still wrong, but it doesn’t show in the constraint violations reports.

If we explore, we see that there are different cases out there: some of the family name/disambiguation pages are in reality disambiguation pages, some are family names, some are both (they link to articles on different Wikipedias, some about a disambiguation page and some about a family name; these need to be separated). Too many different possibilities: we can’t automate the correction. Well… maybe we can.

Narrowing our search

If we can’t treat all disambig/family name pages in one go, maybe we can ask a more precise question. In our long list of violations, I asked for the English labels and found some unbelievable ones. There were items named “Poe (surname)”. As disambiguation pages. That’s a wrong use of labels, which shouldn’t have clarifications about the subject in brackets (that’s what the description is for), but if they are about a surname they shouldn’t be disambiguation pages either! So, so wrong.

Querying labels

But still, good news! We can isolate these entries. For that, we’ll have to query not the relations between items but the labels of the items. Until now, we had used the SERVICE wikibase:label workaround, a tool which only exists on the Wikidata endpoint, because it was really easy and we only wanted human-readable results, not to actually query labels. But now that we want to, the workaround isn’t enough; we’ll need to do it the real SPARQL way, using rdfs.

Our question now is: can I list all items which are both family names and disambiguation pages, whose English label contains “(surname)”?
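Translated into SPARQL, that question might look like this (a sketch using rdfs:label instead of the label service):

```sparql
SELECT DISTINCT ?item ?label WHERE {
  ?item wdt:P31 wd:Q4167410 ;
        wdt:P31 wd:Q101352 ;
        rdfs:label ?label .
  FILTER(LANG(?label) = "en")
  FILTER(CONTAINS(?label, "(surname)"))
}
```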

link to the query. We had several hundred results.((Then. We had several hundred results then. I’m happy to say it isn’t true now.)) You should note the changes I made in the SELECT DISTINCT, as I don’t use the SERVICE wikibase:label workaround.

Querying sitelinks

Can we automate the correction now? Well… no. There are still problems. In this list, there are items which have links to several Wikipedias, the English one about the surname and the other(s) about a disambiguation page. Worse, there are items which don’t have an English interwiki any longer, because it was deleted or linked to another item (like the “real” family name item) and the wrong English label persisted. So maybe we can filter our list to keep only items with a link to the English Wikipedia. For this, we’ll use schema.

link to the query. Well, that’s better! But our problem is still here: if they have several sitelinks, maybe the other sitelinks are not about the family name. So we want the items with an English interwiki and only an English interwiki. Like this:
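A sketch of how such a query could be built, counting sitelinks with GROUP BY/HAVING as described below (the exact original query may differ):

```sparql
SELECT ?item ?label ?WParticle (COUNT(?sitelink) AS ?linkcount) WHERE {
  ?item wdt:P31 wd:Q4167410 ;
        wdt:P31 wd:Q101352 ;
        rdfs:label ?label .
  FILTER(LANG(?label) = "en")
  FILTER(CONTAINS(?label, "(surname)"))
  ?sitelink schema:about ?item .
  ?WParticle schema:about ?item ;
             schema:isPartOf <https://en.wikipedia.org/> .
}
GROUP BY ?item ?label ?WParticle
HAVING(COUNT(?sitelink) = 1)
```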

link to the query.

Several things: we separated ?sitelink and ?WParticle. We use ?sitelink to count the sitelinks, and ?WParticle to query the particulars of that sitelink. Note that we need to use GROUP BY, like last week.

Polishing the query

Just to be on the safe side (we are never safe enough before automating corrections), we’ll also check that all items on our list are only family name/disambiguation pages, and not also marked as a location or something equally strange. So we query that they have only two P31 (instance of) statements, these two being Q101352 (family name) and Q4167410 (disambiguation page).
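Adding that check to the previous sketch could look like this (counting distinct P31 values per item; again a sketch, not necessarily the exact original query):

```sparql
SELECT ?item ?label ?WParticle WHERE {
  ?item wdt:P31 wd:Q4167410 ;
        wdt:P31 wd:Q101352 ;
        wdt:P31 ?class ;
        rdfs:label ?label .
  FILTER(LANG(?label) = "en")
  FILTER(CONTAINS(?label, "(surname)"))
  ?sitelink schema:about ?item .
  ?WParticle schema:about ?item ;
             schema:isPartOf <https://en.wikipedia.org/> .
}
GROUP BY ?item ?label ?WParticle
HAVING(COUNT(?sitelink) = 1 && COUNT(DISTINCT ?class) = 2)
```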

link to the query.

It should give you a beautiful “no matching records found”. Yesterday, it gave me 175 items which I knew I could correct automatically. Which I have done, with a Python script made by Ash_Crow. If you are good, he’ll write a #MondayScript in response to this #SundayQuery!

(Main picture: Name List of Abhiseka, by Kūkai – Public Domain.)


Sunday Query: The 200 Oldest Living French Actresses

Hello, here’s Harmonia Amanda again (I think Ash_Crow just agreed to let me squat his blog indefinitely). This article shouldn’t be too long, for once.((Well, nothing compared to my previous one, which all of you should have read.)) It started as a hashtag on Twitter, #SundayQuery, where I wrote SPARQL queries for people who couldn’t find how to ask their question. So this article, and the next ones if I keep this up, are basically a kind of “how to translate a question into SPARQL step by step” tutorial.

The question this week was asked by Jean-No: who are the oldest French actresses still alive?

To begin: the endpoint and PREFIX…

SPARQL is the language you use to query a semantic database. For that, you use an endpoint, a service that accepts SPARQL queries and returns results. There are many SPARQL endpoints around, but we will be using the Wikidata endpoint.

A semantic database consists of triples of information: subject, predicate, object (in Wikidata, we usually say item, property, value, but it’s the exact same thing). As querying with full URIs all the time would be really tedious, SPARQL needs PREFIX.
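For Wikidata, the usual declarations look like this (shown here for reference; the endpoint already knows them):

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bd: <http://www.bigdata.com/rdf#>
```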

These are the standard prefixes used to query Wikidata. If used, they should be stated at the top of the query. You can add other prefixes as needed. Actually, the Wikidata endpoint was created specifically to query Wikidata, so all these prefixes are already declared, even if you don’t see them. But just because they aren’t visible doesn’t mean they aren’t there, so always remember: SPARQL NEEDS PREFIX. There.

All French women

The first thing to declare after the prefixes is what we want as results. We can use the command SELECT or the command SELECT DISTINCT (this one asks to clear off duplicates) and then list the variables.

As we are looking for humans, we’ll call this variable “person” (but we could choose anything we want).

We will then define the conditions we want our variable to meet, with WHERE. We are seeking the oldest French actresses still alive. We need to cut that into little understandable bits. So the first step is: we are querying for human beings (and not fictional actresses). Human beings are all items which have the property “instance of” (P31) with the value “human” (Q5).
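In SPARQL, that condition reads like this:

```sparql
SELECT ?person WHERE {
  ?person wdt:P31 wd:Q5 .
}
```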


This query will return all humans in the database.((Well, if it doesn’t time out through the sheer number.)) But we don’t want all humans, we want only the female ones. So we want humans who also have “sex or gender” (P21) with the value “female” (Q6581072).
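We simply add the second condition to the WHERE block:

```sparql
SELECT ?person WHERE {
  ?person wdt:P31 wd:Q5 .
  ?person wdt:P21 wd:Q6581072 .
}
```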

And we want them French as well! So with “country of citizenship” (P27) with the value “France” (Q142).
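Which gives us three conditions:

```sparql
SELECT ?person WHERE {
  ?person wdt:P31 wd:Q5 .
  ?person wdt:P21 wd:Q6581072 .
  ?person wdt:P27 wd:Q142 .
}
```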

Here it is, all French women. We could also write the query this way:
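Using SPARQL’s semicolon shorthand, which repeats the subject:

```sparql
SELECT ?person WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P21 wd:Q6581072 ;
          wdt:P27 wd:Q142 .
}
```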

It’s exactly the same query.

All French actresses

Well, that seems like a good beginning, but we don’t want all women, we only want actresses. It could be as simple as “occupation” (P106) “actor” (Q33999), but it’s not. In reality, Q33999 doesn’t cover all actors and actresses: it’s a class. Subclasses, like “stage actor”, “television actor” or “film actor”, all have “subclass of” (P279) “actor” (Q33999). We don’t only want the humans with occupation:actor; we also want those with occupation:stage actor and so on.

So we need to introduce another variable, which I’ll call ?occupation.
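The extra variable links the person’s occupation to the “actor” class (P279* also matches “actor” itself, zero steps away):

```sparql
SELECT ?person WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P21 wd:Q6581072 ;
          wdt:P27 wd:Q142 ;
          wdt:P106 ?occupation .
  ?occupation wdt:P279* wd:Q33999 .
}
```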

Actually, if we really wanted to avoid the ?occupation variable, we could have written:
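That version uses a property path to chain P106 and P279* directly:

```sparql
SELECT ?person WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P21 wd:Q6581072 ;
          wdt:P27 wd:Q142 ;
          wdt:P106/wdt:P279* wd:Q33999 .
}
```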

It’s the same thing, but I find it less clear for beginners.

Still alive

So now we need to filter by age. The first step is to ensure they have a birth date (P569). As we don’t care (for now) what this birth date is, we’ll introduce a new variable instead of setting a value.

Now we want the French actresses still alive, which we’ll translate to “without a ‘date of death’ (P570)”. For that, we’ll use a filter:
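Putting the birth date variable and the filter together:

```sparql
SELECT ?person WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P21 wd:Q6581072 ;
          wdt:P27 wd:Q142 ;
          wdt:P106/wdt:P279* wd:Q33999 ;
          wdt:P569 ?birthDate .
  FILTER NOT EXISTS { ?person wdt:P570 ?deathDate . }
}
```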

Here it is, all French actresses still alive!

The Oldest

To find the oldest, we’ll need to order our results. So after the WHERE section of our query, we’ll add another request: now that you have found my results, SPARQL, can you order them the way I want? Logically, this is called ORDER BY. We can “order by” any variable we want, but to order by a variable, we need to select this variable first.

The obvious way would be to order by ?birthDate. The thing is, Wikidata sometimes has more than one birth date for a person because of conflicting sources, which translates into the same people appearing twice. So we’ll group the people by their Qid (using GROUP BY), so that the duplicates now form a group. Then we use SAMPLE (in our SELECT) to take only the first birth date we find in each group… and then ORDER BY it:
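Which gives something like this:

```sparql
SELECT ?person (SAMPLE(?birthDate) AS ?date) WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P21 wd:Q6581072 ;
          wdt:P27 wd:Q142 ;
          wdt:P106/wdt:P279* wd:Q33999 ;
          wdt:P569 ?birthDate .
  FILTER NOT EXISTS { ?person wdt:P570 ?deathDate . }
}
GROUP BY ?person
ORDER BY ?date
```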

This gives us all French actresses ordered by date of birth, oldest first. We can already see that we have problems with the data: actresses whose date of birth we don’t know (with “unknown value”) and actresses manifestly dead for centuries but whose date of death we don’t know. But still! Here is our query, and we have answered Jean-No’s question as best we could.

With labels

Well, we certainly answered the query, but this big list of Qs and numbers isn’t very human-readable. We should ask to have the labels too! We could do this the proper SPARQL way, using rdfs and such, but we are querying Wikidata on the Wikidata endpoint, so we’ll use the local tool instead.

We add to the query:
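That is, the label service line, inside the WHERE block:

```sparql
SERVICE wikibase:label { bd:serviceParam wikibase:language "fr,en" . }
```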

which means we ask to have the labels in French (as we are querying French people), and if a label doesn’t exist in French, then in English. But just adding that doesn’t work: we need to SELECT it too! (And to add it to the GROUP BY.) So:
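The query now looks like this:

```sparql
SELECT ?person ?personLabel (SAMPLE(?birthDate) AS ?date) WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P21 wd:Q6581072 ;
          wdt:P27 wd:Q142 ;
          wdt:P106/wdt:P279* wd:Q33999 ;
          wdt:P569 ?birthDate .
  FILTER NOT EXISTS { ?person wdt:P570 ?deathDate . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "fr,en" . }
}
GROUP BY ?person ?personLabel
ORDER BY ?date
```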

Tada! It’s much more understandable now!

Make it pretty

I don’t want all living French actresses, including the children; I want the oldest. So I add a LIMIT to the results.((And it will be quicker that way.))

And why should I have a table with identifiers and names when I could have the results as a timeline? We ask that at the top, even before the SELECT: ((Beware that this doesn’t work on all SPARQL endpoints out there, even if it exists on the Wikidata one.))
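On the Wikidata endpoint, that’s a comment-style directive above the SELECT:

```sparql
#defaultView:Timeline
```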

Wait, wait! Can I have pictures too? But only if they have a picture; I still want the results if they don’t have one (and I want only one picture if they happen to have several, so I use SAMPLE again). YES WE CAN:
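Putting everything together, the final query might look like this (a sketch; P18 is “image”):

```sparql
#defaultView:Timeline
SELECT ?person ?personLabel (SAMPLE(?birthDate) AS ?date) (SAMPLE(?picture) AS ?image) WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P21 wd:Q6581072 ;
          wdt:P27 wd:Q142 ;
          wdt:P106/wdt:P279* wd:Q33999 ;
          wdt:P569 ?birthDate .
  FILTER NOT EXISTS { ?person wdt:P570 ?deathDate . }
  OPTIONAL { ?person wdt:P18 ?picture . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "fr,en" . }
}
GROUP BY ?person ?personLabel
ORDER BY ?date
LIMIT 200
```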

Here is the link to the results. I hope you had fun with this SundayQuery!

(Main picture: Mademoiselle Ambroisine – created between 1832 and 1836 by an artist whose name is unreadable – Public Domain.)


Ben Whishaw, Broadway, the RADA and Wikidata (in English and with updates)

Hello everyone! Here is Harmonia Amanda, squatting Ash_Crow’s blog. Some people told me repeatedly that I should write about some of what I did these last few months on Wikidata, e.g. all my work on the RADA (Royal Academy of Dramatic Art) and other things. And after I wrote it in French, some people told me I should write it again in English. So here we are! To ensure that no one will read it, I wrote a long text, stuffed with footnotes,((Like this one.)) and even with real SPARQL queries((If you came for the SPARQL queries, you should know that they were made for the Wikidata endpoint, where most PREFIX are already declared.)) here and there. No need to thank me.((But if you have already read it in French, you can still read it again: I updated it!))

How it begins: The Hollow Crown

Everything is because of Ben Whishaw. I was quietly watching Shakespeare adaptations by the BBC (and for those who haven’t watched The Hollow Crown, I suggest you do so) and I was thinking that the actor playing Richard II deserved an award for his role, because he was simply extraordinary.((He isn’t the only one; Rory Kinnear is also excellent as Bolingbroke (and Patrick Stewart makes a terrific Gaunt ♥), but Bolingbroke is interesting in Shakespeare’s play, while Richard II is the guy who declaims overlong monologues without clearly being a good guy we could get attached to or a bad guy we could unashamedly hate. In my experience, in the previous adaptations I saw, it was either a very tiring character or a character played so badly it became funny. He really develops as a character in the fourth act, which is a bit late, to be honest. But Whishaw as a child-king-turned-into-an-adult-but-not-really, sometimes capricious, sometimes christic, always mercurial, made me believe in this character long before the fourth act. I could write an entire blog article about The Hollow Crown, its actors, costumes and sets (and the cinematography! it was really good too), but I’m theoretically here to speak about Wikidata and the RADA, and you’ll see it’ll be long in coming…)) ((Actually, he did receive a BAFTA award for this role, so my opinion was somewhat shared, it seems.)) So I went lurking on his French Wikipedia page((Yes, I’m French. In case the fact that I wrote the first version of this text in French escaped you.)) and, as a good Wikimedian,((Yes, I’m not only French but a Wikimedian at that.)) I decided to make it a little bit better.
For now,((Yes, I’m collecting sources and references to rewrite the article completely from scratch, but first of all, I have never written an article about a living person, and secondly, I’ve been somewhat busy since then, as you’ll see if you keep reading.)) I’ve mostly cleaned up the wikicode and dealt with accessibility for screen readers. As I couldn’t instantaneously make it a featured article, I thought it could be fun to complete his Wikidata entry. That was the beginning. As I said, everything is because of Ben Whishaw.

Ben Whishaw in 2008 by KikeValencia – CC-BY-SA

Wikidata : the easy beginning

Wikidata is a free knowledge database with some twenty million entries, under a free license. It’s not made to be directly read by humans (although they can),((Actually, if you want to read Wikidata and not modify it, I strongly suggest you use Reasonator, whose slogan is “Wikidata in pretty”.)) but to be machine-readable, and to be used in other projects through visualisation tools.((If you don’t know Wikidata at all and are curious, this Commons category hosts several presentations.)) I am an experienced Wikidatian by now, so at first working on Whishaw’s entry seemed easy.

I just had to add more precise occupations (he isn’t just an “actor”, he is a “stage actor”, a “television actor”, a “film actor”…). He received many awards, which should all be listed (P166), along with, for each of them, the year it was awarded (P585), for which work (P1686) and even sometimes with whom the award was shared (P1706). And I could do the same work for all the awards he was nominated for (P1411) but didn’t receive. Then I could also list all his roles, which we don’t add to his Wikidata entry but to the works’ entries, using “P161 (cast member)” with “Q342617 (Ben Whishaw)” as the value. Sometimes we can even use qualifiers, like “P453 (character role)” when the characters themselves have a Wikidata entry (like Q in James Bond).((Yes, because Whishaw also played in James Bond, and by the way there are many Shakespearean actresses and actors in the latest James Bond films.))

Wikidata screenshot

So far, so easy. Well, the thing is, Whishaw is primarily a stage actor. I mean, he became well-known for his heartbreaking interpretation of Hamlet at 23 at the Old Vic.((Yes, because Richard II wasn’t even his first Shakespearean role, and not even his second, as he played Ariel in the film adaptation of The Tempest with Helen Mirren as Prosper(a). It isn’t a really good film, frankly, but it needs to be seen because Mirren and Whishaw.)) It’s a bit strange to see all his TV and film roles listed but not his theatrical ones (Mojo, Bakkhai…). So I started digging into theatre on Wikidata and let me tell you… it’s at least as neglected and messy as on Wikipedia! Which is saying something.((It’s very marginally better on the English Wikipedia than on the French Wikipedia, but my point still stands.))

Old Vic Theatre by MrsEllacott – CC-BY-SA 3.0.

Here would be the perfect place to speak about ontologies, the semantic web and questions of knowledge organisation, but the consensus among my beta-readers is that my article is already too long and I should focus on the RADA (which is a long time coming) and speak of everything else another time.((And honestly, you don’t really need to understand all of that to understand what I do, or even to do the same yourself. It’s fascinating and compelling, and if you are interested that’s pretty great, but don’t be frightened or put off if you aren’t.))

The Internet Broadway Database

While I was thinking about the relations between “art”, “work”, “genre” and “performance”,((If by chance you are a specialist in upper ontologies, I would be honoured to talk to you.)) I learned that Whishaw is now((Well, uh, he was when I first wrote this article, but the run has ended now.)) on Broadway, where he plays John Proctor in Arthur Miller‘s The Crucible, directed by Ivo van Hove.((You are maybe totally indifferent to this information, but I would have so loved to go to Broadway to see it…)) What’s interesting for all of us Wikimedians is that Broadway already has an excellent database (IBDB, the Internet Broadway Database). Well done, decently complete, with a limited number of errors;((Not only does the IBDB have really few errors, they correct them in less than a week when we tell them.)) oh joy! And even better: Wikidata already has properties to link to this database (and not only for people; the properties also exist for venues, works and productions).((Yes, I created the Wikidata entry for The Crucible production.))

Walter Kerr Theatre, ad for Grey Garden – Michael J Owens CC-BY 2.0

Of course, no one had properly exploited this database before, and there were many errors in the Wikidatian uses. So I cleaned up each and every use of these properties on Wikidata.((Not sure if it’ll make you laugh, but it was the first time I saw a property where all uses were wrong. P1218 has earned an entry in my personal wtf? ratings. It’s cleaned up now, obviously.)) And on Wikipedia, because that’s where the errors came from.((In my experience, if there is an error on Wikidata, it nearly always comes from Wikipedia, and the IBDB was no exception.)) I complained about the Wikipedians who add absurd references (or worse, don’t add references at all), who aren’t philosophically unnerved when they add a production identifier to a work entry, or who even seem to think that the IBDB identifier is the same as the IMDb (Internet Movie Database) identifier (oh hell NO!),((Catalan Wikipedia, I see you!)) but, as I am a Wikimedian, I cleaned up nevertheless.

I came to the conclusion that it would be better if, instead of having some correct links, we linked all the entries. Going from “I-worked-on-Ben-Whishaw-so-I-searched-his-IBDB-identifier” to “this is the complete list of IBDB identifiers, let’s find the matching Wikidata entries”. For our joy, there is a truly marvellous tool called Mix n’ Match. ((Which also exists in a mobile version for people wanting to complete Wikidata on their phones (and who have a phone that can reach the internet).)) Here again I could do a detailed presentation of this tool, but to keep this article in scope I’ll just say it needs the complete list of valid identifiers before it can work; therefore I started hoarding them all. ((With help from fellow Wikimedians. They are, objectively, fantastic people.)) As it wasn’t an instantaneous process, ((But I’m done now for this English version of the article! Yay me!)) I needed something to do in the meantime. For those of you willing to give a hand, you can help me match IBDB entries to Wikidata entries: you can do it for works or for people. Do it carefully, and if you are not sure, don’t. Thank you, any help is always appreciated.

While all my scripts were running, I didn’t know exactly what to do with myself, so I went back to Whishaw’s entry ((It’s still his fault.)) and noticed he was a RADA (Royal Academy of Dramatic Art) alumnus. ((Who said “finally!”?))



The cool thing about Wikidata ((In fact, there are many cool things about Wikidata.)) is that not only can we add where people studied (P69), but we can even add numerous details: when they started studying there (P580), when they stopped (P582), what degree they were preparing (P512), their academic major (P812)… There were no references. I didn’t like that at all, so I searched for some. I thought: why not try the school’s website? And then… RADA!

RADA Theatre, Malet Street, London — CC-BY-SA 2.0

Yes. The RADA had put the profiles of its alumni online. Here is Whishaw’s page for the curious. ((Where we learn he listed “cat breeding” as a special interest, very important information, admit it, which I couldn’t add on Wikidata. I am disappointment incarnate.)) Anyway, I was seeking a source and I found a goldmine. My inner Wikimedian went a little dizzy with happiness ((I will deny that I cooed at my computer seeing that. Well, maybe not deny. But it was in a dignified sort of way!)) and I told myself that I now had not only a reference for Whishaw, but references for all RADA alumni, with their year of graduation, their degree, everything, and that I could do mad statistics based on SPARQL queries! ((SPARQL is the language with which you can ask questions of a semantic database, and it answers. And I love SPARQL.)) (And that it would keep me occupied while I retrieved the identifiers of all the people who ever worked on a Broadway show.) ((Did you know that more than ~150,000 people have worked on Broadway?))

Naively, I thought that the RADA didn’t have so many alumni (approximately a hundred a year in recent years) and so it wouldn’t take me too much time… ((I’m a deeply optimistic kind of person.))

Identification of the relevant entries

On Wikidata

To start, I tried to find out what already existed on Wikidata. I wrote a little query to find all the existing Wikidata entries with P69:Q523926 (educated at: Royal Academy of Dramatic Art). I cross-checked with the English category. It turned out that someone had, a few months earlier, added P69:Q523926 to all the entries categorised as “Alumni of the Royal Academy of Dramatic Art”. ((An action also known as “how to upload Wikipedias’ errors to Wikidata”, see previous note.)) At that time I had no intention of writing this blog post, so I didn’t bother writing down the actual number anywhere, but it was around 650, with a very small gap between the Wikidata query and the English category (so, as a working hypothesis, only a few Wikidata entries without articles on the English Wikipedia). There were more entries listed by the Wikidata query than there were articles in the category (which is logical), but all the categorised articles were correctly present in the Wikidata list. Not too bad as a start.
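For the record, such a query is a one-liner; here is a sketch of what it could look like (not necessarily the exact query behind the link), using the P69/Q523926 pair mentioned above:

```sparql
# All items marked as educated at (P69) the Royal Academy of Dramatic Art (Q523926)
SELECT ?student ?studentLabel WHERE {
  ?student wdt:P69 wd:Q523926 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```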

To follow my progress, I only had to do two queries: the first one to list all RADA alumni and the second one to list all RADA alumni with a year of graduation (which would mean that someone (me) had added the necessary information).

So behold the first SPARQL queries of this article:

link to query.


link to query.

Easy, as I said.
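In case the links above don’t survive, here is a sketch of the second query (alumni with a year of graduation), reaching the end-date qualifier through the p:/ps:/pq: prefixes; the first query is the same without the pq:P582 line:

```sparql
# RADA alumni whose "educated at" statement carries an end date (P582) qualifier
SELECT ?student ?studentLabel ?end WHERE {
  ?student p:P69 ?statement .
  ?statement ps:P69 wd:Q523926 ;
             pq:P582 ?end .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?end
```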

There were already four or five students for whom we had the “end date” information, but without a reference, or with a reference other than the RADA. I decided not to worry about it and to treat these cases along with the others.

On Wikipedia

I had already noted that the whole English category “Alumni of the Royal Academy of Dramatic Art” had the property P69 “educated at” with the RADA value (Q523926) on Wikidata. I knew there were more entries on Wikidata than in the category: where did the difference come from? From uncategorized English articles? ((Spoiler: Yes, in part.)) From Wikidata entries without a matching article in English? ((Spoiler: Yes, that too.))

The category also exists on Wikipedias in other languages: in Spanish, in Arabic, in French, in Latin, in Polish, in Russian, in Simple English, in Turkish and in Chinese. But if you visit these pages, you will see they are far less complete than the English one (which is logical for a London school) and that they would probably not help me much. ((Since then, I have done a little work on the French one, so it’s not as bad now as it was back then.))

However, the category isn’t the only way to spot students. The English Wikipedia also has a list (List of RADA alumni). This list ((A list which is organized manually and not as a sortable table! It’s totally stupid, as we can’t easily sort by year, for example, or by diploma. Urrrgh. (I created the French list at the end of July 2016 as a sortable, referenced table, just because.))) is interesting because it contains, between brackets, the year of graduation, information missing from the category.

Assuming that all articles present in the category were also on the list, or that all the entries in the list were categorized, was too much to hope for, it seems. Once more, Wikipedia dazzles us with its inconsistent curation; if there are two systems, of course they won’t match!

Identification: From RADA to Wikidata

I thought the easiest way to begin was to go through the RADA database and search for matching entries on Wikidata and Wikipedia. There are indeed many RADA alumni well-known enough to have a Wikipedia article, but not all of them, let’s not exaggerate. In an ideal world where Wikidata and Wikipedia had reached completion, once I had checked all the RADA database entries, I should have formally identified the approximately 670 Wikidata entries previously spotted. But as we don’t live in an ideal world, and as neither Wikidata nor Wikipedia claims to be complete, I knew before I started that it would very probably not be so easy.

Manual search, name by name

At first I thought I would simply search Wikidata for each and every student name listed in the RADA database and hope to find a match, starting with 1906, the first year with graduates listed, ((In this precise case, only one.)) as the school opened in 1904.

The problems with this painstakingly slow method appeared very quickly.

In 1907 for example, the only student listed is “H Bentley”. For a search with this name, the Wikidata internal search engine only returns entries labelled “H Bentley” or “H. Bentley”. Not “Henry Bentley”, “Harriet Bentley” or whatever. If I had been lucky, someone would have added “H Bentley” as an alias on the Wikidata entry and the search engine would have yielded a result. As I was unlucky but stubborn, I still tried a query like this (not a SPARQL one; it’s an adaptation for Autolist, an old Wikidata tool) ((I should try to do that one in SPARQL too!)):

(link to the autolist query) and hoped it would work. ((Spoiler: no. I still don’t know who “H Bentley” is. They very probably don’t have a Wikidata entry yet.)) I could also be really dedicated, search for just “Bentley” and quickly read all the entries… Not as easy as I hoped at first, then.
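For what it’s worth, the SPARQL version promised in the footnote could look something like this sketch; narrowing to actors (Q33999) is my own assumption here, not necessarily what the Autolist query did:

```sparql
# Items with occupation actor (Q33999) whose English label contains "bentley"
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P106 wd:Q33999 ;
        rdfs:label ?label .
  FILTER(LANG(?label) = "en")
  FILTER(CONTAINS(LCASE(?label), "bentley"))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```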

Typos and database errors

Moreover, the RADA database isn’t immune to typographical errors: I’m reasonably certain that Joan Mibourrrne doesn’t really have three Rs in her last name or Dorothy Reeeve three Es.

Desmond Llewellyn, ((Because Whishaw isn’t even the first RADA student to have played Q in James Bond.)) for example, is listed in the RADA database as Desmond Wilkinson (Wikipedia says his full name is “Desmond Wilkinson Llewellyn”). In fact, that’s not entirely true: he is listed both as “Desmond Llewellyn” (here) and as “Desmond Wilkinson”. Yay, duplicate entries! ((And there goes any hope of knowing what percentage of RADA alumni have a Wikidata entry. Sigh.))

Desmond Llewellyn in 1983 – Towpilot CC-BY-SA 3.0

Actually there are many duplicates in the RADA database. I find it far-fetched that there were really two different students called “Alison James” and “Allison James” who both graduated in 1954…


Even without typographical errors, if we find a match between a name in the RADA database and a name on Wikidata, it needs verification. The Rose Hersee who graduated in 1908 isn’t the same Rose Hersee as the singer born in 1845. ((Which the RADA kindly confirmed on Twitter. I love them!)) Verification is really necessary! In many cases that meant I had to read the Wikipedia article (which sometimes cites the RADA! Sometimes even with references!) and, most importantly, the sources used in these articles (honestly, for the first half of the 20th century, it meant reading dozens of obituaries). Sometimes—yay!—I could confirm the match. Sometimes—yay too!—I could confirm that it wasn’t the same person. But often I didn’t succeed with just a short search, because the RADA profiles before 1999 are, let’s say, a little bare.

Several students can have the same name, and some people followed several courses (particularly in postgraduate technical studies). On Wikidata, many items share the same label (well, what would you expect from a name like “John Jones”?…), so it is often necessary to filter hundreds of results to find the most probable person (and I sincerely thank every Wikimedian who ever completed Wikidata descriptions). ((Descriptions are good! Descriptions are great! We love descriptions!))


They have pseudonyms! Aaaaahhh! An impressive number of women attained celebrity under their spouse’s name, and nobody thought of adding their birth name as an alias on Wikidata. And of course, their RADA entry lists only their original name. Another impressive number of students used pseudonyms (Conrad Havord became known as “Conrad Phillips”, for example). Sometimes it’s even the opposite: the RADA lists the pseudonym they used when they were in the school, or their married name if they were married, or their nickname, and Wikipedia still uses the birth name (for example, June Flewett is listed in the RADA database as Jill Freud, her nickname and husband’s family name). I also very much like Priya Rajvansh, listed at RADA as Vera Singh. Each of these cases can only be identified if someone thought of adding the aliases on Wikidata. ((Aliases are good! Aliases are great! We love aliases too!)) And sometimes we even get a combo: pseudonym and typographical error! We can cite Kay Hammond (pseudonym), whose birth name is “Dorothy Katherine Standing” but who is listed on the RADA website as “Kathrine Standing”. The missing “e” is enough for her not to be returned by a query or a search on Wikidata. Finding her was not easy at all, and it was more luck than anything else.

Is Jean Rhys, born “Ella Gwendolen Rees Williams” in 1890 and known for using numerous pseudonyms, the same person as Ella Reeve, the RADA student who graduated in 1909? ((I really think so but I haven’t found a reference to back this claim yet. Some cases are much more difficult.)) Vern Agopsowicz became famous under the name John Vernon… I could go on like that for a long time. I passed a hundred “maybe it’s them/maybe not” cases early in April.

Henry Darrow and John Vernon – NBC Television, public domain in the USA

Arkanosis helps me!

By then (late March 2016), several Wikimedians had already helped me, most notably with my Internet Broadway Database work, ((Ahah, had you forgotten?)) but one evening in Cléry ((Wikimedia France has a welcoming space for Wikimedians on rue de Cléry in Paris, and we can be found there regularly.)) Arkanosis saw me manually searching the RADA entries and took pity on me. He wrote me a beautiful Linux shell script (later amended by Ash_Crow to become even easier to use):

echo "" > list-$profile-$year.html
wget -q ''$profile'&yr-acting='$year'&yr-technicaltheatrearts='$year'&crs-technicaltheatrearts=&yr-theatrelab='$year'&yr-directing='$year'&crs-directing=&fn=&sn=' -O - | \
sed -n 's@.*fn=\([^&]*\).*sn=\([^"&]*\).*@\1 \2@p' | \
while read firstname lastname; do
    echo "$firstname $lastname wikidata"
    wget -q ''$firstname'+'$lastname -O - | \
    sed -n 's@.*title&.*\(Q[0-9]\+\)&.*@\1@p' | \
    while read qid; do
        if grep -q $qid unhandled.lst; then
            echo " $qid"
        fi
    done
done >> list-$profile-$year.html
echo "" >> list-$profile-$year.html

The RADA URLs are systematically constructed like this: year/given name/surname. ((Thank you, RADA technicians, for having chosen to do that consistently!)) Arkanosis simply extracted listings by year, one row per student, like this:

  • Student’s name (link to the RADA entry) / Wikidata (link to the search page with the name) / possibly a Qid ((Wikidatians call the Wikidata identifier of an item a “Qid” because the identifier always begins with “Q”.)) found via the second link and which also appears in the existing list of P69:Q523926 (entries already marked as RADA students)

For example a row for a student of the “acting” course in 1947 looks like:
harold goodwin wikidata Q1585750

Not all rows have an associated Qid (they were a tiny minority, honestly, as by then only ~650 students were listed and the RADA has had far more than 650 students). Not all Qids led to correct matches either: as I said, some people share the same name at the RADA; or the Wikidata search engine was, for once, too generous and yielded combinations of given names/surnames not matching the RADA entry (for example, a search for Romany Evens offers George Bramwell Evens on Wikidata). Nevertheless, the majority of the suggested Qids led to matches, a far better result than for the rows without one. Thank you Arkanosis and Ash_Crow!

Even with these listings, having only to click on the search links instead of doing dozens of copy/pastes, I still needed to verify each and every entry manually. ((In the end, the RADA has had too many alumni. Not optimistic any more.)) The problem with using the names from the URLs is the lack of apostrophes and spaces: a search for peter otoole on Wikidata doesn’t yield Peter O’Toole, for example. So you still need to re-add the spaces, not just click and read the results.

From RADA to Wikipedia: a temporary conclusion

I spent the end of March, April and early May doing this work. At the end of it, I had identified exactly 835 entries but, of course, the vast majority of alumni didn’t have matches (which was to be expected) and a strangely high number yielded only uncertain results. I have 442 rows in a spreadsheet, each with a RADA entry and a possibly matching Wikidata entry. I’ll need to dig deeper to confirm (or not) the matches.

Digging deeper – Hans Hillewaert CC-BY-SA 4.0

Identification: from Wikipedia to the RADA

When I finished identifying alumni from the RADA database, I had a problem: there were people listed in the Wikipedia category “Alumni of the RADA” who weren’t on my done list on Wikidata. In a perfect world, at the end of my scripts’ work, the number of Wikidata entries with “studied at: RADA” and the number of Wikidata entries with “studied at: RADA, end time: something” and a RADA reference (the query for that) should have been the same. As it isn’t a perfect world, I had people that Wikipedia listed as alumni whom I didn’t find on the RADA website. There was some overlap with my “maybe yes/maybe no” list ((Which supports the “maybe yes” hypothesis.)) but not that much: my list is mostly composed of people whose drama school I don’t know at all, if they even went to one.

Using PetScan I searched for the list of Wikipedia articles categorized as RADA alumni but which didn’t respond to the query “studied at RADA with an end time”. Link to the automatically updated PetScan query.

I found 132 results, which I—again—treated manually. I identified 23 additional articles (mostly cases of pseudonyms or maiden names not present as aliases on Wikidata, so they weren’t returned in searches). ((Haven’t I already said how much I love aliases? It bears repeating.))

At the end of April, the English category listed 907 articles, Wikidata 953 entries, and only 850 of them had been correctly completed with a decent reference. And we mustn’t forget that not all Wikidata entries have a matching English Wikipedia article: some actresses and actors have articles in other languages (Norwegian, Italian, German, Romanian…) and about a dozen don’t have a Wikipedia article at all, only a Wikidata entry without sitelinks, created so Wikidata could list the full cast of a film.

So we write a query to find the Wikidata entries of RADA alumni without an end date:

link to the query.
Usually SPARQL is pretty understandable by humans because it is made for querying semantic data. However, Wikidata is a multilingual database, which consequently uses numeric identifiers. ((Which are strangely less understandable if you don’t happen to have memorized that “Q523926” means “Royal Academy of Dramatic Art”.)) I should comment all my queries, but I’m lazy and I take advantage of the Wikidata endpoint, which declares the needed PREFIXes itself and even offers tooltips: if you hover over a Pid or Qid, you’ll see its name and description in your language. And you can change this language in the top right corner. So I’m entitled to laziness. ((So there.))
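As a sketch, a query for RADA alumni lacking an end date can be written with a FILTER NOT EXISTS on the qualifier:

```sparql
# RADA alumni whose "educated at" statement has no end date (P582) qualifier
SELECT ?student ?studentLabel WHERE {
  ?student p:P69 ?statement .
  ?statement ps:P69 wd:Q523926 .
  FILTER NOT EXISTS { ?statement pq:P582 ?end . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```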


This list contains mostly entries with sitelinks to the English Wikipedia: the SPARQL query above (on Wikidata but without an end date for RADA studies) yielded 112 results at the end of April, when the PetScan query (in the English category of RADA alumni but without an end time on Wikidata) gave us 110 results. One of the two extras is an article deleted from the English Wikipedia after someone imported the category to Wikidata, and the other is about a French actress. So all 111 of these “maybe errors, maybe not, but in any case lacking references” on Wikidata came from the English Wikipedia. I SEE YOU, ENWIKI!

The work now is to find under which name the person was registered at the RADA (beware of typographical errors…) or to find out why they were categorized as students when they weren’t. For example, Ash_Crow corrected the article on George Bernard Shaw, listed as a student instead of “people associated with” the RADA. He was very involved with the school and even left it part of his estate ((Which was very kind of him.)) but never studied there. As for Armaan Kirmani, his IMDB entry says he was the student of a RADA professor… but that doesn’t mean he went to RADA itself.

George Bernard Shaw in 1915 – Public domain in the USA.

In these dozens of problematic cases, there is a little bit of everything, from articles that don’t mention the RADA at all (why were they ever categorized?), to articles that clearly state that the person was a student (but without any sources), ((Like Margaret Rutherford: someone summarily added the information in 2008, the page was categorized in 2010…)) to articles that do have sources, but sources that aren’t so explicit… The RADA doesn’t offer only degree courses; it also organizes numerous workshops and internships. If an actor or an actress took part in a two-day workshop at RADA, they won’t appear on the RADA website as a student, but they could sincerely say in interviews that they learned something at RADA… We are only a step away from an enthusiastic Wikipedian deciding they are alumni.

For example, Ash_Crow found a source (in French, and of not really great quality) saying that Émilie Rault studied at RADA. She is nowhere in the database, because it’s very likely she only did workshops there, as she was also studying musicology at the Sorbonne for her master’s degree at the time. This should lead us to question the limits we want to set for the “studied at” property on Wikidata: do we want to use it exclusively for long degree courses, or accept everything, including workshops of only a few days?

Differences between the list and the category

As I already said, the “List of RADA alumni” doesn’t match the articles listed in the category. Systematically, every time I identified someone on Wikidata (and subsequently found their Wikipedia article), I added their name to the list and added the category to the article. So I’ve reduced the gap between the two. The list article should be more complete than the category, since it can hold red links existing on other Wikipedias.

Xavier Combelle was kind enough to list the differences between the category and the list in early May: the thirty problematic cases mentioned above remained (missing from the list) and, in the list, in addition to the usual red links, we found eighteen uncategorised articles. None of them bore any obvious connection to RADA, except for Xenia Kalogeropoulou, who could be identified as Xenia Calogeropoulos and was thus categorised. Among those cases, some Wikipedia articles explicitly described the training at RADA as consisting of workshops or internships. Which brings us back to the question: what courses warrant considering someone an alumnus?

Problems with the RADA database

Having listed the issues on the Wikipedia and Wikidata sides (which amount to: “people add information without references and that information spreads everywhere like an epidemic”), we have to face the fact that some of the problems stem from the RADA database itself.

Completeness of data

As we have already seen, the database is littered with duplicate entries, each pseudonym or name spelling yielding a new page instead of centralising the information on a unique page associated with the student. This is obviously a problem if you are interested in the number of students for a given year, for instance.

From a Wikidata point of view, this prevents us from resorting to the simple solution of creating one entry for each student, regardless of whether a Wikipedia article exists or not. The Cambridge database, for example, associates a fixed identifier with every student, which enabled us to import these identifiers into Wikidata, creating new entries as needed (P1599: ID of the Cambridge Alumni Database). ((Mix n’ Match can tag an identifier as needing a Wikidata entry to be created.)) If the RADA had chosen the solution of one identifier per student instead of a URL of the form diploma/year/first name/last name, it would have been easier to import it in its entirety.

Which brings us to the next problem: we have no certainty that the database is complete at all. Nothing on the site says so. A visit to the Internet Archive’s Wayback Machine shows that the database has only been online since 2015, and that before that date only the current students had a profile on the site. While recent data seem complete (from 1999 on, profiles are detailed and come with photographs), the profiles of the earlier years are sometimes quite patchy. In particular, some years seem suspiciously poor in students, such as 1988 and every year before 1922. ((With the exception of 1908 and 1909, all other years have only a single student listed.))

Could it be that, among the dozens of RADA alumni without a match in the database, some have been forgotten? One typical case is that of Noel Streatfeild who, according to her website, attended as a student starting in 1919. I did find a “Noel Goodwin” who graduated in 1922, but is that her?

An even more explicit example is Dora Mavor Moore, who was the first Canadian to go to RADA, per this biographical article, and who graduated in 1912. The problem is, on the RADA website only one student is listed as graduating in 1912, and “Leonard Notcutt” isn’t a known pseudonym of Dora Mavor Moore.

Data reliability

A more serious problem is that some alumni listed in the RADA database left the RADA before graduation. Someone like Harold Pinter has a RADA profile which says he was part of the 1949 class. In fact, Pinter went to RADA in 1948 and left the course in 1949, before graduating. Does the RADA list every student, whether or not they actually graduated? On Wikidata, we can use the property “diploma” with “no value” instead of the actual diploma in the qualifiers of the “studied at” property.

Wikidata screenshot of the “studied at” statement on Harold Pinter’s entry

It’s a little bit problematic if we can’t trust the official school website to know who graduated there…

I have another problem with the RADA entry of Sheila Terry, whom I think I can match to the Wikipedia article Sheila Terry. It’s very likely she didn’t go to London during her studies; according to Wikipedia, she went to the Dickson-Kenwin academy, “a school affiliated with London’s Royal Academy”. Does that mean the Dickson-Kenwin academy was then delivering RADA diplomas? (Before the 2000 reform, the RADA delivered its own diplomas.) I lack information.

I also have a Jack May of the 1943 class whose Wikipedia article states explicitly that he was admitted to RADA and never went.

It’s never that easy, even once the matching is done!

What am I doing now?

I still do many other things on Wikidata. This article summarizes some of my work, but far from all of it. To stay somewhat on topic, I’ll only speak here of what I do in relation to the RADA and theatre in general.

For example, people rightly pointed out that Wikidata has a property to indicate the birth name of a person, which should always be present (but isn’t in reality) and is useful in particular in the case of women known under their married name. So I’m working on adding these birth-names-in-a-property as aliases to facilitate future identification. It’s a lot of fun with little scripts, SPARQL and a healthy use of QuickStatements, a tool made to facilitate bulk edits on Wikidata.
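As an illustration (restricted to RADA alumni to keep it small), a query like this sketch lists the recorded birth names (P1477), whose results can then be turned into QuickStatements alias commands:

```sparql
# People educated at RADA who have a birth name (P1477) recorded
SELECT ?person ?personLabel ?birthName WHERE {
  ?person wdt:P69 wd:Q523926 ;
          wdt:P1477 ?birthName .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```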

I’m also still working on Mix n’ Match to add the correct IBDB identifiers to Wikidata entries about people and works. You can help me, as I already said above. And it’s not just for the pleasure of having identifiers: once we have enough of them, we will be able to add a lot of information about Broadway productions to Wikidata. And that will be fantastic!

I started adding data about theatrical awards too, which is long and somewhat repetitive, but immediately useful. The English Wikipedia mostly already has articles about the most important awards, but many smaller Wikipedias don’t. I’m working on a Lua module to be able to generate a Wikinews article based on Wikidata data: ((But frankly, my Lua isn’t good enough right now.)) in practice, that will mean watching the Tony Awards ceremony, adding the data to Wikidata and, immediately after the end, being able to have a complete table with links and everything just by using a template. ((Tpt created a similar Lua module for me when I worked on sled dog races. You can see it in use in this French Wikinews article: the table is dynamic and shows the content of the race’s Wikidata entry.)) And that in dozens of languages. Great, no?

I still have to reduce my two lists of RADA students:

  1. one with people categorized as alumni whom I didn’t find on the RADA website (errors? workshops? missing entries?): ~112
  2. one with Wikidata entries I think match a RADA student but without definite proof: ~400

Solving these two lists should help me reduce the gap between the English category and the English list. And by the way, I’m very proud of my French list of RADA alumni, which has names, dates, courses, diplomas, nationalities and even some pictures!

I wrote to the RADA archivist in June, at least to inform him of the typographical errors found in their database, but he hasn’t written back yet. Which can probably be explained by the fact that the RADA archives are moving to a new building this summer. They are probably pretty busy!

And of course, from a purely Wikidatian point of view, I officially launched the WikiProject:Theatre this week. That’s for every Wikidatian, new or confirmed, who wants to join me in my mad quest.

Curious and fun queries and statistics

Everything being said, we still have an interesting sample of ~850 entries. It’s only a small percentage of all RADA alumni (and the technical courses are vastly under-represented) but it’s enough to start having fun with SPARQL queries. We can ask pretty much anything!

If you want to see the results of the queries, click on the links, then on “Run”, and a few seconds later you will be able to explore the answers yourself!

Number of RADA students with a Wikidata entry by year

Well, to start easy: maybe we don’t want the list of RADA alumni, but only the number of them with an entry, by year of graduation:

Query link.
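In case the link dies, here is a sketch of such a counting query:

```sparql
# Number of RADA alumni with a Wikidata entry, per year of graduation
SELECT ?year (COUNT(DISTINCT ?student) AS ?count) WHERE {
  ?student p:P69 ?statement .
  ?statement ps:P69 wd:Q523926 ;
             pq:P582 ?end .
  BIND(YEAR(?end) AS ?year)
}
GROUP BY ?year
ORDER BY ?year
```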

Screenshot of the SPARQL query

We can then do this beautiful graph:

RADA alumni with a Wikidata entry by year of graduation

Average age at graduation

Maybe we can go further. Now that we know when they graduated, can we know at what age they did it? Of course, this means our sample will be reduced to the entries with a birth date.

Query link
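A sketch of the idea, combining the graduation date with the date of birth (P569):

```sparql
# Age at graduation, for alumni with a recorded date of birth (P569)
SELECT ?student ?studentLabel ?age WHERE {
  ?student p:P69 ?statement ;
           wdt:P569 ?birth .
  ?statement ps:P69 wd:Q523926 ;
             pq:P582 ?end .
  BIND(YEAR(?end) - YEAR(?birth) AS ?age)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```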

Screenshot of the SPARQL query

Or even something more fun: the average age at graduation (with integer values only, this time), by year and by gender (only “male” and “female” in our sample, but the query could handle others) and, to give an appearance of seriousness, the number of people in the sample:

Query link. I should really do an age pyramid, but I’m suffering a fit of laziness. ((Yes, again.))
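A sketch of the grouped version, using the gender property (P21) and an explicit label-service binding so the grouping works:

```sparql
# Average age at graduation by year and gender (P21), with sample size
SELECT ?year ?genderLabel (AVG(?age) AS ?averageAge) (COUNT(?student) AS ?sample) WHERE {
  ?student p:P69 ?statement ;
           wdt:P569 ?birth ;
           wdt:P21 ?gender .
  ?statement ps:P69 wd:Q523926 ;
             pq:P582 ?end .
  BIND(YEAR(?end) AS ?year)
  BIND(?year - YEAR(?birth) AS ?age)
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
    ?gender rdfs:label ?genderLabel .
  }
}
GROUP BY ?year ?genderLabel
ORDER BY ?year
```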

Screenshot of the SPARQL query

Timeline of graduates

That was fun but I want something more human-readable, like a timeline with pictures!

Query link. You just needed to ask! Beware that this query is heavy and can slow down your browser.
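The trick is the #defaultView:Timeline directive of the query service; a sketch, with an optional image (P18) for the pictures:

```sparql
#defaultView:Timeline
SELECT ?student ?studentLabel ?end ?image WHERE {
  ?student p:P69 ?statement .
  ?statement ps:P69 wd:Q523926 ;
             pq:P582 ?end .
  OPTIONAL { ?student wdt:P18 ?image . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```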

Screenshot of the SPARQL query – timeline

How many nationalities were represented at RADA?

We can do a query to list all the nationalities, and for each the number of students involved, in decreasing order.

Query link. Surprisingly, ((Or not.)) the most frequent is… British. But hey! More than thirty nationalities!
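A sketch of such a query, grouping on the country of citizenship (P27):

```sparql
# Nationalities (P27) of RADA alumni, by decreasing number of students
SELECT ?countryLabel (COUNT(DISTINCT ?student) AS ?count) WHERE {
  ?student wdt:P69 wd:Q523926 ;
           wdt:P27 ?country .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
    ?country rdfs:label ?countryLabel .
  }
}
GROUP BY ?countryLabel
ORDER BY DESC(?count)
```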

Screenshot of the SPARQL query

If we just add, as the first line of the query, #defaultView:BubbleChart, we obtain the results as a bubble chart. ((No surprise here.)) It’s explicit:

Map of birth places of RADA students

I don’t really care what nationalities the alumni are… but I would love to see a map of birthplaces! And I can do that directly in SPARQL!

Query link.
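A sketch of the map query, joining the place of birth (P19) to its coordinates (P625) and using the #defaultView:Map directive:

```sparql
#defaultView:Map
SELECT ?student ?studentLabel ?coordinates WHERE {
  ?student wdt:P69 wd:Q523926 ;
           wdt:P19 ?place .
  ?place wdt:P625 ?coordinates .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```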

Map of RADA alumni birth places

If you run the query, you can zoom!

Layered map of birth places of RADA students by graduation date

Just a simple map? Why not make a layered one? We could ask for a map showing the birth places of RADA graduates, with one layer per decade of graduation!

Query link. Run it and play with the layers!

Screenshot of the query
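The map view of the Query Service groups points by a ?layer variable; a sketch computing one layer per decade (same assumed ids as above):

```sparql
#defaultView:Map
# One map layer per decade of graduation; the map view groups by ?layer.
SELECT ?person ?personLabel ?coord ?layer WHERE {
  ?person p:P69 ?st .
  ?st ps:P69 wd:Q523926 ;
      pq:P582 ?end .
  ?person wdt:P19 ?place .
  ?place wdt:P625 ?coord .
  BIND(CONCAT(STR(xsd:integer(FLOOR(YEAR(?end) / 10)) * 10), "s") AS ?layer)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```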

Number of RADA alumni cast in a James Bond film

Do you remember that Whishaw and Llewelyn both played Q? So exactly how many RADA students have played in a James Bond film?

Query link.
More than forty!
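A hedged sketch, counting alumni who are a cast member (P161) of a film that is part of the series (P179); note that I’m only assuming the item id for the James Bond film series here, so double-check it on Wikidata:

```sparql
# RADA alumni cast (P161) in a film of the James Bond series (P179).
# The series item id (wd:Q2484680) is an assumption — verify it on Wikidata.
SELECT (COUNT(DISTINCT ?person) AS ?count) WHERE {
  ?film wdt:P179 wd:Q2484680 ;
        wdt:P161 ?person .
  ?person wdt:P69 wd:Q523926 .
}
```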


RADA alumni by James Bond films, ordered by date

Hsarrazin liked my James Bond query but she wanted more: she wanted to know which former student was cast in which James Bond film, with the results ordered by the films’ publication dates. Of course, this means that people who worked on several films are listed several times.

Query link.

Screenshot of the query results
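A sketch, adding the “publication date” (P577) and sorting by it (the series item id is still an assumption):

```sparql
# Alumnus/film pairs, ordered by the film's publication date (P577).
# The series item id (wd:Q2484680) is an assumption — verify it on Wikidata.
SELECT ?filmLabel ?date ?personLabel WHERE {
  ?film wdt:P179 wd:Q2484680 ;
        wdt:P577 ?date ;
        wdt:P161 ?person .
  ?person wdt:P69 wd:Q523926 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?date ?personLabel
```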

RADA alumni by James Bond films, in a graph

I don’t actually know why Hsarrazin wanted a table, when we could have all RADA alumni who played in a James Bond film as a graph:

Query link. Beware, this query is heavy, even more than the timeline one. ((Am I the only weird one out there who likes to poke the graph and watch it rearrange itself all around my screen?))

Screenshot of the graph generated by the query
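The graph view only needs the rows to contain the linked items; a sketch (same assumed ids as above):

```sparql
#defaultView:Graph
# Films and alumni as nodes; each result row is a film–person edge.
SELECT ?film ?filmLabel ?person ?personLabel WHERE {
  ?film wdt:P179 wd:Q2484680 ;
        wdt:P161 ?person .
  ?person wdt:P69 wd:Q523926 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```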

And in all films?

Well, James Bond is great, but why limit ourselves to it? Can’t we just have all films on Wikidata with more than 5 actors or actresses listed in the cast, ordered by the number of them who studied at RADA?

Query link.
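A sketch using a subquery to count each film’s listed cast first (film = wd:Q11424; films with no RADA alumni drop out of the join). This one scans all films, so it may time out:

```sparql
# Films (P31 = Q11424) with more than 5 listed cast members (P161),
# ordered by how many of those studied at RADA. Heavy — may time out.
SELECT ?film ?filmLabel ?castSize (COUNT(DISTINCT ?radaActor) AS ?fromRada) WHERE {
  {
    SELECT ?film (COUNT(DISTINCT ?actor) AS ?castSize) WHERE {
      ?film wdt:P31 wd:Q11424 ;
            wdt:P161 ?actor .
    }
    GROUP BY ?film
    HAVING (COUNT(DISTINCT ?actor) > 5)
  }
  ?film wdt:P161 ?radaActor .
  ?radaActor wdt:P69 wd:Q523926 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
GROUP BY ?film ?filmLabel ?castSize
ORDER BY DESC(?fromRada)
```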


Films by proportion of actors and actresses who studied at RADA

That’s fun but I want even more fun: five RADA-trained actors in a cast of eight is not the same thing as five in a cast of one hundred. I want all films with at least five people in the cast, ordered by the proportion of RADA students!

Query link. Honestly, the highest-ranked results are probably films with an incomplete cast list, but I love this query anyway.
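A sketch of the proportion version, counting both the full cast and the RADA part in the subquery (same caveats and assumed ids as the previous query):

```sparql
# Films with at least 5 listed cast members, ordered by the share of
# RADA alumni in the cast. Heavy — may time out.
SELECT ?filmLabel ?castSize ?fromRada ?rate WHERE {
  {
    SELECT ?film
           (COUNT(DISTINCT ?actor) AS ?castSize)
           (COUNT(DISTINCT ?radaActor) AS ?fromRada) WHERE {
      ?film wdt:P31 wd:Q11424 ;
            wdt:P161 ?actor ;
            wdt:P161 ?radaActor .
      ?radaActor wdt:P69 wd:Q523926 .
    }
    GROUP BY ?film
    HAVING (COUNT(DISTINCT ?actor) >= 5)
  }
  BIND(?fromRada / ?castSize AS ?rate)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY DESC(?rate)
```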


RADA alumni who worked on Broadway

Remember when we worked on the Broadway database? How many RADA students ever worked on Broadway? (We consider that “working on Broadway” means “having an Internet Broadway Database identifier”.) Well, at least…

Query link. 276 listed mid-August 2016!

Screenshot of the query results
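A sketch; I believe the property is P1220 (“Internet Broadway Database person ID”), and RADA = wd:Q523926 is still an assumption:

```sparql
# RADA alumni with an Internet Broadway Database person ID (P1220).
SELECT (COUNT(DISTINCT ?person) AS ?count) WHERE {
  ?person wdt:P69 wd:Q523926 ;
          wdt:P1220 [] .
}
```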

RADA alumni with a Tony award win or nomination

So… if RADA alumni worked on Broadway… how many of them were nominated for or won a Tony Award?

Query link. More than one hundred!

Screenshot of the query results
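A hedged sketch, combining “award received” (P166) and “nominated for” (P1411); I’m assuming the Tony Award item is wd:Q191874 and that the individual award categories can be reached through instance-of/subclass-of paths:

```sparql
# Alumni who won (P166) or were nominated for (P1411) a Tony Award.
# Assumptions: Tony Award = wd:Q191874; specific awards reached via
# instance-of (P31) / subclass-of (P279) paths.
SELECT DISTINCT ?person ?personLabel WHERE {
  ?person wdt:P69 wd:Q523926 .
  { ?person wdt:P166 ?award . } UNION { ?person wdt:P1411 ?award . }
  ?award wdt:P31?/wdt:P279* wd:Q191874 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```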

And by the way, we should verify that all people nominated for or awarded a Tony Award have an IBDB identifier! (If all is right in the world, this query should return “No matching records found”.)

Query link.
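A sketch of the consistency check, under the same assumed ids as the previous query: take every Tony winner or nominee and keep only those without a P1220 identifier.

```sparql
# Tony winners or nominees without an IBDB person ID (P1220).
# Ideally this returns no rows. Same assumed ids as above.
SELECT DISTINCT ?person ?personLabel WHERE {
  { ?person wdt:P166 ?award . } UNION { ?person wdt:P1411 ?award . }
  ?award wdt:P31?/wdt:P279* wd:Q191874 .
  FILTER NOT EXISTS { ?person wdt:P1220 [] . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```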

All Tony Awards!

Hey! Can we list all people nominated for or awarded a Tony, by win/nomination, by award, and by year? Like, everyone ever? Well yes, of course, it’s SPARQL!

Query link.

Screenshot of the query results
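A sketch, going through the full statements this time so we can grab the “point in time” (P585) qualifier for the year (same assumed Tony Award id as above):

```sparql
# Everyone ever nominated for or awarded a Tony, with award, year and result.
SELECT ?personLabel ?awardLabel ?year ?result WHERE {
  {
    ?person p:P166 ?st .
    ?st ps:P166 ?award .
    BIND("win" AS ?result)
  } UNION {
    ?person p:P1411 ?st .
    ?st ps:P1411 ?award .
    BIND("nomination" AS ?result)
  }
  ?award wdt:P31?/wdt:P279* wd:Q191874 .
  OPTIONAL { ?st pq:P585 ?time . BIND(YEAR(?time) AS ?year) }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?year ?awardLabel ?personLabel
```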


  1. I’m not done;
  2. I hope the RADA archivist will be as kind as he seems;
  3. People, seriously, you should add aliases to Wikidata items;
  4. And sources. Sources are great;
  5. And you should also take photographs of Ben Whishaw; we are clearly lacking free pictures of Whishaw;
  6. Isn’t SPARQL a lot of fun? Whatever your question, someone can ask Wikidata for the answer! ((And I can write footnotes, which are clearly the other fun part of this article, hands down.))

(Main picture: Pediment of the RADA building in Gower Street, by Chemical Engineer, CC BY-SA 3.0)