Sunday Query: The 200 Oldest Living French Actresses

Hello, here’s Harmonia Amanda again (I think Ash_Crow just agreed to let me squat his blog indefinitely). This article shouldn’t be too long, for once. [1]Well, nothing to compare to my previous one, which all of you should have read. It started as an hashtag on twitter #SundayQuery, where I wrote SPARQL queries for people who couldn’t find how to ask their question. So this article, and the next ones if I keep this up, are basically a kind of “how to translate a question in SPARQL step by step” tutorial.

The question this week was asked by Jean-No: who are the oldest French actresses still alive?

To begin: the endpoint and PREFIX…

SPARQL is the language you use to query a semantic database. For that you can use an endpoint, which are services that accept SPARQL queries and return results. There are many SPARQL endpoints around, but we will be using the Wikidata endpoint.

A semantic base is constituted of triples of information: subject, predicate, object (in Wikidata, we usually call that item, property, value, but it’s the exact same thing). As querying with full URIs all the time would be really tedious, SPARQL needs PREFIX.

These are the standard prefixes used to query Wikidata. If used, they should be stated at the top of the query. You can add other prefixes as needed to. Actually, the Wikidata endpoint has been created to specifically query Wikidata, so all these prefixes are already declared, even if you don’t see them. But it’s not because they aren’t visible that they aren’t there, so always remember: SPARQL NEEDS PREFIX. There.

All French women

The first thing to declare after the prefixes, is what we are asking to have for results. We can use the command SELECT or the command SELECT DISTINCT (this one ask to clear off duplicates) and then listing the variables.

As we are looking for humans, we’ll call this variable “person” (but we could choose anything we want).

We will then define the conditions we want our variable to respond to, with WHERE. We are seeking the oldest French actresses still alive. We need to cut that in little understandable bits. So the first step is, we are querying for human beings (and not fictional actresses). Humans beings are all items which have the property “instance of” (P31) with the value “human” (Q5).

So:

This query will return all humans in the database. [2]Well, if it doesn’t timeout through the sheer number. But we don’t want all humans, we want only the female ones. So we want humans who also answer to “sex or gender” (P21) with the value “female” (Q6581072).

And we want them French as well! So with “country of citizenship” (P27) with the value “France” (Q142).

Here it is, all French women. We could also write the query this way:

It’s exactly the same query.

All French actresses

Well that seems like a good beginning, but we don’t want all women, we only want actresses. It could be as simple as “occupation” (P106) “actor” (Q33999) but it’s not. In reality Q33999 doesn’t cover all actors and actresses: it’s a class. Subclasses, like “stage actor”, “television actor”, “film actor” have all “subclass of” (P279) “actor” (Q33999). We don’t only want the humans with occupation:actor; we also want those with occupation:stage actor and so on.

So we need to introduce another variable, which I’ll call ?occupation.

Actually, if we really wanted to avoid the ?occupation variable, we could have written:

I’s the same thing, but I found it less clear for beginners.

Still alive

So now we need to filter by age. The first step is to ensure they have a birth date (P569). As we don’t care (for now) what this birth date is, we’ll introduce a new variable instead of setting a value.

Now we want the French actresses still alive, which we’ll translate to “without a ‘date of death’ (P570)”. For that, we’ll use a filter:

Here it is, all French actresses still alive!

The Oldest

To find the oldest, we’ll need to order our results. So after the WHERE section of our query, we’ll add another request: now that you have found me my results, SPARQL, can you order them as I want? Logically, this is called ORDER BY. We can “order by” the variable we want. We could order by ?person, but as it’s the only variable we have selected in our query, it’s already done. To order by a variable, we need to ask to select this variable first.

The obvious way would be to order by ?birthDate. The thing is, Wikidata sometimes has more than one birth date for people because of conflicting sources, which translates in the same people appearing twice. So, we’ll group the people by their Qid (using GROUP BY) si that duplicates exist now as a group. Then we use SAMPLE (in our SELECT) to take only the groups’ first birth date that we find… And then ORDER BY it :

This gives us all French actresses ordered by date of birth, so the oldest first. We can already see that we have problems with the data: actresses whom we don’t know the date of birth (with “unknown value”) and actresses manifestly dead centuries ago but whom we don’t know the date of death. But still! Here is our query and we answered Jean-No’s question as we could.

With labels

Well, we certainly answered the query but this big list of Q and numbers isn’t very human-readable. We should ask to have the labels too! We could do this the proper SPARQL way, with using RDFS and such, but we are querying Wikidata on the Wikidata endpoint, so we’ll use the local tool instead.

We add to the query:

which means we ask to have the labels in French (as we are querying French people), and if it doesn’t exist in French, then in English. But just adding that doesn’t work: we need to SELECT it too! (and to add it to the GROUP BY). So:

Tada! It’s much more understandable now!

Make it pretty

I don’t want all living French actresses, including the children, I want the oldest; so I add a LIMIT to the results. [3]And it will be quicker that way.

And why should I have a table with identifiers and names when I could have the results as a timeline? We ask that at the top, even before the SELECT: [4]Beware that this doesn’t work on all SPARQL endpoints out there, even if it exists on the Wikidata one.

Wait, wait! Can I have pictures too? But only if they have a picture, I still want the results if they don’t have one (and I want only one picture if they happen to have several so I use again the SAMPLE). YES WE CAN:

Here the link to the results. I hope you had fun with this SundayQuery!

(Main picture: Mademoiselle Ambroisine – created between 1832 and 1836 by an artist whose name is unreadable – Public Domain.)

Notes   [ + ]

1. Well, nothing to compare to my previous one, which all of you should have read.
2. Well, if it doesn’t timeout through the sheer number.
3. And it will be quicker that way.
4. Beware that this doesn’t work on all SPARQL endpoints out there, even if it exists on the Wikidata one.

3 thoughts on “Sunday Query: The 200 Oldest Living French Actresses”

  1. The upper range for most people living is 125-130 years. This seems like a good way to filter birth date.

    Another would be to filter out “unknown” death dates.

    How might those queries look? (Feel free to use my Wikidata talk page to respond.)

Leave a Reply

Your email address will not be published. Required fields are marked *