Sunday Query: The 200 Oldest Living French Actresses
Hello, here’s Harmonia Amanda again (I think Ash_Crow just agreed to let me squat his blog indefinitely). This article shouldn’t be too long, for once.[1] It started as an hashtag on twitter #SundayQuery, where I wrote SPARQL queries for people who couldn’t find how to ask their question. So this article, and the next ones if I keep this up, are basically a kind of “how to translate a question in SPARQL step by step” tutorial.
The question this week was asked by Jean-No: who are the oldest French actresses still alive?
To begin: the endpoint and PREFIX…
SPARQL is the language you use to query a semantic database. For that you can use an endpoint, which are services that accept SPARQL queries and return results. There are many SPARQL endpoints around, but we will be using the Wikidata endpoint.
A semantic base is constituted of triples of information: subject, predicate, object (in Wikidata, we usually call that item, property, value, but it’s the exact same thing). As querying with full URIs all the time would be really tedious, SPARQL needs PREFIX
.
PREFIX wd: <http://www.wikidata.org/entity/> PREFIX wdt: <http://www.wikidata.org/prop/direct/> PREFIX wikibase: <http://wikiba.se/ontology#> PREFIX p: <http://www.wikidata.org/prop/> PREFIX ps: <http://www.wikidata.org/prop/statement/> PREFIX pq: <http://www.wikidata.org/prop/qualifier/> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX bd: <http://www.bigdata.com/rdf#>
These are the standard prefixes used to query Wikidata. If used, they should be stated at the top of the query. You can add other prefixes as needed to. Actually, the Wikidata endpoint has been created to specifically query Wikidata, so all these prefixes are already declared, even if you don’t see them. But it’s not because they aren’t visible that they aren’t there, so always remember: SPARQL NEEDS PREFIX. There.
All French women
The first thing to declare after the prefixes, is what we are asking to have for results. We can use the command SELECT
or the command SELECT DISTINCT
(this one ask to clear off duplicates) and then listing the variables.
As we are looking for humans, we’ll call this variable “person” (but we could choose anything we want).
SELECT ?person
We will then define the conditions we want our variable to respond to, with WHERE
. We are seeking the oldest French actresses still alive. We need to cut that in little understandable bits. So the first step is, we are querying for human beings (and not fictional actresses). Humans beings are all items which have the property “instance of” (P31) with the value “human” (Q5).
So:
SELECT ?person WHERE { ?person wdt:P31 wd:Q5 . }
This query will return all humans in the database.[2] But we don’t want all humans, we want only the female ones. So we want humans who also answer to “sex or gender” (P21) with the value “female” (Q6581072).
SELECT ?person WHERE { ?person wdt:P31 wd:Q5 . ?person wdt:P21 wd:Q6581072 . }
And we want them French as well! So with “country of citizenship” (P27) with the value “France” (Q142).
SELECT ?person WHERE { ?person wdt:P31 wd:Q5 . ?person wdt:P21 wd:Q6581072 . ?person wdt:P27 wd:Q142 . }
Here it is, all French women. We could also write the query this way:
SELECT ?person WHERE { ?person wdt:P31 wd:Q5 ; wdt:P21 wd:Q6581072 ; wdt:P27 wd:Q142 . }
It’s exactly the same query.
All French actresses
Well that seems like a good beginning, but we don’t want all women, we only want actresses. It could be as simple as “occupation” (P106) “actor” (Q33999) but it’s not. In reality Q33999 doesn’t cover all actors and actresses: it’s a class. Subclasses, like “stage actor”, “television actor”, “film actor” have all “subclass of” (P279) “actor” (Q33999). We don’t only want the humans with occupation:actor; we also want those with occupation:stage actor and so on.
So we need to introduce another variable, which I’ll call ?occupation
.
SELECT ?person WHERE { ?person wdt:P31 wd:Q5 . #All humans ?person wdt:P21 wd:Q6581072 . #Of female gender ?person wdt:P27 wd:Q142 . #With France as their citizenship country ?person wdt:P106 ?occupation . #With an occupation ?occupation wdt:P279* wd:Q33999 . #This occupation is or is a subclass off "actor" }
Actually, if we really wanted to avoid the ?occupation
variable, we could have written:
SELECT ?person WHERE { ?person wdt:P31 wd:Q5 ; wdt:P21 wd:Q6581072 ; wdt:P27 wd:Q142 ; wdt:P106/wdt:P279* wd:Q33999 . }
I’s the same thing, but I found it less clear for beginners.
Still alive
So now we need to filter by age. The first step is to ensure they have a birth date (P569). As we don’t care (for now) what this birth date is, we’ll introduce a new variable instead of setting a value.
SELECT ?person WHERE { ?person wdt:P31 wd:Q5 . ?person wdt:P21 wd:Q6581072 . ?person wdt:P27 wd:Q142 . ?person wdt:P106 ?occupation . ?occupation wdt:P279* wd:Q33999 . ?person wdt:P569 ?birthDate . #With a birth date }
Now we want the French actresses still alive, which we’ll translate to “without a ‘date of death’ (P570)”. For that, we’ll use a filter:
SELECT ?person WHERE { ?person wdt:P31 wd:Q5 . ?person wdt:P21 wd:Q6581072 . ?person wdt:P27 wd:Q142 . ?person wdt:P106 ?occupation . ?occupation wdt:P279* wd:Q33999 . ?person wdt:P569 ?birthDate . #With a birth date FILTER NOT EXISTS { ?person wdt:P570 ?deathDate . } #Without a death date }
Here it is, all French actresses still alive!
The Oldest
To find the oldest, we’ll need to order our results. So after the WHERE
section of our query, we’ll add another request: now that you have found me my results, SPARQL, can you order them as I want? Logically, this is called ORDER BY
. We can “order by” the variable we want. We could order by ?person
, but as it’s the only variable we have selected in our query, it’s already done. To order by a variable, we need to ask to select this variable first.
The obvious way would be to order by ?birthDate
. The thing is, Wikidata sometimes has more than one birth date for people because of conflicting sources, which translates in the same people appearing twice. So, we’ll group the people by their Qid (using GROUP BY
) si that duplicates exist now as a group. Then we use SAMPLE
(in our SELECT
) to take only the groups’ first birth date that we find… And then ORDER BY
it :
SELECT ?person (SAMPLE(?birthDate) AS ?date) WHERE { ?person wdt:P31 wd:Q5 . ?person wdt:P21 wd:Q6581072 . ?person wdt:P27 wd:Q142 . ?person wdt:P106 ?occupation . ?occupation wdt:P279* wd:Q33999 . ?person wdt:P569 ?birthDate . FILTER NOT EXISTS { ?person wdt:P570 ?deathDate . } } GROUP BY ?person ORDER BY ?date
This gives us all French actresses ordered by date of birth, so the oldest first. We can already see that we have problems with the data: actresses whom we don’t know the date of birth (with “unknown value”) and actresses manifestly dead centuries ago but whom we don’t know the date of death. But still! Here is our query and we answered Jean-No’s question as we could.
With labels
Well, we certainly answered the query but this big list of Q and numbers isn’t very human-readable. We should ask to have the labels too! We could do this the proper SPARQL way, with using RDFS and such, but we are querying Wikidata on the Wikidata endpoint, so we’ll use the local tool instead.
We add to the query:
SERVICE wikibase:label { bd:serviceParam wikibase:language "fr,en" . }
which means we ask to have the labels in French (as we are querying French people), and if it doesn’t exist in French, then in English. But just adding that doesn’t work: we need to SELECT
it too! (and to add it to the GROUP BY
). So:
SELECT ?person ?personLabel (SAMPLE(?birthDate) AS ?date) WHERE { ?person wdt:P31 wd:Q5 . ?person wdt:P21 wd:Q6581072 . ?person wdt:P27 wd:Q142 . ?person wdt:P106 ?occupation . ?occupation wdt:P279* wd:Q33999 . ?person wdt:P569 ?birthDate . FILTER NOT EXISTS { ?person wdt:P570 ?deathDate . } SERVICE wikibase:label { bd:serviceParam wikibase:language "fr,en" . } } GROUP BY ?person ?personLabel ORDER BY ?date
Tada! It’s much more understandable now!
Make it pretty
I don’t want all living French actresses, including the children, I want the oldest; so I add a LIMIT
to the results.[3]
} GROUP BY ?person ?personLabel ORDER BY ?date LIMIT 200
And why should I have a table with identifiers and names when I could have the results as a timeline? We ask that at the top, even before the SELECT
:[4]
#defaultView:Timeline
Wait, wait! Can I have pictures too? But only if they have a picture, I still want the results if they don’t have one (and I want only one picture if they happen to have several so I use again the SAMPLE
). YES WE CAN:
#defaultView:Timeline SELECT ?person ?personLabel (SAMPLE(?birthDate) AS ?date) (SAMPLE(?photo) AS ?pic) WHERE { ?person wdt:P31 wd:Q5 . ?person wdt:P21 wd:Q6581072 . ?person wdt:P27 wd:Q142 . ?person wdt:P106 ?occupation . ?occupation wdt:P279* wd:Q33999 . ?person wdt:P569 ?birthDate . FILTER NOT EXISTS { ?person wdt:P570 ?deathDate . } OPTIONAL { ?person wdt:P18 ?photo . } SERVICE wikibase:label { bd:serviceParam wikibase:language "fr,en" . } } GROUP BY ?person ?personLabel ORDER BY ?date LIMIT 200
Here the link to the results. I hope you had fun with this SundayQuery!
Header image:
Mademoiselle AmbroisiMademoiselle Ambroisine – created between 1832 and 1836 by an artist whose name is unreadable – Public Domain.ne – created between 1832 and 1836 by an artist whose name is unreadable – Public Domain.
Claire May 18, 2018, 1:39 p.m. ¶
Perfect tuto, thank you Harmonia Amanda ;)
Harmonia Amanda Sept. 11, 2016, 7:31 p.m. ¶
And thank you Ash_Crow for your help clarifying the use of SAMPLE. It's better like this.
Izno Sept. 13, 2016, 7:01 p.m. ¶
The upper range for most people living is 125-130 years. This seems like a good way to filter birth date. Another would be to filter out "unknown" death dates. How might those queries look? (Feel free to use my Wikidata talk page to respond.)