Sunday Query: all surnames marked as disambiguation pages, with an English Wikipedia link and with « (surname) » in their English label

Harmonia_Amanda Wikidata

It’s Sunday again! Time for the queries! Last week I showed you the basics of SPARQL; this week I wanted to show you how we could use SPARQL to do maintenance work. I assume you now understand the use of PREFIX, SELECT, WHERE.

I have been a member of the WikiProject:Names for years. When I’m not working on Broadway and the Royal Academy of Dramatic Art archives,[1] I am one of the people who ensure that “given name:Christopher (Iowa)” is transformed back to “given name:Christopher (given name)”. Over the last weeks I’ve corrected thousands of wrong uses of the given name/family name properties, and for this work, I used dozens of SPARQL queries. I thought it could be interesting to show how I used SPARQL to create a list of strictly identical errors that I could then treat automatically.

If you read the constraints violations reports, you’ll see that the more frequent error for the property “family name” (P734) is the use of a disambiguation page as value instead of a family name. We can do a query like that:

SELECT ?person ?personLabel ?name ?nameLabel
WHERE {
    ?person wdt:P734 ?name . #the person has a family name
    ?name wdt:P31 wd:Q4167410 . #the item used as family name is a disambiguation page
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr" . } #We want to have the results in English or in French if there is no English label
}

link to the query. The results are in the thousands. Sigh.

But then we find something more interesting: there are entities which are both a disambiguation page and a family name. What?! That’s ontologically wrong. To use the wrong value as family name is human error; but an entity can’t be both a specific type of Wikimedia page and a family name. It’s like saying a person could as well be a book. Ontologically absurd. So all items with both P31 need to be corrected. How many are there?

SELECT DISTINCT ?name ?nameLabel (LANG(?label) AS ?lang)
WHERE {
    ?name wdt:P31 wd:Q101352 ; #the entity is a family name
          wdt:P31 wd:Q4167410 . #the entity is also a disambiguation page
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr" . } #We want to have the results in English or in French if there is no English label
}

link to the query.
Several thousands again. Actually, there are more entities which are both a disambiguation page and a family name than there are person using disambiguation pages as family names. This means there are family names/disambiguation pages in the database which aren’t used. They’re still wrong but it doesn’t show in violation constraints reports.

If we explore, we see than there are different cases out there: some of the family names/disambiguation pages are in reality disambiguation pages, some are family names, some are both (they link to articles on different Wikipedia, some about a disambiguation page and some about a family name; these need to be separated). Too many different possibilities: we can’t automatize the correction. Well… maybe we can’t.

If we can’t treat in one go all disambig/family name pages, maybe we can ask a more precise question. In our long list of violations, I asked for English label and I found some disbelieving ones. There were items named “Poe (surname)”. As disambiguation pages. That’s a wrong use of label, which shouldn’t have precisions about the subject in brackets (that’s what the description is for) but if they are about a surname they shouldn’t be disambiguation pages too! So, so wrong.

Querying labels

But still, the good news! We can isolate these entries. For that we’ll have to query not the relations between items but the labels of the items. Until now, we had used the SERVICE wikibase:label workaround, a tool which only exists on the Wikidata endpoint, because it was really easy and we only wanted to have human-readable results, not really to query labels. But now that we want to, the workaround isn’t enough, we’ll need to do it the real SPARQL way, using rdfs.

Our question now is: can I list all items which are both family names and disambiguation pages, whose English label contains “(surname)”?

SELECT DISTINCT ?name ?label (LANG(?label) AS ?lang)
WHERE {
    ?name wdt:P31 wd:Q101352 ; #the entity is a family name
          wdt:P31 wd:Q4167410 ; #the entity is also a disambiguation page
          rdfs:label ?label . #the entity have a label
    FILTER(LANG(?label) IN ("en")). #this label exists in English
    FILTER(CONTAINS(?label, "(surname)")). #this label contains a specific string
}

link to the query. We had several hundreds results.[2] You should observe the changes I made in the SELECT DISTINCT as I don’t use the SERVICE wikibase:label workaround.

Can we automatize correction now? Well… no. There is still problems. In this list, there are items which have links to several Wikipedia, the English one about the surname and the other(s) ones about a disambiguation page. Worse, there are items which don’t have an English interwiki any longer, because it was deleted or linked to another item (like the “real” family name item) and the wrong English label persisted. Si maybe we can filter our list to only items with a link to the English Wikipedia. For this, we’ll use schema.

SELECT DISTINCT ?name ?label (LANG(?label) AS ?lang)
WHERE {
    ?name wdt:P31 wd:Q101352 ; #the entity is a family name
          wdt:P31 wd:Q4167410 ; #the entity is also a disambiguation page
          rdfs:label ?label . #the entity have a label
    ?sitelink schema:about ?name .  #We want the entity to have a sitelink
              schema:inLanguage "en" ; #this sitelink is in English
              schema:isPartOf  <https://en.wikipedia.org/> . #and link to the English WP (and not Wikisource or others projects)
    FILTER(LANG(?label) IN ("en")). #the label exists in English
    FILTER(CONTAINS(?label, "(surname)")). #the label contains a specific string
}

link to the query. Well, that’s better! But our problem is still here: if they have several sitelinks, maybe the other(s) sitelink are not about the family name. So we want the items with an English interwiki and only an English interwiki. Like this:

SELECT DISTINCT ?name ?label (LANG(?label) AS ?lang)
WHERE {
    ?name wdt:P31 wd:Q101352 ; #the entity is a family name
          wdt:P31 wd:Q4167410 ; #the entity is also a disambiguation page
          rdfs:label ?label . #the entity have a label
    ?sitelink schema:about ?name .  #We want ?name to have a sitelink
    ?WParticle schema:about ?name ; #We'll define the characteristic of the sitelink
               schema:inLanguage "en" ; #this sitelink is in English
               schema:isPartOf  <https://en.wikipedia.org/> . #and link to the English WP (and not Wikisource or others projects)
    FILTER(LANG(?label) IN ("en")). #the label exists in English
    FILTER(CONTAINS(?label, "(surname)")). #the label contains a specific string
} GROUP BY ?name ?label HAVING (COUNT(DISTINCT ?sitelink) = 1) #With only one sitelink

link to the query.

Several things: we separated ?sitelink and ?WParticle. We use ?sitelink to query the number of sitelinks, and ?WParticle to query the particular of this sitelink. Note that we need to use GROUP BY, like last week.

Polishing of the query

Just to be on the safe side (we are never safe enough before automatizing corrections) we’ll also check that all items on our list are only family name/disambiguation pages; they’re not also marked as a location or something equally strange. So we query that they have only two P31 (instance of), these two being defined as Q101352 (family name) and Q4167410 (disambiguation page).

SELECT DISTINCT ?name ?label (LANG(?label) AS ?lang)
WHERE {
    ?name wdt:P31 ?type ; #the entity use the property P31 (instance of)
          wdt:P31 wd:Q101352 ; #the entity is a family name
          wdt:P31 wd:Q4167410 ; #the entity is also a disambiguation page
          rdfs:label ?label . #the entity have a label
    ?sitelink schema:about ?name .  #We want ?name to have a sitelink
    ?WParticle schema:about ?name ; #We'll define the characteristic of the sitelink
               schema:inLanguage "en" ; #this sitelink is in English
               schema:isPartOf  <https://en.wikipedia.org/> . #and link to the English WP (and not Wikisource or others projects)
    FILTER(LANG(?label) IN ("en")). #the label exists in English
    FILTER(CONTAINS(?label, "(surname)")). #the label contains a specific string
} GROUP BY ?name ?label HAVING ((COUNT(DISTINCT ?type) = 2) && (COUNT(DISTINCT ?sitelink) = 1)) #With only two P31 and one sitelink

link to the query.

It should give you a beautiful “no matching records found”. Yesterday, it gave me 175 items which I knew I could correct automatically. Which I have done, with a python script made by Ash_Crow. If you are good, he’ll make a #MondayScript in response to this #SundayQuery!

Header image:

Name List of Abhiseka – Public Domain, photograph done by Kūkai.

Tags:

0 comments posted.

Footnotes

  1. Yes, I really want you to read this one.

  2. Then. We had several hundreds result then. I’m happy to say it isn’t true now.

Comments

Comments are closed.