Behind the scenes, I’m working hard at writing a much improved Beardscratchers Compendium while still trying to trickle out new features in the current version. Recent browsing will have revealed the automatic inclusion of abstracts for both artists and releases direct from Wikipedia. Implementing this feature seemed, at first glance, to be very simple.

In practice it ended up requiring the use of a completely separate API, lots of RTFMing, and plenty of blind hacking. Let’s start with Wikipedia.

The Wikipedia Problem

Wikipedia is built on top of the MediaWiki software, so its content is fully accessible via MediaWiki’s API without Wikipedia needing to build one itself. Great, so what’s the problem?! Go check out the documentation in that API link. Actually accessing Wikipedia content directly involves a lot of hard work. I had imagined it would be a simple case of “query artist name” -> “display article text”. I should try and remember that nothing is ever that simple.

Let’s say we want to use the MediaWiki API to retrieve the content for Sting’s Wikipedia entry, to display it on Sting’s Compendium entry. A read of the docs tells us the URL we’re after looks like http://en.wikipedia.org/w/api.php?action=query&prop=revisions&pageids=123456&rvprop=content&format=xml. The important bit here is that we need a pageid value to retrieve the content.
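In PHP, pulling that down is only a few lines – a rough sketch, assuming allow_url_fopen is enabled (a curl call would do the same job):

<?php
// Rough sketch: fetch the revision content for a known page id.
$pageId = 123456; // placeholder id
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions'
     . '&pageids=' . urlencode($pageId)
     . '&rvprop=content&format=xml';

$xml = @simplexml_load_string(file_get_contents($url));

// The article text comes back inside <rev> nodes in the response.
if ($xml !== false) {
    foreach ($xml->xpath('//rev') as $rev) {
        echo (string) $rev;
    }
}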

The only useful query value we do have is the artist’s name “Sting”. A further read of the documentation tells us we can use another API call to search the Wikipedia database with a free-text query. However, take note that Sting’s page on Wikipedia is actually Sting_(musician). Like many articles on Wikipedia, it’s a disambiguated title, used to distinguish identically named articles. There are many articles entitled “Sting”. So how do we know which one to actually retrieve once we’ve managed to get a list of article pageids with the title “Sting”?
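(As an aside, that free-text search call looks something like the following – a sketch using the list=search module; check the API docs for the full set of parameters:)

http://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=Sting&format=xml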

The short answer is we can’t. Not without lots of string parsing, munging and making assumptions in code. It’s not really possible to do this without producing a lot of false positives.

To make matters worse, have a look at the data returned in the XML response. All the content is still in wiki format. Even if we were able to pull the exact articles we needed, the content would have to be run through a MediaWiki parser before it could be output in a usable format. I’ve as yet been unable to find a decent standalone wiki parser written in PHP. Please add a link in the comments if you do know of one (PEAR isn’t standalone!).

At this point I pretty much gave up… until I happened across the mighty Freebase.

The Freebase Solution

What is Freebase? The official blurb says it’s “a massive, collaboratively-edited database of cross-linked data.” In essence, it’s an encyclopaedia like Wikipedia, but it favours facts, relationships and explicit data over written content.

It’s Wikipedia for machines, and is a seriously fantastic idea. I’m not sure how I’ve previously managed to miss it. Freebase connects up many external data resources as well as its own data and gives them meaning, structure and relationships. The open community pitches in and helps maintain and expand the databases. Metaweb then provides a hugely-featured open API to access this data, complete with its own comprehensive query language – MQL. While I’m a huge advocate of genuine REST APIs with real RESTful endpoints, the flexibility and potential of the Freebase approach for an open webservice has got me very excited. It’s SOAP, but without all the rubbish SOAP introduces.

So how does it help with this problem of accessing Wikipedia content? As mentioned, Freebase connects many existing data-sets together in a structured manner. Two of these data-sets are Wikipedia and a beardscratcher’s favourite, Musicbrainz. Suddenly one of the world’s largest music databases is unambiguously connected with one of the world’s largest encyclopaedias, providing a huge mine of accurately related and structured information.

Connecting Freebase and Wikipedia

Covering the ins-and-outs of working with the Freebase API is well beyond the scope of this entry, and is expertly covered in the Make section on freebase.com.

In summary, the API has two core calls – database read and database write. Both simply take a single MQL query and return a response. You can experiment with MQL in their handy query editor tool.

As I’m not much of a Sting fan, I’m going to continue this entry with a more interesting artist, My Brightest Diamond. Taking a look at the Freebase front-end, there’s lots to discover about the artist in the database. Here, we’re only interested in a few small, specific pieces of data; namely, what is the Wikipedia entry for the artist “My Brightest Diamond”?

I’m repeating myself now, but I’ll reiterate that Freebase entries are interlinked datasets, and this relationship is formulated (in part) by identifying ‘keys’. Such keys are Freebase object types (like a ‘music artist’ or ‘animal’) or keys from external datasets like the Wikipedia article ID we’re after and the Musicbrainz MBID that uniquely identifies an artist in the Musicbrainz database. The MQL we need to use to query Freebase for an artist’s keys looks like the following:

{
 "query" : {
  "name":"My Brightest Diamond",
  "type":"/music/artist",
  "limit":1,
  "key" : [{
    "namespace" : null,
    "value" : null
   }]
 }
}
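Firing that query at Freebase’s mqlread service from PHP looks roughly like the following – treat the endpoint URL and response handling as a sketch and check the Freebase docs for the specifics:

<?php
// Rough sketch: submit the MQL query above to the mqlread service and decode the JSON response.
$mql = array(
    'query' => array(
        'name'  => 'My Brightest Diamond',
        'type'  => '/music/artist',
        'limit' => 1,
        'key'   => array(array('namespace' => null, 'value' => null)),
    ),
);

$url = 'http://api.freebase.com/api/service/mqlread?query=' . urlencode(json_encode($mql));
$response = json_decode(file_get_contents($url), true);

// On success, the filled-in object sits under 'result'; the keys we're after are in its 'key' list.
$keys = isset($response['result']['key']) ? $response['result']['key'] : array();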

This MQL should be fairly self-explanatory. We’re asking for a “/music/artist” with the name “My Brightest Diamond” and want just one result. The null values indicate the properties of the query that we want returned. It’s like saying “Hey Freebase, I’m stuck on a few things. Can you fill in the rest plz?”.

Freebase responds with a number of keys, a subset of which looks like:

{
 "namespace": "/authority/musicbrainz",
 "value": "15f835dc-ee52-4b74-b889-113678f54119"
},
{
 "namespace": "/wikipedia/en_id",
 "value": "7490642"
}

Perfect. It appears we have the exact page id to use in a query to Wikipedia for the artist’s entry. What’s also fantastic is that we can actually verify the match by checking the MBID it’s linked to, if we have it available (the Compendium always has the MBID available). There are a surprising number of artists with identical names!
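In code, that verification is just a walk over the returned keys – a rough sketch, with hypothetical variable names:

<?php
// Sketch: scan the keys returned by Freebase, confirm the MusicBrainz id matches the one
// we already hold, and pull out the English Wikipedia page id.
// $keys comes from the mqlread response above; $knownMbid is hypothetical.
$wikipediaId = null;
$mbidMatches = false;

foreach ($keys as $key) {
    if ($key['namespace'] === '/authority/musicbrainz' && $key['value'] === $knownMbid) {
        $mbidMatches = true;
    }
    if ($key['namespace'] === '/wikipedia/en_id') {
        $wikipediaId = $key['value'];
    }
}

// Only trust the Wikipedia id if the MBID lines up.
if (!$mbidMatches) {
    $wikipediaId = null;
}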

Finally retrieving Wikipedia content

It doesn’t end there. Recall that I mentioned the output of the MediaWiki API is wiki-encoded content; that’s exactly what a query with our newly-found page id would give us:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&pageids=7490642&rvprop=content&format=xml

Screw this approach. Let’s do things old-school, and find an actual wikipedia.org page that uses the page id value and returns something approaching HTML or just plain text. Bingo, printable version. Well, at least someone will be making use of a printable version link (I certainly can’t recall the last time I actually needed to use one).
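The printable page can be reached with the page id via a URL along these lines (the exact parameters are a best guess at what the printable link uses, so verify before relying on them):

http://en.wikipedia.org/w/index.php?curid=7490642&printable=yes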

And that’s it! One accurately matched artist bio. With a little bit of strip_tags() and preg_match() voodoo, and a touch of substr(), extracts of Wikipedia articles now appear on both artist entries and release entries.
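For the curious, that voodoo boils down to something like this stripped-down sketch – the pattern and excerpt length are illustrative, not the Compendium’s actual values:

<?php
// Illustrative sketch only: fetch the printable page, pull out an early paragraph,
// strip the markup and trim it down to an excerpt. Real code would need to skip
// infobox cruft and empty paragraphs.
$html = file_get_contents($printableUrl); // the printable URL shown above

if (preg_match('/<p>(.+?)<\/p>/s', $html, $matches)) {
    $abstract = strip_tags($matches[1]);   // drop any inline markup
    $abstract = substr($abstract, 0, 500); // keep a short excerpt
}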
