 

MusicNet URI scheme and Linked Data hosting

19 Jan

MusicNet’s key contribution is the minting of authoritative URIs for musical composers, which link to records for those composers in different scholarly and commercial catalogues and collections. MusicNet claims authority because the alignment across the sources has been performed by scholars in musicology. The alignment tool and the progress to date have been detailed previously. In this post I will give an overview of our methodology for publishing our work: the decisions made in choosing our URI scheme, and how we model the information using RDF in the exposed Linked Data. I will then describe the architecture for generating the Linked Data, which has been designed to be easily deployed and maintained, so that it can be hosted centrally in perpetuity by a typical higher education computer science department.

URI Scheme

The URI scheme is designed to expose minimal structural information. For example, the URI for Franz Schubert is currently (see below for a volatility note):

http://musicnet.mspace.fm/person/7ca5e11353f11c7d625d9aabb27a6174#id

It comprises the domain name (musicnet.mspace.fm), an abstract type (person), an ID taken from the musicSpace hash of the composer (7ca5e11353f11c7d625d9aabb27a6174) and a fragment to differentiate the person from the document that describes them (#id).
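As a rough illustration of that structure, the sketch below splits a MusicNet-style URI into its four parts. The function name `musicnet_uri_parts` and the returned array keys are made up for this example, and the assumption that every ID is 32 lowercase hexadecimal characters is drawn only from the single Schubert URI shown above.

```php
<?php
// Split a MusicNet-style URI into domain, type, ID and fragment.
// Hypothetical helper: the name, the array keys and the 32-hex-character
// ID pattern are assumptions made for this sketch.
function musicnet_uri_parts($uri)
{
    $parsed = parse_url($uri);
    if ($parsed === false || !isset($parsed['host'], $parsed['path'])) {
        return null;
    }

    // The path is expected to look like "/person/<32 hex characters>"
    if (!preg_match('#^/([a-z]+)/([0-9a-f]{32})$#', $parsed['path'], $m)) {
        return null;
    }

    return array(
        'domain'   => $parsed['host'],
        'type'     => $m[1],
        'id'       => $m[2],
        // parse_url() returns the fragment without the leading "#"
        'fragment' => isset($parsed['fragment']) ? $parsed['fragment'] : '',
    );
}

$parts = musicnet_uri_parts('http://musicnet.mspace.fm/person/7ca5e11353f11c7d625d9aabb27a6174#id');
```

A label-based path such as /person/schubert does not match the pattern and yields null, which mirrors the decision to keep human-readable names out of the URI.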

We have chosen a hash rather than a human-readable label because we want to stop people picking a URI on the strength of its label alone, assuming that it refers to one composer when it actually refers to another. This matters in this domain because there are a number of composers with the same or similar names, and part of the alignment process is having musicologists make this distinction. By forcing people to resolve the URI and check that it is the person they mean, we aim to avoid incorrect references being made. In addition, it gives us the freedom to alter the canonical label for a composer after we have minted the URI, so that we never end up with a label-based URI whose metadata carries a different label.

Domain Name

We intend for the domain name to change soon to one which isn’t explicitly tied to mSpace; the current one is in place for our own convenience. In particular, our requirements are for a domain that will not cost us anything to re-register in future, will remain in our control (i.e. not get domain parked if someone forgets to renew), and will not dissuade people from using it for any partisan or political reasons. The closest we might reasonably get is musicnet.data.ac.uk, although this is still unconfirmed, and we may instead have to use musicnet.soton.ac.uk or musicnet.ecs.soton.ac.uk. These are not preferred, since they might give the impression that the data is a Southampton-centric view of the information, which it is not. For a more in-depth discussion of the proposed solutions see our previous posts (data.ac.uk proposal & data.ac.uk revisited).

Ontological Constructs

In addition to the scheme for the URI, we also had to determine the best way to expose the data in terms of the ontological constructs (specifically the class types and predicates) used in the published RDF. We are fortunate that an excellent set of linked data in the musical composer domain already exists, in the form of the BBC /music linked data. For example, the BBC /music site exposes Franz Schubert with the URI:

http://www.bbc.co.uk/music/artists/f91e3a88-24ee-4563-8963-fab73d2765ed#artist

The BBC’s data uses the Music Ontology heavily, as well as other ontologies such as SKOS, Open Vocab and FOAF. Since we are publishing similar data, it makes sense for us to use the same terms and predicates as they do where possible, which is what we have done.

We are still in the process of finalising how we will model the different labels of composers. In the figure below we offer two possible methods. The first is to create a URI for each composer for every catalogue that they are listed in, publish the label from that catalogue under the new catalogue-based URI, and use owl:sameAs to link it to our canonical MusicNet one. The second is to “flatten” all labels into simple skos:altLabel links, although this method loses provenance. Currently we do both, and we’ve not yet decided whether this is necessary or useful.

 

RDF model for MusicNet alternative labels
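To make the difference between the two label models concrete, here is a minimal sketch that emits N-Triples for each. The example.org catalogue URIs are placeholders, and the use of skos:prefLabel for the per-catalogue label is an assumption; the post does not say which predicate carries that label.

```php
<?php
const OWL_SAME_AS     = 'http://www.w3.org/2002/07/owl#sameAs';
const SKOS_ALT_LABEL  = 'http://www.w3.org/2004/02/skos/core#altLabel';
const SKOS_PREF_LABEL = 'http://www.w3.org/2004/02/skos/core#prefLabel'; // assumed predicate

// Quote a plain literal for N-Triples, escaping quotes, backslashes and line breaks
function nt_literal($value)
{
    return '"' . addcslashes($value, "\"\\\n\r") . '"';
}

// Method 1: one URI per catalogue entry, linked to the canonical URI (keeps provenance)
function labels_per_catalogue($canonical, array $labels)
{
    $triples = array();
    foreach ($labels as $catalogueUri => $label) {
        $triples[] = "<$catalogueUri> <" . SKOS_PREF_LABEL . "> " . nt_literal($label) . " .";
        $triples[] = "<$catalogueUri> <" . OWL_SAME_AS . "> <$canonical> .";
    }
    return $triples;
}

// Method 2: flatten every distinct label onto the canonical URI (loses provenance)
function labels_flattened($canonical, array $labels)
{
    $triples = array();
    foreach (array_unique(array_values($labels)) as $label) {
        $triples[] = "<$canonical> <" . SKOS_ALT_LABEL . "> " . nt_literal($label) . " .";
    }
    return $triples;
}

$canonical = 'http://musicnet.mspace.fm/person/7ca5e11353f11c7d625d9aabb27a6174#id';
$labels = array(
    // Placeholder catalogue URIs, not real MusicNet identifiers
    'http://example.org/catalogue-a/schubert#id' => 'Schubert, Franz',
    'http://example.org/catalogue-b/schubert#id' => 'Schubert, Franz Peter',
);

$perCatalogue = labels_per_catalogue($canonical, $labels);
$flattened    = labels_flattened($canonical, $labels);
```

The first method produces two triples per catalogue entry, so a consumer can still tell which catalogue used which spelling; the second produces one triple per distinct label and nothing else.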


 

 

Content Negotiation & Best Practice

Similarly, we also follow the BBC /music model of using content negotiation and HTTP 303 redirects to serve machine-readable RDF and human-readable HTML from the same URI. Specifically, the model we’ve borrowed is to append “.rdf” when redirecting to the RDF view of the data, and to append “.html” when redirecting to the human-readable view of the data. This is now implemented, and you can try this out yourself with the above URIs, which you can turn into the following:

http://musicnet.mspace.fm/person/7ca5e11353f11c7d625d9aabb27a6174.rdf
http://musicnet.mspace.fm/person/7ca5e11353f11c7d625d9aabb27a6174.html
http://www.bbc.co.uk/music/artists/f91e3a88-24ee-4563-8963-fab73d2765ed.rdf
http://www.bbc.co.uk/music/artists/f91e3a88-24ee-4563-8963-fab73d2765ed.html
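The redirect decision can be sketched in a few lines of PHP. This is not the script we deploy, only an outline of the idea: the function names are invented for the example, and the matching is deliberately simplified (it ignores q-values and recognises only application/rdf+xml as a request for RDF).

```php
<?php
// Pick the static representation to redirect to, based on an HTTP Accept header.
// Simplified on purpose: q-values are ignored and only one RDF media type is known.
function preferred_extension($accept)
{
    foreach (explode(',', (string) $accept) as $range) {
        // Drop parameters such as ";q=0.9" and surrounding whitespace
        $pieces = explode(';', $range);
        $type = strtolower(trim($pieces[0]));
        if ($type === 'application/rdf+xml') {
            return 'rdf';
        }
    }
    return 'html';
}

// Build the Location value for the 303 response
function see_other_location($documentUri, $accept)
{
    return $documentUri . '.' . preferred_extension($accept);
}

// In a gateway script the redirect itself could then be sent with:
// header('Location: ' . see_other_location($uri, $_SERVER['HTTP_ACCEPT']), true, 303);

$doc = 'http://musicnet.mspace.fm/person/7ca5e11353f11c7d625d9aabb27a6174';
$forBrowser = see_other_location($doc, 'text/html,application/xhtml+xml;q=0.9');
$forCrawler = see_other_location($doc, 'application/rdf+xml');
```

Because both targets are plain files, the web server only ever serves static content after the redirect.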

There are several other offerings from the MusicNet site, some of which have been detailed before. First, there is the MusicNet Codex, which is the human-facing search engine for MusicNet. In addition, we have also created a (draft!) VoID document that describes the MusicNet data set, available here:

http://musicnet.mspace.fm/void#

The perceptive among you will notice that the VoID document links to an RDF dump of all of the individual Linked Data files, available here (14MB at time of writing):

http://musicnet.mspace.fm/dump.rdf#

Simple Deployment & Hosting

As noted above, our requirements state that our deployment must be as simple as possible for typical higher education computer science department web admins to maintain. In our bid we stated that we will work with the Southampton ECS Web Team to tweak our solution. As such, in order to keep our deployment simple, we have adopted an architecture where all RDF (including the individual Linked Data files for each composer) is generated once and hosted statically. The content negotiation method (mentioned above) makes serving static RDF files simple and easy to understand for web admins who might not know much about the Semantic Web. Similarly, the VoID document and RDF dump are generated at the same time. The content negotiation is handled by a simple PHP script and some Apache URL rewriting.

Benefits of Linked Data

One of the benefits of using Linked Data is that we can easily integrate metadata from different sources. One way in which we exploit this is by linking to the aforementioned BBC /music Linked Data, using MusicBrainz as the bridge. One of the sources of metadata that we have aligned is MusicBrainz, based on a data dump we were given by the LinkedBrainz project team. The BBC have also aligned their data to MusicBrainz, and thus we have been able to automatically cross-reference the composers at the BBC with the composers in MusicNet. Linking directly to the BBC offers a number of benefits. Firstly, users can access BBC content, such as recent radio and television recordings that feature those composers (see the Franz Schubert link above for examples). Secondly, we can harvest some of the BBC’s outward links in order to enrich our own Linked Data offering. Specifically, we have harvested links that the BBC make to pages on IMDb, DBpedia and Wikipedia, among others, which we now re-publish.
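The cross-referencing step amounts to a join on the shared MusicBrainz identifier. The sketch below shows the idea under two assumptions: that the identifier embedded in a BBC /music artist URI is the MusicBrainz ID (as the Schubert example suggests), and that the alignment output can be reduced to a simple map from MusicNet URI to MusicBrainz ID. The function name and the second, all-zero identifier are placeholders.

```php
<?php
// Derive BBC /music artist URIs for MusicNet composers via a shared MusicBrainz ID.
// Assumes BBC artist URIs embed the MusicBrainz ID; bbc_links() is an illustrative name.
function bbc_links(array $musicnetToMbid, array $bbcMbids)
{
    $known = array_flip($bbcMbids);   // MBID => index, for quick membership checks
    $links = array();
    foreach ($musicnetToMbid as $musicnetUri => $mbid) {
        if (isset($known[$mbid])) {
            $links[$musicnetUri] = "http://www.bbc.co.uk/music/artists/$mbid#artist";
        }
    }
    return $links;
}

$musicnetToMbid = array(
    'http://musicnet.mspace.fm/person/7ca5e11353f11c7d625d9aabb27a6174#id'
        => 'f91e3a88-24ee-4563-8963-fab73d2765ed',
    // Placeholder entry for a composer the BBC does not list
    'http://example.org/person/unlisted#id'
        => '00000000-0000-0000-0000-000000000000',
);
$bbcMbids = array('f91e3a88-24ee-4563-8963-fab73d2765ed');

$links = bbc_links($musicnetToMbid, $bbcMbids);
```

Only composers present on both sides receive a link, so unmatched composers are simply left without a BBC reference rather than being guessed at.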

The data flow from the raw data sources to linked data serving is illustrated in the figure below.

MusicNet Architecture Data Flow Diagram


Future Work

The following tasks remain in this area of the project:

  1. Acquire control of a long-term domain name (preferably musicnet.data.ac.uk, see above).
  2. Discuss our RDF model with experts in Linked Data, Ontological Modelling and Provenance.
  3. Determine if we will offer a SPARQL endpoint in future. If we decide not to ourselves (because it might not be sustainable once our hosting is passed over to the department), it might be desirable to put the data on the Data Incubator SPARQL host.

This post documents Work Package 3 from the MusicNet project deliverables. MusicNet is funded through the JISCEXPO programme.

 

Posted by Daniel Alexander Smith in Documentation, Software

 

End of year roundup

30 Dec

It’s been a busy 2010 for the MusicNet project and we’ve made great progress. Our Alignment Tool is now becoming more mature and the latest code is showing significant increases in task speed, making the workflow much more efficient. We expect to be able to release Beta 2 in the new year, so keep checking back for more details.

Alignment Progress (Work Package 4)

The performance and usability improvements to our Alignment Tool (Work Package 2) have had a dramatic effect on our overall alignment progress. We are now at 56% complete, which places us firmly on target to complete the entire dataset before the proposed deadline (end of March 2011).

Codex/User Portal (Work Package 6.2)

Work has also begun on the MusicNet Codex, which aims to be a single point of search for musicologists to find information and links into our data partners’ catalogs. Although this is in the very early beta stages, it is functional and we are adding more composers as and when they are aligned.

Visit the beta of the MusicNet Codex: http://codex.musicnet.mspace.fm

The Codex publicly demonstrates for the first time the outputs of the Alignment Tool and shows the integration with the LinkedBrainz project (read about our meetup with the LinkedBrainz project).

Please feel free to leave any feedback on the Codex in the comments.

Linked Data (Work Package 5)

Work is underway to convert the output of the Alignment Tool into usable Linked Data. In the New Year we plan to release our proposed URI Scheme (Work Package 3) and also expose an early version of our data/alignment as Linked Data.

 

Posted by Joe Lambert in Documentation

 

Performance & Usability Improvements

09 Dec

Lately we’ve been working hard to improve the workflow of our Alignment Tool. Based on the real user experience of the musicologist using the tool daily, we were able to implement some simple performance and UX (User Experience) updates that have had a dramatic effect on efficiency.

The improvements we’ve made and the effect they’ve had on the workflow are outlined below – although some of these may seem simple, it’s only in hindsight and after real-world user testing that the need for such tweaks becomes apparent.

Starting Point

When we released Beta 1 of our tool, we ran some benchmarks to get a sense of how quickly the alignment task could be achieved. Initially we found that the rate at which you could create verified matches was: 135/hr.

Performance Improvements, Phase 1

Alphabetically auto-sort newly created groups

In Beta 1, newly created groups were pushed to the bottom of the groups list so as to reduce the need to re-sort and re-render a potentially large list – a procedure that typically doesn’t perform well in a Javascript environment.  However, this is problematic from a UX perspective: after creating a new group, the alphabetical sorting of the grouped item column is corrupted, which makes comparing its contents to the alphabetically sorted ungrouped column more time consuming.

Happily, recent improvements to JavaScript engines mean that native functions such as Array.sort(callback) are now fast enough to let the browser itself perform the re-ordering, rather than us coding it in a JavaScript routine. By altering the behaviour of the alignment tool so that groups added to the grouped item column are now listed in their correct alphabetical position, we were able to improve the user experience and remove the difficulty in comparing the contents of the ungrouped and grouped item columns. In testing the change we were unable to make the browser stall or freeze during the re-sort/re-render and did not notice a reduction in interface speed or responsiveness.

Scroll to newly created groups

In Beta 1 this action was the default, as we knew that the new group would be at the bottom of the list. In changing to the re-sorting model we needed to work out where the newly created group resided in the list and then scroll the list to make sure the element was visible in the list’s viewport.

As it turns out this was quite a simple process: the server returns the ID of the newly created group, which allows us to find the element in the DOM after it has been sorted. We can then work out how much to scroll the list based on the newly created group element’s offsetTop property.

We also added a small JavaScript ‘blink’ animation to draw the user’s attention to the newly created group.

Highlight List Items

When examining the associated metadata in the right-hand metadata view pane for a group that has been suggested by the system, the items’ metadata might sometimes be ordered differently to how the items were listed in the column. Using the Beta 1 code this posed a problem if the entries’ labels were all identical, as it meant there was no way to identify which metadata belonged to which list item.

To solve this we added a hover effect in the metadata view pane which highlights the associated list item, allowing for much quicker and more accurate removal of single items.

Phase 1 Improvements led to a new match rate of 169/hr (25% improvement on Beta 1)

Performance Improvements, Phase 2

Fix Diacritics

A lot of the tooling we used to generate the data files required as input to the Alignment Tool didn’t handle diacritics as well as expected. Specifically, all the composers we had imported from the Grove database had escape characters placed in front of any diacritic. The diacritic remained intact, but there were extra characters in the string.

We programmatically removed these characters to aid readability during the alignment process.
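A clean-up of this kind can be a one-line regular expression. The sketch below assumes the stray escape character is a backslash, which is purely a guess for illustration; the actual character in the Grove export is not stated here. The function name is likewise hypothetical.

```php
<?php
// Remove an escape character that directly precedes a non-ASCII character.
// Assumption: the escape character is a backslash (not confirmed by the post).
function strip_diacritic_escapes($label)
{
    // The /u flag makes the class match whole UTF-8 code points above U+007F;
    // the lookahead leaves the diacritic itself untouched.
    return preg_replace('/\\\\(?=[^\x00-\x7F])/u', '', $label);
}

$clean = strip_diacritic_escapes('Dvo\\ř\\ák, Anton\\ín');
```

Backslashes that precede plain ASCII characters are left alone, so the clean-up cannot damage labels that never contained a diacritic.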

Create a Merge function

One feature we were missing in Beta 1 that we anticipated we might need was the ability to merge two or more groups into one. The most common use cases where this was required are (i) where the system generates two different groups for the same composer based on two recurring variations in name usage, or (ii) where the user creates a new group for ungrouped items before realising that a suitable group for these items already existed.

This function has now been added and can be found in the Alignment Tool SVN repository on Google Code.

Phase 2 Improvements led to a new rate of 279/hr (107% improvement on Beta 1)

 

Posted by Joe Lambert in Documentation

 

MusicNet at SDH 2010

12 Nov

Austrian Parliament Building

Last month I was very pleased to be able to present the work of the musicSpace and MusicNet research teams at the Supporting the Digital Humanities 2010 conference (Vienna, 19-20 October 2010), which was jointly organized by CLARIN and DARIAH. The musicology session was convened by PhD student Richard Lewis, and also featured presentations by Alan Marsden and Frans Wiering. Our paper explained how the motivation for MusicNet came out of our previous work on the musicSpace project.

Please download the slides from our presentation below, and take a look at the other presentations from the conference at http://www.dariah.eu/index.php?option=com_docman&Itemid=200.

 

Posted by David Bretherton in Dissemination

 

MusicNet & LinkedBrainz Meetup

25 Oct

Overview

Last Friday the MusicNet team headed to QMUL to meet with Kurt Jacobson & Simon Dixon from the LinkedBrainz project. LinkedBrainz is also funded by the JISC Expose (#jiscexpo) programme and is working, in conjunction with MusicBrainz, to produce an official Linked Data mapping for the MusicBrainz database. You can follow their progress on the project blog at http://linkedbrainz.c4dmpresents.org/.

What does a collaboration look like?

As well as learning a bit more about each other’s projects, we were able to look at a few ways in which we might be able to collaborate over the coming months. We also came away with a data export from the most recent MusicBrainz database for all Classical musicians. This will essentially allow us to link our exposed composer URIs directly to the MusicBrainz (or LinkedBrainz) equivalents. This will greatly increase the utility of our URIs, especially as organisations such as the BBC are already using the MusicBrainz IDs.

Adding “same-as” links to LinkedBrainz is only one side of the solution. Ideally we would also convince the MusicBrainz community to provide the reverse linking. This is likely to be a longer-term outcome and one we should approach once the sustainability of URIs issue has been resolved (data.ac.uk?).

How will we align our URIs to LinkedBrainz?

We’ll use our custom-built Alignment Tool! Over the last few months we’ve spent quite a while engineering the tool and making sure it’s as re-usable as possible, so we plan to add the LinkedBrainz data as though it were just another partner’s catalog. This means that once our Musicology expert has performed the alignment we’ll not only know the overlaps between our partners’ catalogs but we’ll also know how they map to MusicBrainz and, by proxy, to Wikipedia, the BBC, etc.

 

Posted by Joe Lambert in Uncategorized

 

Alignment Tool Implementation

19 Oct

In this post we’ll discuss a little about the implementation of the relatively simple server component of the Alignment Tool. You can read more about the tool in previous posts (Beta Release, Assisted Manual Data Alignment), or download the source yourself and have a play.

Server Application Component

Our servers run a typical LAMP (Linux, Apache, MySQL, PHP) stack, and although they can also run Python, Perl & Ruby we decided that, given the experience of the project team, we would develop the server component in PHP. Usually when we need to write a PHP-driven application we would reach for the Kohana Framework.

Kohana is an elegant HMVC PHP5 framework that provides a rich set of components for building web applications

HMVC (or Hierarchical-MVC) is an extension to the more commonly used MVC (Model View Controller). HMVC is mainly useful for building the modular “widgets” that make up a webpage; we won’t discuss it further as it isn’t needed for MusicNet.

In MVC, each object in a system is separated into one of the following groups:

  1. Model: Objects which make up the datastructures used in the system
  2. View: Typically the UI
  3. Controller: Where application specific code is implemented

MVC allows for proper code separation and makes for easier design and maintenance.

Using Kohana, each HTTP request to the server is interpreted as a method call on a controller object. For example:

http://myserver.com/api/get_tags

This URL equates to calling the public function get_tags() on the controller object Api.

Lightweight PHP Framework

We felt that requiring the Kohana Framework for the server component of the Alignment Tool was a bit heavyweight, but we still wanted the flexibility of a lightweight MVC architecture in which to quickly code the AJAX API used by the JavaScript client. So, taking inspiration from Kohana’s URL interpreting, we wrote a lightweight framework of our own.

To achieve this we needed 3 distinct parts:

  1. URL Interpreting
  2. Controller Object
  3. Abstract UI Rendering

URL Interpreting

To enable Kohana-style requests we first needed to route all URL requests through a single PHP gateway script. As we’re using the LAMP stack this is easily done using Apache’s mod_rewrite. Our .htaccess file in our /ajax folder looks like this:

# Turn on URL rewriting
RewriteEngine On
 
# Installation directory
RewriteBase /
 
# Allow any files or directories that exist to be displayed directly
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
 
# Rewrite all other URLs to index.php/URL
RewriteRule .* /ajax/index.php/$0 [PT,L]

The key part of this file is the last line: here we tell Apache to send all requests made below the ajax folder, other than those for files or directories that exist, to the index file.

For our purposes we only need a single controller object so our bootstrapping code (index.php) only needs to work out the method/action and the arguments. Our bootstrap script looks like this:

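require_once('ajax.php');
 
// PATH_INFO holds everything after index.php, e.g. "/grouped/42"
$path = isset($_SERVER['PATH_INFO']) ? $_SERVER['PATH_INFO'] : '';
 
$parts = explode('/', trim($path, '/'));
 
// explode() always returns at least one element, so test for an empty first segment
if($parts[0] !== '')
{
   // First segment is the method/action, the rest are its arguments
   $method = array_shift($parts);
   $args = $parts;
}
else
{
   // No method requested: 'index' is not defined on Ajax, so __call() answers with a 404
   $method = 'index';
   $args = array();
}
 
$ajax = new Ajax();
call_user_func_array(array($ajax, $method), $args);
$ajax->output();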

Controller Object

Now that all the routing is taken care of we just need a simple Controller class with a public function for each method/action in our API:

class Ajax
{
   private $status = 200;
   private $message = "Success";
   private $data = array();
 
   // Fetch all Ungrouped items
   public function ungrouped()
   { }
 
   // Fetch all Grouped items
   public function grouped()
   { }
}

We also need an output() function as this is what the bootstrap script calls to send output to the client.

public function output()
{
   if($this->status == 404)
      header('HTTP/1.0 404 Not Found');
 
   $output = array(
      "status"	=> $this->status,
      "message"	=> $this->message,
      "data"		=> $this->data,
   );
 
   header("Content-Type: application/json");
   echo json_encode($output);
}

It’s also a good idea to implement the __call method in case the client makes an unrecognised request:

public function __call($method, $args)
{
   $this->status = 404;
   $this->message = "Unknown method: '$method'";
}
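To see how routing and the __call fallback interact, here is a small self-contained sketch. AjaxDemo, result() and dispatch() are names invented for this example, not part of the Alignment Tool source, and the response is returned as an array rather than echoed as JSON so that it is easy to inspect.

```php
<?php
// A cut-down controller in the spirit of the Ajax class, for demonstration only.
class AjaxDemo
{
    private $status = 200;
    private $message = "Success";
    private $data = array();

    // A known action: fills the response with some sample items
    public function ungrouped()
    {
        $this->data = array('Bach, J. S.', 'Bach, Johann Sebastian');
    }

    // Catch-all for method names that are not defined on the class
    public function __call($method, $args)
    {
        $this->status = 404;
        $this->message = "Unknown method: '$method'";
    }

    // Return the response instead of sending it, so callers can examine it
    public function result()
    {
        return array(
            'status'  => $this->status,
            'message' => $this->message,
            'data'    => $this->data,
        );
    }
}

// Turn a PATH_INFO-style string into a method call, as the bootstrap script does
function dispatch($pathInfo)
{
    $parts = explode('/', trim($pathInfo, '/'));
    $method = array_shift($parts);
    $controller = new AjaxDemo();
    call_user_func_array(array($controller, $method), $parts);
    return $controller->result();
}

$ok = dispatch('/ungrouped');
$missing = dispatch('/no_such_method/42');
```

The second call never reaches a real method, so the controller reports a 404 status with the offending name in the message while the data array stays empty.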

Abstract UI Rendering

The final piece of the system is to enable abstract UI rendering. In one of the calls in the Alignment Tool, the server is required to return HTML rather than JSON. To remove this rendering from the Controller class and to enable 3rd parties (we hope the Alignment Tool will be useful to others too!) to write their own Views for their own data we use PHP’s Output buffering:

ob_start();
 
include_once("views/musicnet.php");
 
$html = ob_get_contents();
ob_end_clean();

By including the file in this way the View script has all the same variable scope as the method in the Controller object. Here’s an extract from our View file to give an idea of how it can be used.

<?php foreach($this->data as $item): ?>
	<div class="item" id="info-<?=$item->id?>">
		<h1><?=$item->label?></h1>
		<ul class="metadata">
			<?php if(isset($item->metadata->Birth_Date)): ?>
				<li><span class="title">Birth Date</span><?=$item->metadata->Birth_Date?></li>
			<?php endif; ?>
			<?php if(isset($item->metadata->Death_Date)): ?>
				<li><span class="title">Death Date</span><?=$item->metadata->Death_Date?></li>
			<?php endif; ?>
		</ul>
	</div>
<?php endforeach; ?>
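The output-buffering approach can be wrapped in a small helper so that any view file can be rendered to a string. This is a sketch rather than the tool’s own code: render_view() is a hypothetical name, the view receives its input as a local $data variable instead of $this->data, and the throwaway view written to a temporary file exists only to make the example self-contained.

```php
<?php
// Render a view file to a string using output buffering.
// The view can read $data because include shares this function's variable scope.
function render_view($viewFile, array $data)
{
    ob_start();
    include $viewFile;
    return ob_get_clean();   // fetch the buffer and stop buffering in one call
}

// Demonstration with a throwaway view written to a temporary file
$viewFile = tempnam(sys_get_temp_dir(), 'view');
file_put_contents(
    $viewFile,
    '<?php foreach ($data as $item): ?><h1><?php echo htmlspecialchars($item); ?></h1><?php endforeach; ?>'
);

$html = render_view($viewFile, array('Bach, J. S.', 'Schubert & Co'));
unlink($viewFile);
```

Escaping values with htmlspecialchars() in the view is a precaution worth considering whenever labels come from third-party catalogues.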
 

Posted by Joe Lambert in Uncategorized

 

MusicNet at AHM 2010

13 Oct

City Hall, Cardiff

Last month at the UK e-Science All Hands Meeting 2010 at City Hall in Cardiff (13-16 September 2010), we gave our first conference paper about the MusicNet project. Thank you to everyone that came to our session and asked questions. It was informative to learn that many other delegates have encountered datasets (across a range of subjects, from geography to chess!) in which synonymous entities are not aligned; precisely the problem which the alignment tool we are building for MusicNet aims to address.

A short abstract of our paper is given below, but please also take a look at the extended abstract and our presentation slides:

Thank you to the organising committee and administrators for making the event run so smoothly.

ABSTRACT: The MusicNet Composer URI Project
Daniel Alexander Smith, David Bretherton, Joe Lambert, and mc schraefel

In any domain, a key activity of researchers is to search for and synthesize data from multiple sources in order to create new knowledge. In many cases this process is laborious, to the point of making certain questions nearly intractable because the cost of the searches outstrips the time available to complete the research. As more resources are published as Linked Data, data from multiple heterogeneous sources should be more rapidly discoverable and automatically integrable, enabling previously intractable queries to be explored, and standard queries to be significantly accelerated for more rapid knowledge discovery. But Linked Data is not of itself a complete solution. One of the key challenges of Linked Data is that its strength is also a weakness: anyone can publish anything. So in classical music, for instance, 17 sources may publish data about ‘Schubert’, but there is no de facto way to know that any of these Schuberts are the same, because the sources are not aligned. Without alignment, much of the benefit of Linked Data is diminished: resources can effectively be stranded rather than discovered, or tangled nets of only guessed at associations in a particular dataset can end up costing more than their value to untangle.

The MusicNet project, which emerged out of Southampton’s musicSpace project, is set to address the challenge just outlined by “minting” URIs for key musicology assets to provide a framework for the effective exploration of Linked Data about classical music. Unique URIs will be minted for each composer that exists in our data partners’ datasets. Basic biographical data will also be exposed, as well as name variants in different sources to allow for compatibility with legacy data. Crucially, this information will be curated by domain experts so that MusicNet will become a reliable source of data about the names of classical music composers. However, the real benefit of this work is that it will align identifiers across data sources, which is a prerequisite for the creation of Linked Data classical music and musicology resources, if such resources are to be optimally useful and usable.

The establishment of authoritative URIs for composers, and moreover the disambiguation of composers in online data sources that will flow from this, is an essential first step in the provision of Linked Data services for classical music and musicology. Our work will provide a model and tools that can usefully be employed elsewhere.

 

Posted by David Bretherton in Dissemination

 

Alignment Tool (Beta Release 1)

30 Sep

There has been quite a bit of interest in the tool we’ve been developing to solve the (mis-)alignment of data across our multiple catalogs.

Today we’re pleased to be able to make a release available for download, albeit in an unsupported beta form!

Download Alignment Tool (beta 1) (Firefox 3.5+ or WebKit only)

Whilst we can’t make the data we’re working on available for download, we’ve created an example data set using names from the ECS Eprints repository. The use-case example here is that you have multiple name representations for specific individuals and you’d like to find the matches. Once installed the alignment tool will list all ‘Authors’ and when requested will present basic information about the papers/deposits that the author string was found on. For justifications on why the additional metadata/context is important be sure to read the previous post about the development process for the tool.

If you’d like to get the latest source-code for the tool as it evolves/improves you can check it out from our SVN repository which is hosted on the project’s Google Code page.

Finally, if you’d like to see what the Alignment tool should look like once you’ve got it installed then take a look at this screencast produced by our resident Musicologist David Bretherton for his recent presentation at AHM2010.

 

Posted by Joe Lambert in Software

 

Governance Model

13 Sep

In conjunction with OSS Watch we’ve started to create a Governance Model for the MusicNet project. The first draft can now be downloaded from the Google Code repository:

Governance Model (Draft v1.0)

If you’d like to contribute to the project then please don’t hesitate to get in contact.

 

Posted by Joe Lambert in Documentation

 

Assisted Manual Data Alignment

08 Sep

One of the key milestones for the MusicNet project is to match up similar records in a number of source catalogues. Specifically we are looking for records that represent the same musical composer in multiple data sets.

As discussed in the report for our first work package (WP1 Data Triage), the reason this task needs to be undertaken is twofold:

  1. Different naming conventions (e.g. Bach, Johann Sebastian or Bach, J. S.)
  2. User input errors (e.g. Bach, Johan Sebastien)

Due to the scale of the datasets the obvious approach is to try and minimise the work by computationally looking for matches. This is the approach we attempted when trying to solve this same problem for the musicSpace project (precursor to MusicNet).

Alignment Prototype 1: Alignment API

In musicSpace we used Alignment API to compare composers’ names as strings. Alignment API calculates the similarity of two strings and produces a metric used to measure how closely they match. It also uses WordNet to enable word stemming, synonym support and so on.

We had interns create a system that would produce Google Docs spreadsheets as outputs of data-set comparisons. Each spreadsheet consisted of a column of composer names from one data set, next to a column of composer names from the data set being compared, and finally the similarity metric.

The spreadsheets were produced to allow a musicologist with domain knowledge to validate the results. From investigating the results it was quickly clear that this approach had a number of flaws:

Problem 1: Similarity threshold

The threshold above which two strings should be considered a match was very hard to define. False positives began to appear with very high scores. For instance, “Smith, John Christopher I” is considered to be the same as “Smith, John Christopher II” purely because there is only one character different.

Problem 2: Proper-nouns

Composer names are proper nouns and as such don’t work well with WordNet’s spelling correction. For example, “Bell, Andrew” might not be a misspelling of “Tell, Andrew”.
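Both problems are easy to reproduce with any edit-distance measure. The sketch below uses PHP’s built-in levenshtein() as a stand-in; it is not the metric that Alignment API computes, and name_similarity() is a name made up for the example.

```php
<?php
// Normalised edit-distance similarity in the range 0..1 (1 means identical).
// Stand-in metric for illustration; not what Alignment API itself calculates.
function name_similarity($a, $b)
{
    $longest = max(strlen($a), strlen($b));
    if ($longest === 0) {
        return 1.0;   // two empty strings are trivially identical
    }
    return 1.0 - levenshtein($a, $b) / $longest;
}

// Two different people whose names differ by a single character
$smith = name_similarity('Smith, John Christopher I', 'Smith, John Christopher II');

// Two names that look like a typo of each other but need not be
$bell = name_similarity('Bell, Andrew', 'Tell, Andrew');
```

Both pairs are a single edit apart and score well above 0.9, so no threshold high enough to accept genuine variants would also reject them. Only additional context can separate such cases.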

Problem 3: No context

The spreadsheet didn’t offer our Musicologist expert any context for the names. As Alignment API just compares strings, we weren’t able to provide additional metadata such as the data set source or a link to find more information about the specific composer.

The lack of this additional context proved to be the most crucial flaw: without it there was no way that our expert could say for sure that two strings were indeed a match.

Alignment Prototype 2: Custom Tool

The first attempt was unsuccessful but having learned from the failings of the first attempt we decided to have another go. A completely automated solution was out of the question so we decided to turn our efforts to a custom tool that could quickly enable our expert to find all the relevant matches across data sets.

From our work on the mSpace faceted browser we’ve learnt a lot about data displayed in lists. We’ve learnt that unless filtered “ugly” data tends to float to the top. By “ugly” we refer to data which isn’t generally alphanumeric, for example:

We also know, from our work in creating an mSpace interface on EPrints, that multiple instances of the same name can normally be found very close to one another. This is largely because most sources tend to list the surname first and the ambiguity comes from how they represent the forenames, for example:

With these factors in mind we came up with the following UI for our second attempt at an alignment solution:

Ungrouped/Grouped Lists

The left-hand side contains two lists: one for all items that have yet to be grouped, and a second which shows all grouped items. Each list has an A-Z control for quick indexing into the data and can be fully controlled by the keyboard. By selecting multiple items from the ungrouped list the user can, with a single keystroke, create a group from the selection.

Verified Toggle

The verified toggle lets the user see either expert-verified or non-expert-verified groups. This is included because we provide a basic string matching algorithm in the tool to find likely matches. Unlike in the first prototype, the string matching is a lot more rudimentary and hopefully produces fewer false positives (at the cost of fewer matches overall).

The algorithm produces a hash for each string it knows about by first converting all characters with diacritics to their equivalent unaccented character. It then removes all non-alphabetic characters from the string and converts the result to lowercase. This proved useful for our datasets, as some suffixed a composer’s name with their birth and death dates, e.g. Schubert, Franz (1808-1878.). In these cases the suffixed string produces exactly the same hash as the non-suffixed string, making the pair a likely candidate for a group.

Expert verification is still required, but this algorithm helps present likely groupings to minimise the expert’s workload.
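The normalisation steps described above can be sketched as follows. The folding table is a small hand-written sample, the function names are invented for the example, and md5() is used only to turn the normalised key into a fixed-length hash; which hash function the tool really uses is not stated here.

```php
<?php
// Reduce a label to a comparison key: fold diacritics, keep letters only, lowercase.
// The folding table is a tiny sample (an assumption); real data needs a fuller one.
function match_key($label)
{
    $fold = array(
        'á' => 'a', 'à' => 'a', 'ä' => 'a', 'é' => 'e', 'è' => 'e',
        'í' => 'i', 'ó' => 'o', 'ö' => 'o', 'ú' => 'u', 'ü' => 'u',
        'ř' => 'r', 'č' => 'c', 'š' => 's', 'ž' => 'z', 'ñ' => 'n',
        'Á' => 'A', 'É' => 'E', 'Ö' => 'O', 'Ü' => 'U', 'Š' => 'S',
    );
    $ascii = strtr($label, $fold);                        // step 1: fold diacritics
    $letters = preg_replace('/[^A-Za-z]/', '', $ascii);   // step 2: drop non-letters
    return strtolower($letters);                          // step 3: lowercase
}

// Fixed-length hash of the key; md5() is an assumption made for this sketch
function match_hash($label)
{
    return md5(match_key($label));
}

$withDates    = match_key('Schubert, Franz (1808-1878.)');
$withoutDates = match_key('Schubert, Franz');
$folded       = match_key('Dvořák, Antonín');
```

Because digits and punctuation are discarded, the dated and undated Schubert labels collapse to the same key and would be offered to the expert as a candidate group.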

Context/Extra Metadata

The right hand side is where the user can see extended metadata about one or more items. This view displays all relevant information that we know about each composer and can be used by the expert to decide if more than one instance are in fact the same.

Metadata Match

Metadata Match is an additional visual prompt to help the expert decide whether composers are the same. If multiple composers are selected and their birth or death dates match they will be highlighted in grey. If the user hovers over the date with their mouse, all matching dates are highlighted.
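On the data side, Metadata Match reduces to finding values that occur more than once among the selected items. The tool does this in the browser; the PHP sketch below only illustrates the logic, and the function name and array layout are assumptions made for the example.

```php
<?php
// Return, per date field, the values shared by two or more of the selected composers.
// The array layout and function name are assumptions for this sketch.
function matching_dates(array $composers)
{
    $matches = array();
    foreach (array('Birth_Date', 'Death_Date') as $field) {
        $seen = array();
        foreach ($composers as $composer) {
            if (!empty($composer[$field])) {
                $seen[] = (string) $composer[$field];
            }
        }
        // Count occurrences and keep only values seen more than once
        $counts = array_count_values($seen);
        $shared = array_keys(array_filter($counts, function ($n) {
            return $n > 1;
        }));
        // Numeric-looking array keys come back as integers, so cast back to strings
        $matches[$field] = array_map('strval', $shared);
    }
    return $matches;
}

$selected = array(
    array('label' => 'Schubert, Franz (1808-1878.)', 'Birth_Date' => '1808', 'Death_Date' => '1878'),
    array('label' => 'Schubert, Franz',              'Birth_Date' => '1808', 'Death_Date' => '1878'),
    array('label' => 'Schubert, Franz Peter',        'Birth_Date' => '1797', 'Death_Date' => '1828'),
);

$shared = matching_dates($selected);
```

Here the first two entries share both dates and would be highlighted, while the third stands apart despite the similar name.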

Want to use the tool on your data?

The tool is currently in active development and when it reaches a stable state we will release an official download. It’s relatively generic and should be useful to any 3rd parties looking to perform a similar alignment/cleanup of their own data. If however you’re interested in having a play straight away then head on over to our Google Code site and grab the latest SVN.

Note: we can’t make the complete set of our music data available as we have access to it under a copyright licence, but we will in the near future provide import scripts and some sample data to serve as an example.

 

Posted by Joe Lambert in Software