RSS
 

Archive for September, 2010

Alignment Tool (Beta Release 1)

30 Sep

There has been quite a bit of interest in the tool we’ve been developing to solve the (mis-)alignment of data across our multiple catalogs.

Today we’re pleased to be able to make a release available for download, albeit in an unsupported beta form!

Download Alignment Tool (beta 1) (Firefox 3.5+ or WebKit only)

Whilst we can’t make the data we’re working on available for download, we’ve created an example data set using names from the ECS Eprints repository. The use-case example here is that you have multiple name representations for specific individuals and you’d like to find the matches. Once installed the alignment tool will list all ‘Authors’ and when requested will present basic information about the papers/deposits that the author string was found on. For justifications on why the additional metadata/context is important be sure to read the previous post about the development process for the tool.

If you’d like to get the latest source-code for the tool as it evolves/improves you can check it out from our SVN repository which is hosted on the project’s Google Code page.

Finally, if you’d like to see what the Alignment tool should look like once you’ve got it installed then take a look at this screencast produced by our resident Musicologist David Bretherton for his recent presentation at AHM2010.

 
 

Governance Model

13 Sep

In conjunction with OSS Watch we’ve started to create a Governance Model for the MusicNet project. The first draft can now be downloaded from the Google Code repository:

Governance Model (Draft v1.0)

If you’d like to contribute to the project then please don’t hesitate to get in contact

 
 

Assisted Manual Data Alignment

08 Sep

One of the key milestones for the MusicNet project is to match up similar records in a number of source catalogues. Specifically we are looking for records that represent the same musical composer in multiple data sets.

As discussed in the report for our first work package (WP1 Data Triage) the reason this task needs to be undertaken is two fold:

  1. Different naming conventions (e.g. Bach, Johann Sebastian or Bach, J. S.)
  2. User input errors (e.g. Bach, Johan Sebastien)

Due to the scale of the datasets the obvious approach is to try and minimise the work by computationally looking for matches. This is the approach we attempted when trying to solve this same problem for the musicSpace project (precursor to MusicNet).

Alignment Prototype 1: Alignment API

In musicSpace we used Alignment API to compare composers names as strings. Alignment API calculates the similarity of two strings and produces a metric used to measure how closely they match. It also uses WordNet to enable word stemming, synonym support and so on.

We had interns create a system that would produce Google Doc spreedsheets as outputs of data-set comparisons. The spreedsheet consisted of a column of composer names from one data set, next to a column of composer names from the data set to compare and finally the similarity metric.

The spreedsheets were produced to allow a musicologist with domain knowledge to validate the results. From investigating the results it was quickly clear that this approach had a number of flaws:

Problem 1: Similarity threshold

The threshold where to consider a match to be significantly similar was very hard to define. False positives began to appear with very high scores. This is obvious in the above image where “Smith, John Christopher I” is considered to be the same as “Smith, John Christopher II” purely because there is only one character different.

Problem 2: Proper-nouns

Composer names are proper-nouns and as such don’t work well with WordNets spelling correction. For example a “Bell, Andrew” might not be a spelling mistake of “Tell, Andrew“.

Problem 3: No context

The spreedsheet didn’t offer our Musicologist expert any context for the names. As Alignment API just compares strings we weren’t able to provide additional metadata such as data set source or a link to find more information about the specific composer.

The lack of this additional context proved to be the most crucial, without it there was no way that our expert could say for sure that two strings were indeed a match.

Alignment Prototype 2: Custom Tool

The first attempt was unsuccessful but having learned from the failings of the first attempt we decided to have another go. A completely automated solution was out of the question so we decided to turn our efforts to a custom tool that could quickly enable our expert to find all the relevant matches across data sets.

From our work on the mSpace faceted browser we’ve learnt a lot about data displayed in lists. We’ve learnt that unless filtered “ugly” data tends to float to the top. By “ugly” we refer to data which isn’t generally alphanumeric, for example:

We also know, from our work in creating an mSpace interface on EPrints, that multiple instances of the same name can normally be found very close to one another. This is largely because most sources tend to list the surname first and the ambiguity comes from how they represent the forenames, for example:

With these factors in mind we came up with the following UI for our second attempt at an alignment solution:

Ungrouped/Grouped Lists

The left hand side contains two lists, one for all items that have yet to have been “grouped” and the second which shows all Grouped items. Each list has an A-Z control for quick indexing into the data and can be fully controlled by the keyboard. By selecting multiple items from the ungrouped list the user (at the touch of a single keystroke) can create a group from the selection.

Verified Toggle

The verified toggle lets the user see either expert-verified or non expert-verified groups. This is included as we provided a basic string matching algorithm in the tool to find likely matches. Unlike the first prototype, the string matches are a lot more rudimentary and hopefully produce fewer false positives (at the cost of fewer over all matches).

The algorithm produces a hash for each string it knows about by first converting all characters with diacritics to their equivalent alpha character. It then removes all non alpha’s from the string and converts the result to lowercase. This proved useful for our datasets as some data-sets suffixed a composers name with their birth and death dates, e.g. Schubert, Franz (1808-1878.). In these cases this would produce exactly the same hash as the non suffixed string and produce a likely candidate for a group.

Expert verification is still required but this algorithm helped present likely groupings to minimise the experts workload.

Context/Extra Metadata

The right hand side is where the user can see extended metadata about one or more items. This view displays all relevant information that we know about each composer and can be used by the expert to decide if more than one instance are in fact the same.

Metadata Match

Metadata Match is an additional visual prompt to help the expert decide whether composers are the same. If multiple composers are selected and their birth or death dates match they will be highlighted in grey. If the user hovers over the date with their mouse, all matching dates are highlighted.

Want to use the tool on your data?

The tool is currently in active development and when it reaches a stable state we will release an official download. It’s relatively generic and should be useful to any 3rd parties looking to perform a similar alignment/cleanup of their own data. If however you’re interested in having a play straight away then head on over to our Google Code site and grab the latest SVN.

Note: we can’t make the complete set of our music data available as we have access to it under a copyright licence, but we will in the near future provide import scripts and some sample data to serve as an example.

 
18 Comments

Posted in Software