One of the key milestones for the MusicNet project is to match up similar records in a number of source catalogues. Specifically we are looking for records that represent the same musical composer in multiple data sets.
As discussed in the report for our first work package (WP1 Data Triage) the reason this task needs to be undertaken is two fold:
- Different naming conventions (e.g. Bach, Johann Sebastian or Bach, J. S.)
- User input errors (e.g. Bach, Johan Sebastien)
Due to the scale of the datasets the obvious approach is to try and minimise the work by computationally looking for matches. This is the approach we attempted when trying to solve this same problem for the musicSpace project (precursor to MusicNet).
Alignment Prototype 1: Alignment API
In musicSpace we used Alignment API to compare composers names as strings. Alignment API calculates the similarity of two strings and produces a metric used to measure how closely they match. It also uses WordNet to enable word stemming, synonym support and so on.
We had interns create a system that would produce Google Doc spreedsheets as outputs of data-set comparisons. The spreedsheet consisted of a column of composer names from one data set, next to a column of composer names from the data set to compare and finally the similarity metric.
The spreedsheets were produced to allow a musicologist with domain knowledge to validate the results. From investigating the results it was quickly clear that this approach had a number of flaws:
Problem 1: Similarity threshold
The threshold where to consider a match to be significantly similar was very hard to define. False positives began to appear with very high scores. This is obvious in the above image where “Smith, John Christopher I” is considered to be the same as “Smith, John Christopher II” purely because there is only one character different.
Problem 2: Proper-nouns
Composer names are proper-nouns and as such don’t work well with WordNets spelling correction. For example a “Bell, Andrew” might not be a spelling mistake of “Tell, Andrew“.
Problem 3: No context
The spreedsheet didn’t offer our Musicologist expert any context for the names. As Alignment API just compares strings we weren’t able to provide additional metadata such as data set source or a link to find more information about the specific composer.
The lack of this additional context proved to be the most crucial, without it there was no way that our expert could say for sure that two strings were indeed a match.
Alignment Prototype 2: Custom Tool
The first attempt was unsuccessful but having learned from the failings of the first attempt we decided to have another go. A completely automated solution was out of the question so we decided to turn our efforts to a custom tool that could quickly enable our expert to find all the relevant matches across data sets.
From our work on the mSpace faceted browser we’ve learnt a lot about data displayed in lists. We’ve learnt that unless filtered “ugly” data tends to float to the top. By “ugly” we refer to data which isn’t generally alphanumeric, for example:
We also know, from our work in creating an mSpace interface on EPrints, that multiple instances of the same name can normally be found very close to one another. This is largely because most sources tend to list the surname first and the ambiguity comes from how they represent the forenames, for example:
With these factors in mind we came up with the following UI for our second attempt at an alignment solution:
The left hand side contains two lists, one for all items that have yet to have been “grouped” and the second which shows all Grouped items. Each list has an A-Z control for quick indexing into the data and can be fully controlled by the keyboard. By selecting multiple items from the ungrouped list the user (at the touch of a single keystroke) can create a group from the selection.
The verified toggle lets the user see either expert-verified or non expert-verified groups. This is included as we provided a basic string matching algorithm in the tool to find likely matches. Unlike the first prototype, the string matches are a lot more rudimentary and hopefully produce fewer false positives (at the cost of fewer over all matches).
The algorithm produces a hash for each string it knows about by first converting all characters with diacritics to their equivalent alpha character. It then removes all non alpha’s from the string and converts the result to lowercase. This proved useful for our datasets as some data-sets suffixed a composers name with their birth and death dates, e.g. Schubert, Franz (1808-1878.). In these cases this would produce exactly the same hash as the non suffixed string and produce a likely candidate for a group.
Expert verification is still required but this algorithm helped present likely groupings to minimise the experts workload.
The right hand side is where the user can see extended metadata about one or more items. This view displays all relevant information that we know about each composer and can be used by the expert to decide if more than one instance are in fact the same.
Metadata Match is an additional visual prompt to help the expert decide whether composers are the same. If multiple composers are selected and their birth or death dates match they will be highlighted in grey. If the user hovers over the date with their mouse, all matching dates are highlighted.
Want to use the tool on your data?
The tool is currently in active development and when it reaches a stable state we will release an official download. It’s relatively generic and should be useful to any 3rd parties looking to perform a similar alignment/cleanup of their own data. If however you’re interested in having a play straight away then head on over to our Google Code site and grab the latest SVN.
Note: we can’t make the complete set of our music data available as we have access to it under a copyright licence, but we will in the near future provide import scripts and some sample data to serve as an example.