Posts Tagged ‘semanticweb’

Add MusicNet data to COPAC with ‘Composed’ bookmarklet

03 Oct

Here’s a great example of how the MusicNet data can be used to enhance existing sites. The ‘Composed’ bookmarklet decorates an existing COPAC composer record with all the extra information that MusicNet contains about that person.

Head on over to this blog post for more details. Incidentally, this was created as an entry to the UK Discovery competition we blogged about earlier in the year.

So, who’s going to turn this into a GreaseMonkey script so that the bookmarklet isn’t needed?

 
 

UK Discovery Developer Competition features the MusicNet dataset

11 Jul

The MusicNet dataset has been included as part of the UK Discovery global developer competition. The rules of the competition are simple: build an app/tool that makes use of at least one of the 10 featured datasets.

UK Discovery is working with libraries, archives and museums to open up data about their resources for free re-use and aggregation. DevCSI is working with developers in the education sector, many of whom will have innovative ideas about how to exploit this open data in new applications.

This Developer Competition runs throughout July 2011. It starts on Monday 4 July – Independence Day, a good day for liberating data – and closes on Monday 1 August. It’s open to anyone anywhere in the world.

For more information about the competition see http://discovery.ac.uk/developers/competition/. Prizes are available for the best entrants, and the competition ends on Monday 1 August 2011.

 
 

Final Product Post: MusicNet & The Alignment Tool

29 Jun

This is a final report and roundup of the MusicNet project. We’ll mainly be discussing the primary outputs of the project, but we’ll also give an overview of the project as a whole.

We have two primary prototypal outputs/products from the project:

  1. The Alignment Tool
  2. The MusicNet Codex

We’ll discuss each of these in turn and address what they are, who they are for and how you can use them in your own projects.


 

MLDW Slides

22 May

We are pleased to post below the slides from the presentations at the Music Linked Data Workshop (JISC, London, 12 May 2011). Thank you to our presenters for providing their slides.

 

MLDW Programme & Abstracts

10 May

Music Linked Data Workshop, JISC, London, 12 May 2011

Programme

10:30 – Welcome

Morning Session: Research Papers
Chaired by Richard Polfreman (Music, University of Southampton)

10:35 – MusicNet: Aligning Musicology’s Metadata
David Bretherton, Daniel Alexander Smith, Joe Lambert and mc schraefel (Music, and Electronics and Computer Science, University of Southampton)

11:05 – Towards Web-Scale Analysis of Musical Structure
J. Stephen Downie (Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign), David De Roure (Oxford e-Research Centre, University of Oxford) and Kevin Page (Oxford e-Research Centre, University of Oxford)

11:35 – LinkedBrainz Live
Simon Dixon, Cedric Mesnage and Barry Norton (Centre for Digital Music, Queen Mary University of London)

12:05 – BBC Music – Using the Web as our Content Management System
Nicholas Humfrey (BBC)

12:35 – Lunch

Afternoon Session: Funding & Project Presentations
Chaired by David Bretherton (Music, University of Southampton)

13:30 – JISC Funding Roadmap for 2011-12
David Flanders (JISC)

13:45 – Early Music Online: Opening up the British Library’s 16th-Century Music Books
Sandra Tuppen (British Library)

14:00 – Musonto – A Semantic Search Engine Dedicated to Music and Musicians
Jean-Philippe Fauconnier (Université Catholique de Louvain, Belgium) and Joseph Roumier (CETIC, Belgium)

14:15 – Listening to Movies – Creating a User-Centred Catalogue of Music for Films
Charlie Inskip (freelance music consultant)

14:30 – Q & A and Discussion Session
Chaired by Geraint Wiggins (Department of Computing, Goldsmiths, University of London)

16:00 – End

Abstracts for Morning Research Papers

MusicNet: Aligning Musicology’s Metadata

David Bretherton, Daniel Alexander Smith, Joe Lambert and mc schraefel (Music, and Electronics and Computer Science, University of Southampton)

As more resources are published as Linked Data, data from multiple heterogeneous sources should be more rapidly discoverable and automatically integrable, enabling it to be reused in contexts beyond those originally envisaged. But Linked Data is not of itself a complete solution. One of the key challenges of Linked Data is that its strength is also a weakness: anyone can publish anything. So in music, for instance, 17 sources may independently publish data about ‘Schubert’, but there is no de facto way to know that any of these Schuberts are the same, because the sources are not aligned. Alignment is a prerequisite for usable Linked Data, without which resources are effectively stranded rather than integrated. To begin to address this, the MusicNet project has minted URIs for composers, and has published as RDF basic biographical data and – crucially – alignment information for several leading providers of musicological data.
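
By way of illustration, here is a minimal sketch (in Python, using rdflib) of what such alignment data can look like: a minted composer URI carrying basic biographical data plus owl:sameAs links to other providers’ records. All URIs below are hypothetical placeholders, not MusicNet’s actual identifiers.

```python
# A minimal sketch of Linked Data alignment with rdflib.
# All URIs are hypothetical placeholders, not MusicNet's real identifiers.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, OWL, RDF

g = Graph()
schubert = URIRef("http://musicnet.example.org/composer/schubert-franz")

# Basic biographical data attached to the minted URI
g.add((schubert, RDF.type, FOAF.Person))
g.add((schubert, FOAF.name, Literal("Schubert, Franz")))

# Alignment: assert that other providers' records denote the same person
for other in (
    "http://dbpedia.org/resource/Franz_Schubert",
    "http://provider.example.com/composers/1797",
):
    g.add((schubert, OWL.sameAs, URIRef(other)))

print(g.serialize(format="turtle"))
```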

Towards Web-Scale Analysis of Musical Structure

J. Stephen Downie (Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign), David De Roure (Oxford e-Research Centre, University of Oxford) and Kevin Page (Oxford e-Research Centre, University of Oxford)

SALAMI (Structural Analysis of Large Amounts of Music Information) is an ambitious computational musicology project which applies a computational approach to the huge volume of digital recordings now available from such sources as the Internet Archive. It aims to deliver a very substantive corpus of musical analyses in a common framework for use by music scholars, students and beyond, and to establish a web-based methodology and tooling which will enable others to add to this in the future. In its first phase the project has conducted a significant exercise in ground truth collection with 1000 recordings analysed by music students and shortly to be published as open Linked Data.

LinkedBrainz Live

Simon Dixon, Cedric Mesnage and Barry Norton (Centre for Digital Music, Queen Mary University of London)

The MusicBrainz dataset is a large open (openly-licensed and open to contribution) collection of metadata about music, containing information on artists, their recorded works, and acoustic fingerprints. The LinkedBrainz project aims at making MusicBrainz Linked Data compliant. Linked Data principles require that the data is made available using an RDF serialisation over HTTP, and that this is interlinked with existing datasets. Linked Data Best Practice encourages an endpoint where queries can be made using the SPARQL query language for RDF. The LinkedBrainz project is rolling out an RDFa annotation of the relevant MusicBrainz pages, and is preparing a SPARQL endpoint and RDF-based dereferencing. In this talk we will give further details on progress and future work, and will show the utility of the dataset as Linked Data by demonstrating the ease with which ‘mash-ups’ can be formed, based on interlinkage with resources such as DBPedia and BBC Music Reviews.

BBC Music – Using the Web as our Content Management System

Nicholas Humfrey (BBC)

The BBC Music site provides a page for every artist played on the BBC. These pages use persistent web identifiers for each artist, which serve as an aggregation point for all content and information. By reusing structured data available elsewhere on the Web, the Web becomes our Content Management System. The core metadata is then enhanced with content, such as videos and reviews, from the BBC, thereby providing a compelling audience proposition and also making the BBC content re-aggregatable by other websites, thus contributing to the web as a whole.

 

Assisted Manual Data Alignment

08 Sep

One of the key milestones for the MusicNet project is to match up similar records in a number of source catalogues. Specifically we are looking for records that represent the same musical composer in multiple data sets.

As discussed in the report for our first work package (WP1 Data Triage), the reason this task needs to be undertaken is twofold:

  1. Different naming conventions (e.g. Bach, Johann Sebastian or Bach, J. S.)
  2. User input errors (e.g. Bach, Johan Sebastien)

Due to the scale of the datasets, the obvious approach is to try to minimise the work by computationally looking for matches. This is the approach we attempted when trying to solve this same problem for the musicSpace project (the precursor to MusicNet).

Alignment Prototype 1: Alignment API

In musicSpace we used Alignment API to compare composers’ names as strings. Alignment API calculates the similarity of two strings and produces a metric that measures how closely they match. It also uses WordNet to enable word stemming, synonym support and so on.

We had interns create a system that would produce Google Docs spreadsheets as outputs of data-set comparisons. Each spreadsheet consisted of a column of composer names from one data set, next to a column of composer names from the data set being compared, and finally the similarity metric.

The spreadsheets were produced to allow a musicologist with domain knowledge to validate the results. From investigating the results it quickly became clear that this approach had a number of flaws:

Problem 1: Similarity threshold

The threshold at which to consider a match significantly similar was very hard to define. False positives began to appear with very high scores: for example, “Smith, John Christopher I” was considered to be the same as “Smith, John Christopher II” purely because there is only one character of difference.
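
To make the problem concrete, here is a quick sketch using Python’s standard difflib as a stand-in for the Alignment API metrics we actually used (the exact scores Alignment API produces will differ): the one-character false positive scores higher than a genuine spelling variant.

```python
# A rough illustration of the thresholding problem, with difflib standing in
# for Alignment API's string-similarity metrics.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

# A genuine variant (naming convention / input error) ...
print(similarity("Bach, Johann Sebastian", "Bach, Johan Sebastien"))        # ~0.93

# ... scores *lower* than a false positive differing by a single character
print(similarity("Smith, John Christopher I", "Smith, John Christopher II"))  # ~0.98
```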

Problem 2: Proper-nouns

Composer names are proper nouns and as such don’t work well with WordNet’s spelling correction. For example, “Bell, Andrew” might not be a spelling mistake of “Tell, Andrew”.

Problem 3: No context

The spreadsheet didn’t offer our musicologist expert any context for the names. As Alignment API just compares strings, we weren’t able to provide additional metadata such as the data set source or a link to find more information about the specific composer.

The lack of this additional context proved to be the most crucial flaw: without it there was no way our expert could say for sure that two strings were indeed a match.

Alignment Prototype 2: Custom Tool

The first attempt was unsuccessful, but having learned from its failings we decided to have another go. A completely automated solution was out of the question, so we turned our efforts to a custom tool that could quickly enable our expert to find all the relevant matches across data sets.

From our work on the mSpace faceted browser we’ve learnt a lot about data displayed in lists. We’ve learnt that, unless filtered, “ugly” data tends to float to the top. By “ugly” we mean data that isn’t generally alphanumeric, such as entries beginning with punctuation or other symbols.

We also know, from our work in creating an mSpace interface on EPrints, that multiple instances of the same name can normally be found very close to one another. This is largely because most sources tend to list the surname first, and the ambiguity comes from how they represent the forenames (e.g. “Bach, Johann Sebastian” vs “Bach, J. S.”).
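
Both observations are easy to reproduce with a plain lexicographic sort; here is a toy Python sketch with made-up entries:

```python
# Made-up entries illustrating both observations: under a naive lexicographic
# sort, "ugly" (non-alphanumeric) values float to the top, while variants of
# the same surname cluster next to each other.
entries = [
    "Bach, Johann Sebastian",
    "???",
    "Bach, J. S.",
    "Schubert, Franz",
    "*** unknown ***",
    "Bach, J.S.",
]

for name in sorted(entries):
    print(name)
# *** unknown ***
# ???
# Bach, J. S.
# Bach, J.S.
# Bach, Johann Sebastian
# Schubert, Franz
```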

With these factors in mind we came up with the following UI for our second attempt at an alignment solution:

Ungrouped/Grouped Lists

The left-hand side contains two lists: one for all items that have yet to be grouped, and a second showing all grouped items. Each list has an A-Z control for quick indexing into the data, and can be fully controlled from the keyboard. By selecting multiple items from the ungrouped list, the user can (at the touch of a single keystroke) create a group from the selection.

Verified Toggle

The verified toggle lets the user see either expert-verified or non-expert-verified groups. This is included because we provide a basic string matching algorithm in the tool to find likely matches. Unlike the first prototype, the string matching is far more rudimentary, and should produce fewer false positives (at the cost of fewer overall matches).

The algorithm produces a hash for each string it knows about by first converting all characters with diacritics to their equivalent alpha characters. It then removes all non-alpha characters from the string and converts the result to lowercase. This proved useful for our datasets, as some data sets suffixed a composer’s name with their birth and death dates, e.g. Schubert, Franz (1808-1878.). In these cases the hash is exactly the same as that of the non-suffixed string, making the pair a likely candidate for a group.
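
A minimal Python sketch of that hashing step might look like this (the tool’s actual implementation may differ):

```python
# A sketch of the grouping hash described above: fold diacritics to their
# base characters, drop everything that isn't a letter, and lowercase.
import unicodedata

def group_hash(name: str) -> str:
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    no_diacritics = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Keep alpha characters only, lowercased
    return "".join(c for c in no_diacritics if c.isalpha()).lower()

# Date-suffixed and plain forms of a name hash identically,
# so they surface as a likely group for the expert to verify:
print(group_hash("Schubert, Franz (1808-1878.)"))  # -> "schubertfranz"
print(group_hash("Schubert, Franz"))               # -> "schubertfranz"
```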

Expert verification is still required, but this algorithm helped present likely groupings to minimise the expert’s workload.

Context/Extra Metadata

The right-hand side is where the user can see extended metadata about one or more items. This view displays all the relevant information that we know about each composer, and can be used by the expert to decide whether multiple instances are in fact the same person.

Metadata Match

Metadata Match is an additional visual prompt to help the expert decide whether composers are the same. If multiple composers are selected and their birth or death dates match, they are highlighted in grey. If the user hovers over a date with the mouse, all matching dates are highlighted.

Want to use the tool on your data?

The tool is currently in active development, and when it reaches a stable state we will release an official download. It’s relatively generic and should be useful to any third parties looking to perform a similar alignment/cleanup of their own data. If, however, you’re interested in having a play straight away, then head on over to our Google Code site and grab the latest SVN.

Note: we can’t make the complete set of our music data available as we have access to it under a copyright licence, but we will in the near future provide import scripts and some sample data to serve as an example.

 


 

data.ac.uk revisited

18 Aug

Following on from Dan’s recent post about data.ac.uk, there is a growing consensus that this is a necessary step to facilitate the growth of usable linked data in Higher Education. Next week there will be a public briefing paper from JISC on the topic. We expect that it will announce, among other things, that the data.ac.uk domain name has been ring-fenced to provide a data.gov.uk-esque repository for HE.

The tools to visualise this type of data have been around for a long time, but as yet there has been no long-term strategy for maintaining the actual data that drives these tools.

Why data.ac.uk?

So why do we need a centralised datastore for HE data? Why can’t we just host it locally on data.southampton.ac.uk?

In our original post Dan highlighted our stance on the need for institutional agnosticism for the data we are creating for this project:

We also found that it would be best to not use a subdomain of our school (such as musicnet.soton.ac.uk), since this would be seen as partisan to the school/university and is likely to get less uptake than something at a higher level (such as musicnet.ac.uk).

It doesn’t really make sense to expose the MusicNet data (a set of canonical names and datapoints for classical music composers) at a URI which contains the institution (in our case the University of Southampton). The data has global significance, and it just so happens that we are the ones tasked with exposing it. Our data is a perfect fit for a non-politically-aligned data.ac.uk implementation.

In his post ‘Time for data.ac.uk? Or a local data.open.ac.uk?’, Tony Hirst raises some interesting questions about where we should be storing data:

Another possible source of data in a raw form is from the data.gov.uk education datastore (an example can be found via here), which makes me wonder about the extent to which a data.ac.uk website might just be an HE/FE view over that wider datastore? (Related: @kitwallace on University data.) And then maybe, hence: would data.*.ac.uk be a view over data.ac.uk for a particular institution? Or *.sch.ac.uk a view over a data.sch.ac.uk view over the full education datastore?

These questions are yet to be answered, but he does make an interesting point, drawing on a presentation given by Mike Nolan at mashlib2010. At present, much of the information that could be (re-)exposed at minimal cost is the syndicated data that most institutions already make available: RSS, CalDAV, etc. I would argue, however, that these institution-specific datasets, being themselves already intrinsically politically aligned, would be a better fit for localised hosting at a data.southampton.ac.uk-type URI. The URI then implies ownership and lends some context/authority to the data held there.

So perhaps there is room for both data.ac.uk AND data.southampton.ac.uk?

How to make data.ac.uk work?

The only way we can make any datastore work is if the data is available in formats that people are able to make use of! For MusicNet we intend to host RDF/XML on our local server (until a data.ac.uk alternative becomes a reality), using standard content negotiation to serve a human-readable HTML representation to casual users.
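
As a sketch of the kind of content negotiation we mean, here is a minimal hypothetical Flask handler; the route, URI scheme and helper functions are placeholders for illustration, not our actual implementation.

```python
# A minimal content-negotiation sketch using Flask (a hypothetical stand-in,
# not MusicNet's actual server code).
from flask import Flask, Response, request

app = Flask(__name__)

def load_rdf_xml(composer_id: str) -> str:
    # Placeholder: would load the stored RDF/XML description of the composer
    return '<?xml version="1.0"?><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'

def render_html(composer_id: str) -> str:
    # Placeholder: would render a human-readable page for the composer
    return f"<html><body><h1>Composer {composer_id}</h1></body></html>"

@app.route("/composer/<composer_id>")
def composer(composer_id: str) -> Response:
    # Pick the best representation the client's Accept header allows
    best = request.accept_mimetypes.best_match(
        ["application/rdf+xml", "text/html"], default="text/html"
    )
    if best == "application/rdf+xml":
        # Machine-readable RDF/XML for Linked Data clients
        return Response(load_rdf_xml(composer_id), mimetype="application/rdf+xml")
    # Human-readable HTML for casual users
    return Response(render_html(composer_id), mimetype="text/html")
```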

We also intend to investigate the Linked Data API that was announced at the Second London Linked Data Meetup and has been developed by Dave Reynolds, Jeni Tennison and Leigh Dodds. The LD API will allow us to also provide our dataset in formats such as JSON and Turtle using a RESTful querying API, which is currently the protocol of choice for mashup/web 2.0 developers.
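
From the client side, requesting alternative serialisations would then be as simple as setting an Accept header (the URI below is a hypothetical placeholder):

```python
# Client-side sketch: asking the same (hypothetical) resource for different
# serialisations via the Accept header.
import requests

uri = "http://musicnet.example.org/composer/schubert-franz"

turtle = requests.get(uri, headers={"Accept": "text/turtle"})
as_json = requests.get(uri, headers={"Accept": "application/json"})

print(turtle.headers.get("Content-Type"))
print(as_json.headers.get("Content-Type"))
```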

I would like to see a similar infrastructure in place on a centrally hosted data.ac.uk.

 
 

Data, URIs & Permanence

12 Jul

There is a strong drive in the UK at the moment to turn public sector data into semantically marked-up and globally accessible resources (see OpenPSI for examples of use). There is also a heavy push from academic funders to make the outputs of research projects available in a similar format, like the JISCexpo call that this project is funded under.

Whilst there seems to be a lot of focus on the creation of URI schemes (Jeni Tennison, data.gov.uk), there doesn’t seem to be the same consideration given to the permanence of the resources created. Typically, a project concerned with the creation of Linked Data will be funded for a fixed period of time. During this time the funding pays for staff to conduct the work required, which ultimately results in the production of some Linked Data. Once the project has finished and the funding has stopped, what happens to the data that is produced?

One of the primary requirements of useful Linked Data is that the URIs created exist in perpetuity, so that future data sets can be linked to them. Assuming a project is funded for a year, and that there is a commitment to locally host the data for a further year, what happens after this period has elapsed?

What we need is provision for UK academic data to be hosted on the JANET network under a suitable .ac.uk domain. For example the data outputs of this project could be hosted on http://musicnet.ac.uk, ensuring that:

  1. The URIs are Institution independent
  2. There is no ongoing administration cost for renewing a project domain name (.ac.uk domains are a one-off payment)



The current rules make it hard for short-term projects to acquire an academic .ac.uk domain, as a project must be funded for a minimum of two years. It is important that there be a process to decide which projects should and shouldn’t get a domain name, but as the way we use the web changes, with the focus shifting more to semantically marked-up data rather than just human-consumed HTML, the academic research community needs to discuss and rethink the metric on which this decision is made.

This issue of persistent URI hosting is a new and increasingly important problem for the emerging Semantic Web. If we started to assess projects based on their impact over time, as opposed to just their funding duration, we might encourage the creation of more short-term projects that expose Linked Data, which can only be a good thing for the community at large!

There is also the issue of hosting: who will actually provide the server space where the data is to reside? However, this is a secondary issue, and one that is easier to discuss once a system is in place for maintaining the actual URIs.