Posts Tagged ‘jiscLMS’

Add MusicNet data to COPAC with ‘Composed’ bookmarklet

03 Oct

Here’s a great example of how the MusicNet data can be used to enhance existing sites. The ‘Composed’ bookmarklet decorates an existing COPAC composer record with all the extra information that MusicNet contains about that person.

Head on over to this blog post for more details. Incidentally this was created as an entry to the UK Discovery competition we blogged about earlier in the year.

So, who’s going to turn this into a GreaseMonkey script so that the bookmarklet isn’t needed?


UK Discovery Developer Competition features the MusicNet dataset

11 Jul

The MusicNet dataset has been included as part of the UK Discovery global developer competition. The rules of the competition are simple: build an app or tool that makes use of at least one of the 10 featured datasets.

UK Discovery is working with libraries, archives and museums to open up data about their resources for free re-use and aggregation. DevCSI is working with developers in the education sector, many of whom will have innovative ideas about how to exploit this open data in new applications.

This Developer Competition runs throughout July 2011. It starts on Monday 4 July – Independence Day, a good day for liberating data – and closes on Monday 1 August. It’s open to anyone anywhere in the world.

For more information about the competition see http://discovery.ac.uk/developers/competition/. Prizes are available for the best entries; the competition closes on Monday 1 August 2011.


Tweets from Music Linked Data Workshop (#MLDW)

24 May

Here are the archived tweets from the Music Linked Data Workshop we held at JISC London earlier this month. Slides from the event can be found here.


Progress Update

10 Mar

It’s time for a short update on how the project is progressing. We’ve had an increasingly feature-rich prototype of our Codex available on our project website since January, and we’ve been working hard to improve it. If you haven’t already, head on over to http://musicnet.mspace.fm/codex and search for a composer.

What have we added since January?

Content Negotiation

One of the most important features we’ve added since January is content negotiation. This enables our Codex to serve up the most appropriate content depending on the ‘Accept’ header received in the HTTP request. For a more detailed write-up see Dan’s blog post on the MusicNet URI Scheme.

A simple example would be:

Franz Schubert’s URI is: http://musicnet.mspace.fm/person/7ca5e11353f11c7d625d9aabb27a6174

If we request this URI from a regular web browser, it dereferences to the HTML content at: http://musicnet.mspace.fm/person/7ca5e11353f11c7d625d9aabb27a6174.html

However, if we request it from a semantic web browser, it dereferences to the RDF content at: http://musicnet.mspace.fm/person/7ca5e11353f11c7d625d9aabb27a6174.rdf
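To illustrate, here’s a minimal Python sketch of that negotiation (assuming the server honours the Accept header and redirects as described above):

```python
# Minimal sketch: fetch both representations of a MusicNet URI via
# HTTP content negotiation. Assumes the server redirects to the .html
# or .rdf document depending on the Accept header.
import urllib.request

uri = "http://musicnet.mspace.fm/person/7ca5e11353f11c7d625d9aabb27a6174"

for accept in ("text/html", "application/rdf+xml"):
    req = urllib.request.Request(uri, headers={"Accept": accept})
    with urllib.request.urlopen(req) as resp:
        # geturl() reports the final URL after any redirects.
        print(accept, "->", resp.geturl())
```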

Data Enrichment

We have also been working hard to leverage the data we’ve aligned over the last year to enrich the information provided by our various data partners. Last year we met with the LinkedBrainz team, who provided us with a small set of composer data from MusicBrainz for us to align against. This has allowed us to draw additional information from other open data sources such as the BBC, Wikipedia/DBpedia, IMDb and even the New York Times, providing a more complete representation of the data available about a composer.

This data is available in both the RDF and the HTML representation of the Codex.

e.g. Schubert, Franz (HTML | RDF)

Alignment Progress

Alignment is moving along well; we’re currently at 89% complete.

What is left to do?

One of the discussions the MusicNet team has been involved in since the start of the project concerns data.ac.uk and the need for in-perpetuity hosting of URIs minted by JISC projects.

We’re currently in discussions to be one of the first projects able to make use of this domain, and hope that by the end of the project we’ll be able to move our Codex and URIs over to a suitable domain such as musicnet.data.ac.uk. This will ensure that the data we’ve exposed remains available after the project’s end.

MusicNet Workshop

We’re also hosting a small workshop on 12 May at JISC HQ to try and expose more people to the potential of the MusicNet URIs. The workshop will also look more broadly at the current Music & Linked Data landscape and should cater to a broad audience. It’s filling up very quickly, so if you’re interested and haven’t yet made contact please do so soon.

For more details see our announcement.


MusicNet & LinkedBrainz Meetup

25 Oct

Overview

Last Friday the MusicNet team headed to QMUL to meet with Kurt Jacobson & Simon Dixon from the LinkedBrainz project. LinkedBrainz is also funded by the JISC Expose (#jiscexpo) programme and is working, in conjunction with MusicBrainz, to produce an official Linked Data mapping for the MusicBrainz database. You can follow their progress on the project blog at http://linkedbrainz.c4dmpresents.org/.

What does a collaboration look like?

As well as learning a bit more about each other’s projects, we were able to look at a few ways in which we might collaborate over the coming months. We also came away with a data export from the most recent MusicBrainz database for all Classical musicians. This will essentially allow us to link our exposed composer URIs directly to the MusicBrainz (or LinkedBrainz) equivalents, which will greatly increase the utility of our URIs, especially as organisations such as the BBC are already using the MusicBrainz IDs.

Adding “same-as” links to LinkedBrainz is only one side of the solution; ideally we would convince the MusicBrainz community to provide the reverse links as well. This is likely to be a longer-term outcome, and one we should approach once the URI sustainability issue has been resolved (data.ac.uk?).
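For the curious, this is roughly what a “same-as” link looks like in RDF; a minimal rdflib sketch (the MusicBrainz ID below is a dummy placeholder, not a real MBID):

```python
# Sketch: assert that a MusicNet composer URI and a MusicBrainz artist
# URI identify the same person. The MBID here is a placeholder; the
# real mappings come out of the alignment work.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()
musicnet = URIRef("http://musicnet.mspace.fm/person/7ca5e11353f11c7d625d9aabb27a6174")
musicbrainz = URIRef("http://musicbrainz.org/artist/00000000-0000-0000-0000-000000000000")

g.add((musicnet, OWL.sameAs, musicbrainz))
print(g.serialize(format="turtle"))
```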

How will we align our URIs to LinkedBrainz?

We’ll use our custom-built Alignment Tool! Over the last few months we’ve spent quite a while engineering the tool and making sure it’s as re-usable as possible; we plan to add the LinkedBrainz data as though it were just another partner’s catalog. This means that once our musicology expert has performed the alignment we’ll not only know the overlaps between our partners’ catalogs, but also how they map to MusicBrainz and, by proxy, to Wikipedia, the BBC, etc.


Alignment Tool (Beta Release 1)

30 Sep

There has been quite a bit of interest in the tool we’ve been developing to solve the (mis-)alignment of data across our multiple catalogs.

Today we’re pleased to be able to make a release available for download, albeit in an unsupported beta form!

Download Alignment Tool (beta 1) (Firefox 3.5+ or WebKit only)

Whilst we can’t make the data we’re working on available for download, we’ve created an example data set using names from the ECS EPrints repository. The use-case here is that you have multiple name representations for specific individuals and you’d like to find the matches. Once installed, the alignment tool will list all ‘Authors’ and, when requested, will present basic information about the papers/deposits on which the author string was found. For justification of why the additional metadata/context is important, be sure to read the previous post about the development process for the tool.

If you’d like to get the latest source-code for the tool as it evolves/improves you can check it out from our SVN repository which is hosted on the project’s Google Code page.

Finally, if you’d like to see what the Alignment Tool should look like once you’ve got it installed, take a look at this screencast produced by our resident musicologist David Bretherton for his recent presentation at AHM2010.


Assisted Manual Data Alignment

08 Sep

One of the key milestones for the MusicNet project is to match up similar records in a number of source catalogues. Specifically we are looking for records that represent the same musical composer in multiple data sets.

As discussed in the report for our first work package (WP1 Data Triage), the reason this task needs to be undertaken is twofold:

  1. Different naming conventions (e.g. Bach, Johann Sebastian or Bach, J. S.)
  2. User input errors (e.g. Bach, Johan Sebastien)

Due to the scale of the datasets, the obvious approach is to try to minimise the work by computationally looking for matches. This is the approach we attempted when trying to solve this same problem for the musicSpace project (the precursor to MusicNet).

Alignment Prototype 1: Alignment API

In musicSpace we used Alignment API to compare composers’ names as strings. Alignment API calculates the similarity of two strings and produces a metric measuring how closely they match. It also uses WordNet to enable word stemming, synonym support and so on.
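Alignment API itself is a Java framework, but purely to illustrate the idea of a string-similarity metric, here’s a rough Python equivalent using the standard library’s difflib:

```python
# Rough illustration of string-similarity scoring (not Alignment API
# itself). ratio() returns a score between 0.0 and 1.0.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

print(similarity("Bach, Johann Sebastian", "Bach, J. S."))            # naming convention
print(similarity("Bach, Johann Sebastian", "Bach, Johan Sebastien"))  # input error
```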

We had interns create a system that would produce Google Docs spreadsheets as outputs of data-set comparisons. Each spreadsheet consisted of a column of composer names from one data set, next to a column of composer names from the data set being compared, followed by the similarity metric.

The spreadsheets were produced to allow a musicologist with domain knowledge to validate the results. On investigation it quickly became clear that this approach had a number of flaws:

Problem 1: Similarity threshold

The threshold at which to consider a match significant was very hard to define, and false positives appeared with very high scores. For example, “Smith, John Christopher I” was considered to be the same as “Smith, John Christopher II” purely because there is only one character of difference.
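This failure mode is easy to reproduce with any string-similarity metric; for instance, difflib scores the pair at roughly 0.98:

```python
from difflib import SequenceMatcher

a = "Smith, John Christopher I"
b = "Smith, John Christopher II"
# Only one character differs, so the score is ~0.98 even though these
# are two different composers.
print(SequenceMatcher(None, a, b).ratio())
```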

Problem 2: Proper-nouns

Composer names are proper nouns and as such don’t work well with WordNet’s spelling correction. For example, a “Bell, Andrew” might not be a spelling mistake of “Tell, Andrew”.

Problem 3: No context

The spreadsheet didn’t offer our musicologist any context for the names. As Alignment API just compares strings, we weren’t able to provide additional metadata such as the data-set source or a link to find more information about the specific composer.

The lack of this additional context proved to be the most crucial flaw; without it there was no way our expert could say for sure that two strings were indeed a match.

Alignment Prototype 2: Custom Tool

The first attempt was unsuccessful, but having learned from its failings we decided to have another go. A completely automated solution was out of the question, so we turned our efforts to a custom tool that could quickly enable our expert to find all the relevant matches across data sets.

From our work on the mSpace faceted browser we’ve learnt a lot about data displayed in lists. We’ve learnt that, unless filtered, “ugly” data tends to float to the top; by “ugly” we refer to data which isn’t generally alphanumeric.

We also know, from our work in creating an mSpace interface on EPrints, that multiple instances of the same name can normally be found very close to one another. This is largely because most sources tend to list the surname first, with the ambiguity coming from how they represent the forenames (e.g. “Bach, J. S.” versus “Bach, Johann Sebastian”).

With these factors in mind, we came up with the following UI for our second attempt at an alignment solution:

Ungrouped/Grouped Lists

The left-hand side contains two lists: one for all items that have yet to be “grouped”, and a second showing all grouped items. Each list has an A-Z control for quick indexing into the data, and can be fully controlled from the keyboard. By selecting multiple items from the ungrouped list, the user can (at the touch of a single keystroke) create a group from the selection.

Verified Toggle

The verified toggle lets the user see either expert-verified or non-expert-verified groups. This is included because we provide a basic string-matching algorithm in the tool to find likely matches. Unlike the first prototype, the string matching here is far more rudimentary and should produce fewer false positives (at the cost of fewer matches overall).

The algorithm produces a hash for each string it knows about by first converting all characters with diacritics to their equivalent plain-alpha character. It then removes all non-alpha characters from the string and converts the result to lowercase. This proved useful for our datasets, as some data sets suffixed a composer’s name with their birth and death dates, e.g. Schubert, Franz (1808-1878.). In these cases the algorithm produces exactly the same hash as for the non-suffixed string, yielding a likely candidate for a group.
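A minimal sketch of that hashing scheme (the tool’s actual implementation may differ):

```python
import unicodedata

def name_hash(name: str) -> str:
    # Decompose accented characters, then drop the combining marks,
    # leaving plain-alpha equivalents.
    ascii_form = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore")
                  .decode("ascii"))
    # Keep alphabetic characters only, lowercased.
    return "".join(c for c in ascii_form if c.isalpha()).lower()

# The date-suffixed and plain forms hash identically, so they surface
# as a candidate grouping for expert verification:
assert name_hash("Schubert, Franz (1808-1878.)") == name_hash("Schubert, Franz")
```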

Expert verification is still required, but this algorithm helped present likely groupings to minimise the expert’s workload.

Context/Extra Metadata

The right-hand side is where the user can see extended metadata about one or more items. This view displays all the relevant information we know about each composer, and can be used by the expert to decide whether multiple instances are in fact the same person.

Metadata Match

Metadata Match is an additional visual prompt to help the expert decide whether composers are the same. If multiple composers are selected and their birth or death dates match, the dates are highlighted in grey. If the user hovers over a date with the mouse, all matching dates are highlighted.
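A sketch of the matching logic behind that highlight (the record fields here are hypothetical, not the tool’s actual data model):

```python
from collections import Counter

def dates_to_highlight(selected):
    # `selected` is a list of composer records (dicts with hypothetical
    # "birth"/"death" fields). A date is highlighted when it appears on
    # more than one selected record.
    counts = Counter()
    for record in selected:
        for field in ("birth", "death"):
            if record.get(field):
                counts[record[field]] += 1
    return {date for date, n in counts.items() if n > 1}

print(dates_to_highlight([
    {"name": "Schubert, Franz", "birth": "1797", "death": "1828"},
    {"name": "Schubert, Franz Peter", "birth": "1797", "death": "1828"},
]))  # -> {'1797', '1828'}
```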

Want to use the tool on your data?

The tool is currently in active development; when it reaches a stable state we will release an official download. It’s relatively generic and should be useful to any third parties looking to perform a similar alignment/clean-up of their own data. If, however, you’re interested in having a play straight away, head on over to our Google Code site and grab the latest SVN.

Note: we can’t make the complete set of our music data available, as we have access to it under a copyright licence, but in the near future we will provide import scripts and some sample data to serve as an example.


data.ac.uk revisited

18 Aug

Following on from Dan’s recent post about data.ac.uk, there is a growing consensus that this is a necessary step to facilitate the growth of usable linked data in Higher Education. Next week there will be a public briefing paper from JISC on the topic. We expect that it will announce, among other things, that the data.ac.uk domain name has been ring-fenced to provide a data.gov.uk-esque repository for HE.

The tools to visualise this type of data have been around for a long time, but as yet there has been no long-term strategy for maintaining the actual data that drives them.

Why data.ac.uk?

So why do we need a centralised datastore for HE data? Why can’t we just host it locally on data.southampton.ac.uk?

In our original post Dan highlighted our stance on the need for institutional agnosticism for the data we are creating for this project:

We also found that it would be best to not use a subdomain of our school (such as musicnet.soton.ac.uk), since this would be seen as partisan to the school/university and is likely to get less uptake than something at a higher level (such as musicnet.ac.uk).

It doesn’t really make sense to expose the MusicNet data (a set of canonical names and datapoints for classical music composers) at a URI which contains the institution name (in our case, the University of Southampton). The data has global significance; it just so happens that we are the ones tasked with exposing it. Our data is a perfect fit for a politically neutral data.ac.uk implementation.

In his post Time for data.ac.uk? Or a local data.open.ac.uk? Tony Hirst raises some interesting questions about where we should be storing data:

Another possible source of data in a raw form is from the data.gov.uk education datastore (an example can be found via here), which makes me wonder about the extent to which a data.ac.uk website might just be an HE/FE view over that wider datastore? (Related: @kitwallace on University data.) And then maybe, hence: would data.*.ac.uk be a view over data.ac.uk for a particular institution? Or *.sch.ac.uk a view over a data.sch.ac.uk view over the full education datastore?

These questions are yet to be answered, but he does make an interesting point, drawing on a presentation given by Mike Nolan at mashlib2010. At present, much of the information that could be (re-)exposed at minimal cost is the syndicated data that most institutions already make available: RSS, CalDAV, etc. I would argue, however, that these institution-specific datasets, being themselves already intrinsically politically aligned, would be a better fit for localised hosting at a data.southampton.ac.uk-type URI. The URI then implies ownership and some context/authority for the data held there.

So perhaps there is room for both data.ac.uk AND data.southampton.ac.uk?

How to make data.ac.uk work?

The only way we can make any datastore work is if the data is available in formats that people are able to make use of! For MusicNet we intend to host RDF/XML on our local server (until a data.ac.uk alternative becomes a reality), using standard content negotiation to allow a human-readable HTML representation to be presented to casual users.

We also intend to investigate the Linked Data API that was announced at the Second London Linked Data Meetup and has been developed by Dave Reynolds, Jeni Tennison and Leigh Dodds. The LD API will allow us to also provide our dataset in formats such as JSON and Turtle via a RESTful querying API, currently the protocol of choice for mashup/Web 2.0 developers.

I would like to see a similar infrastructure in place on a centrally hosted data.ac.uk.


data.ac.uk proposal

20 Jul

Since our previous post on a domain name for our project, it has been suggested that a possible scalable solution is a central data repository for the UK academic community (data.ac.uk), in a similar style to data.gov.uk, which is a repository of open public government data.

To recap: a commercial domain carries a yearly fee (typically 10-20 GBP), which is impractical to maintain over many years for a typical short-term academic project, so a purchased domain would eventually lapse. While this may not be crucial for a project homepage, it is crucial for Linked Data, where the URLs of the data must exist in perpetuity, because the data themselves use these URIs. Since our project’s output is Linked Data intended to be used by anyone who publishes data that may include Classical Music (we have partners, including BBC /music, who are interested in using our URLs), the domain must exist in perpetuity.

We also found that it would be best to not use a subdomain of our school (such as musicnet.soton.ac.uk), since this would be seen as partisan to the school/university and is likely to get less uptake than something at a higher level (such as musicnet.ac.uk).

Current JANET/UKERNA policy does not allow us to have a top-level academic domain (musicnet.ac.uk), because these are limited to projects funded centrally and for at least two years. This policy makes sense, as it lowers the overhead of registering domains for thousands of small projects.

Thus, a floated suggestion has been for JISC to fund/host a “data.ac.uk” domain and/or repository to provide a linked data domain and/or web hosting solution for academic data publishers (such as MusicNet).

There are two key points that I would like to make:

1) It will provide a lower technical and financial barrier to entry for people who have some RDF to publish.

If a project has some RDF to publish right now, they first have to figure out how to publish it correctly as linked data — few academic projects have managed this, and most probably don’t realise that they could be doing better. By providing a central service that can manage and host data properly, there is also the potential to add extra features. This is analogous to a cloud-hosted blog service, such as tumblr.com or blogspot.com, where security patches and features are added for free by the hosts, without the publishers doing anything. For RDF I can foresee better human-readable access, backlink features to external RDF, etc. being added over time, even to legacy RDF being hosted for projects that have long since finished. Similarly, the hosting and maintenance of the servers becomes the responsibility of the data.ac.uk team, rather than of some short-term project. The investment in creating the data is then protected, since the central repository holds it.

2) It must not limit what technically able projects can do.

In our case, we do not require a hosting solution, because we already know how to host data, and we have already negotiated in-perpetuity hosting of our data with the central School of Electronics and Computer Science administration team (this was a key part of our project bid).

Furthermore, we wish to host an alignment service that enables musicologists to make edits to data after it has been published, so that it is kept up-to-date, and any mistakes can be fixed over time. Other projects may have different needs.

Thus it is important for projects like ours to be able to apply for just a subdomain of data.ac.uk (musicnet.data.ac.uk), without hosting, so that we can run our own hosting and bespoke services.


WP1: Data Triage Report

20 Jul

We recently conducted an assessment of the metadata available in our data partners’ digital catalogs, as part of our stated Work Package 1 deliverable. The aim of this assessment was to ascertain which metadata fields were good candidates to be exposed in our Linked Data.

Here is the introduction to the report:

This document outlines the metadata we aim to expose as part of the MusicNet Project. Decisions on what metadata to include are based on the following factors:

  1. Musicologists’ needs (recommended by David Bretherton)
  2. Technical feasibility (recommended by Joe Lambert)
  3. Licensing Restrictions

Our remit was to expose only information that is available in the public domain, but to ensure enough is made available to allow a composer to be unambiguously identified.

You can download the full report from our Google Code Repository:
WP1: Data Triage Report v1.0 (67KB)