
OpenBiblio workshop report

- May 9, 2011 in Bibliographic, BibServer, communityBenefits, Data, event, inf11, jisc, JISC OpenBib, jiscEXPO, jiscLMS, jiscopenbib, OKFN Openbiblio, progress, progressPosts, rdf, Semantic Web, WIN

#openbiblio #jiscopenbib The OpenBiblio workshop took place on 6th May 2011 at the London Knowledge Lab. Those taking part were:


  • Peter Murray-Rust (Open Bibliography project, University of Cambridge, IUCr)
  • Mark MacGillivray (Open Bibliography project, University of Edinburgh, OKF, Cottage Labs)
  • William Waites (Open Bibliography project, University of Edinburgh, OKF)
  • Ben O’Steen (Open Bibliography project, Cottage Labs)
  • Alex Dutton (Open Citation project, University of Oxford)
  • Owen Stephens (Open Bibliographic Data guide project, Open University)
  • Neil Wilson (British Library)
  • Richard Jones (Cottage Labs)
  • David Flanders (JISC)
  • Jim Pitman (Bibserver project, UCB) (remote)
  • Adrian Pohl (OKF bibliographic working group) (remote)
During the workshop we covered some key areas where we have seen some success already in the project, and discussed how we could continue further.

Open bibliographic data formats

In order to ensure successful sharing of bibliographic data, we require agreement on a suitable yet simple format via which to disseminate records. Whilst representing linked data is valuable, it also adds complexity; simplicity, however, is key for ensuring uptake and for enabling easy front end system development. Whilst data is available as RDF/XML, JSON is now a very popular format for data transfer, particularly where front end systems are concerned. We considered various JSON linked data formats, and have implemented two for further evaluation. In order to make sure this development work is as widely applicable as possible, we wrote parsers and serialisers for JSON-LD and RDF/JSON as plugins for the popular RDFlib.
The RDF/JSON format is, of course, RDF; therefore, it requires no further change to enable it to handle our data, and our RDF/JSON parser and serialiser are already complete. However, it is not very JSON-like, as data takes the subject(predicate(object)) form rather than the general key:value form. This is where JSON-LD can improve the situation – it provides for listing information in a more key:value-like format, making it easier to use for front end developers who are not interested in the RDF relations. But this leads to additional complexity in the spec and parsing requirements, so we have some further work to complete:
  • remove angle brackets from blank nodes
  • use type coercion to move types out of main code
  • use language coercion to omit languages
Our code is currently available in our repository, and we will request that our parsers and serialisers get added to RDFlib or to RDFextras once they are complete (they are still in development at present).
To further assist in representing bibliographic information in JSON, we also intend to implement BibJSON within JSON-LD; this should provide linked data functionality where it is needed via JSON-LD support, whilst also enabling simpler representation of bibliographic data via key:value pairs where that is all that is required. By making these options available to our users, we will be able to gauge the most popular representation format.
Regardless of the format used, a critical consideration is that of stable references to data. Without these, maintaining datasets will be very hard. To date, the British Library data, for example, does not have suitable identifiers. However, the BL are moving forward with applying identifiers and will be issuing a new version of their dataset soon, which we will take as a new starting point. We have provided a list of records that we have identified as non-unique, and in turn the BL will share the tools they use to manage and convert data where possible, to enable better community collaboration.
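To make the difference between the subject(predicate(object)) form and the key:value form concrete, here is a minimal sketch in plain Python. It is not our RDFlib plugin code; the example.org URIs, the sample triples and both function names are assumptions made up for illustration.

```python
import json
from collections import defaultdict

# Hypothetical triples: (subject, predicate, object value, object type).
TRIPLES = [
    ("http://example.org/book/1", "http://purl.org/dc/terms/title",
     "An Example Book", "literal"),
    ("http://example.org/book/1", "http://purl.org/dc/terms/creator",
     "http://example.org/person/1", "uri"),
]


def to_rdf_json(triples):
    """Nest triples as subject -> predicate -> [object descriptions]."""
    doc = defaultdict(lambda: defaultdict(list))
    for s, p, o, o_type in triples:
        doc[s][p].append({"value": o, "type": o_type})
    return {s: dict(preds) for s, preds in doc.items()}


def to_key_value(triples, subject):
    """Flatten one subject into a plain record keyed by the last URI segment.

    This loses information: repeated predicates overwrite each other and
    the object type is dropped, which is the price of the simpler form.
    """
    record = {"id": subject}
    for s, p, o, _ in triples:
        if s == subject:
            key = p.rstrip("/#").rsplit("/", 1)[-1].rsplit("#", 1)[-1]
            record[key] = o
    return record


rdf_json = to_rdf_json(TRIPLES)
flat = to_key_value(TRIPLES, "http://example.org/book/1")
print(json.dumps(rdf_json, indent=2))
print(json.dumps(flat, indent=2))
```

The first structure keeps everything RDF needs but makes a front end developer dig through two levels of nesting to reach a title; the second is trivial to consume but cannot be round-tripped to RDF without extra context.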

Getting more open datasets

We are building on the success of the BL data release by continuing work on our CUL and IUCr data, and also by getting more datasets. The latest is the Medline dataset; there were some initial issues with properly identifying this dataset, so we have a previous blog post and a link to further information, the Medline DTD and specifications of the PubMed data elements to help.

The Medline dataset

We are very excited to have the Medline dataset; we are currently working on cleaning so that we can provide access to all the non-copyrightable material it contains, which should represent a listing of about 98% of all articles published in PubMed. The Medline dataset comes as a package of approximately 653 XML files, chronologically listing records in terms of the date the record was created. This also means that further updates will be trackable as they will append to the current dataset. We have found that most records contain useful non-copyrightable bibliographic metadata such as author, title, journal, PubMed record ID, and that some contain further metadata such as citations, which we will remove. Once this is done, and we have checked that there are unique IDs (e.g. that the PubMed IDs are unique) we will make the raw CC0 collection available, then attempt to get it into our Bibliographica instance. We will then also be able to generate visualisations on our total dataset, which we hope will be approaching 30 million records by the end of the JISC Open Bibliography project.
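As a rough sketch of the cleaning step described above (keep only the bibliographic fields, then check that the IDs are unique), something like the following could be run over each file. The element names follow our reading of the Medline DTD and should be checked against it; the sample records, the field whitelist and the function names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny made-up sample in the general shape of a Medline file.
SAMPLE = """<MedlineCitationSet>
  <MedlineCitation>
    <PMID>1001</PMID>
    <Article>
      <Journal><Title>Journal of Examples</Title></Journal>
      <ArticleTitle>A first example article</ArticleTitle>
      <AuthorList><Author><LastName>Smith</LastName></Author></AuthorList>
    </Article>
  </MedlineCitation>
  <MedlineCitation>
    <PMID>1002</PMID>
    <Article>
      <Journal><Title>Journal of Examples</Title></Journal>
      <ArticleTitle>A second example article</ArticleTitle>
      <AuthorList><Author><LastName>Jones</LastName></Author></AuthorList>
    </Article>
  </MedlineCitation>
  <MedlineCitation>
    <PMID>1001</PMID>
    <Article>
      <Journal><Title>Journal of Examples</Title></Journal>
      <ArticleTitle>A first example article</ArticleTitle>
      <AuthorList><Author><LastName>Smith</LastName></Author></AuthorList>
    </Article>
  </MedlineCitation>
</MedlineCitationSet>"""


def extract_records(xml_text):
    """Keep only whitelisted fields; anything else in the record is dropped."""
    root = ET.fromstring(xml_text)
    records = []
    for citation in root.iter("MedlineCitation"):
        records.append({
            "pmid": citation.findtext("PMID"),
            "title": citation.findtext("Article/ArticleTitle"),
            "journal": citation.findtext("Article/Journal/Title"),
            "authors": [a.findtext("LastName") for a in citation.iter("Author")],
        })
    return records


def duplicate_ids(records):
    """Return the PubMed IDs that occur more than once."""
    counts = Counter(r["pmid"] for r in records)
    return sorted(pmid for pmid, n in counts.items() if n > 1)


records = extract_records(SAMPLE)
print(len(records), "records; duplicate IDs:", duplicate_ids(records))
```

Working from a whitelist rather than a blacklist means that any field we have not explicitly judged to be non-copyrightable never reaches the released collection.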

Displaying bibliographic records

Whilst Bibliographica allows for display of individual bibliographic records and enables building collections of such records, it does not yet provide a means of neatly displaying lists of bibliographic records. We have partnered with Jim Pitman of UC Berkeley to develop his BibServer to fit this requirement, and also to bring further functionality such as search and faceted browse. This also provides further development direction for the output of the project beyond the July end date of the JISC Open Bibliography project.

Searching bibliographic records

Given the collaboration between Bibliographica and BibServer on collection and display of bibliographic records, we are also considering ways to enable search across non-copyrightable bibliographic metadata relating to any published article. We believe this may be achievable by building a collection of DOIs with relevant metadata, and enabling crowdsourcing of updates and comments. This effort is separate from the main development of the projects, but it would make a very good addition both to the functionality of the developed software and to the community. It would also tie in with any future functionality that enables author identification and information retrieval, such as ORCID, and would allow us to build on the work done at sites such as BIBKN.

Disambiguation without deduplication

There have been a number of experiments recently highlighting the fact that a simple Lucene search index over datasets tends to give better matches than more complex methods of identifying duplicates. Ben O’Steen and Alex Dutton both provided examples of this, from their work with the Open Citation project. This is also supported by a recent paper from Jeff Bilder entitled “Disambiguation without Deduplication” (not publicly available).
The main point here is that instead of deduplicating objects we can simply do machine disambiguation and make sameAs-ness assertions between multiple objects; this would enable changes to still be applied to different versions of an object by disparate groups (e.g. where each group has a different spelling or identifier, perhaps, for some key part of the record) whilst still maintaining a relationship between the two objects. We could build on this sort of functionality by applying expertise from the library community if necessary, although deduplication/merging should only be contemplated if a new dataset is being formed which some agent is taking responsibility to curate. If not, it is better to just cluster the data by sameAs assertions, and keep track of who is making those assertions, to assess their reliability.
We suggest a concept for increasing collaboration on this sort of work – a ReCaptcha of identities. Upon login, perhaps to Bibliographica or another relevant system, a user could be presented with two questions, one of which we know the answer to, and the other being a request to match identical objects. This, in combination with decent open source software tools enabling bibliographic data management (building on tools such as Google Refine and Needlebase), would allow for simple verifiable disambiguation across large datasets.
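The idea of clustering by sameAs assertions while keeping track of who made them can be sketched with a small union-find structure. This is only an illustration of the principle, not code from any of the projects mentioned; the class name, the record URIs and the asserter names are all made up.

```python
class SameAsIndex:
    """Cluster URIs by sameAs assertions without merging the records."""

    def __init__(self):
        self.parent = {}       # union-find parent pointers
        self.assertions = []   # (uri_a, uri_b, asserted_by), kept as provenance

    def _find(self, uri):
        self.parent.setdefault(uri, uri)
        while self.parent[uri] != uri:
            # Path halving keeps later lookups short.
            self.parent[uri] = self.parent[self.parent[uri]]
            uri = self.parent[uri]
        return uri

    def assert_same(self, a, b, asserted_by):
        """Record who said a and b are the same, then join their clusters."""
        self.assertions.append((a, b, asserted_by))
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def cluster(self, uri):
        """All URIs currently clustered with the given one."""
        root = self._find(uri)
        return sorted(u for u in list(self.parent) if self._find(u) == root)

    def asserted_by(self, uri):
        """Everyone whose assertions contributed to this URI's cluster."""
        members = set(self.cluster(uri))
        return sorted({who for a, _, who in self.assertions if a in members})


index = SameAsIndex()
index.assert_same("info:record/a", "info:record/b", "lucene-matcher")
index.assert_same("info:record/b", "info:record/c", "alice")
index.assert_same("info:record/x", "info:record/y", "bob")
print(index.cluster("info:record/c"))
print(index.asserted_by("info:record/a"))
```

Because the original records are never merged, each group can keep editing its own version, and a cluster can be re-assessed or rebuilt if an asserter turns out to be unreliable.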

Sustaining open bibliographic data

Having had success in getting open bibliographic datasets and prototyping their availability, we must consider how to maintain long term open access. There are three key issues:

Continuing community engagement

We must continue to work with the community, and to provide explanatory information to those needing to make decisions about bibliographic data, such as the OpenBiblio Principles and the Open Bibliographic Data guide. We must also ensure we improve resource discovery by supporting the requirement for generating collections and searching content. Additionally, quality bibliographic data should be hosted at some key sites – there are a variety of options such as Freebase, CKAN and Bibliographica – but we must also ensure that community members can be crowdsourced both for managing records within these central options and also for providing access to smaller distributed nodes, where data can be owned and maintained at the local level whilst being discoverable globally.

Maintaining datasets

Dataset maintenance is critical to ongoing success – stale data is of little use to people, and disregard for content maintenance will put off new users. We must co-ordinate with source providers such as the BL by accepting changesets from them and incorporating those changes into other versions. This is already possible with the Medline data, for example, and will very soon be the case with BL updates too. We should advocate for this method of dataset updates during any future open data negotiations. This will allow us to keep our datasets fresh and relevant, and to properly represent growing datasets. We must continue to promote open access to non-copyrightable datasets, and ensure that there is a location for open data providers to easily make their raw datasets available – such as CKAN. We will ensure that all the software we have developed during the course of the project – and in future – will remain open source and publicly available, so that it will be possible for anyone to perform the transforms and services that we can perform.

Community involvement with dataset maintenance

We should support community members that wish to take responsibility for overseeing updating of datasets. This is critical for long term sustainability, but hard to find. These people need to be recruited and provided with simple tools which will empower them to easily maintain and share datasets they care about with a minimal time commitment. Thus we must make sure that our software and tools are not only open source, but usable by non-team members. We will work on developing tools such as ReCaptcha for disambiguation, and on building game / rank table functionality for those wishing to participate in entity disambiguation (in addition to machine disambiguation).
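A minimal sketch of how the ReCaptcha-style check could work: each user is shown a control pair whose answer we already know alongside a candidate pair we want matched, and the candidate vote is only recorded if the control was answered correctly. The function names, the unanimity rule and the threshold of three votes are assumptions for illustration, not a design we have settled on.

```python
from collections import Counter


def record_response(votes, user, control_answer, control_truth,
                    candidate_pair, candidate_answer):
    """Store the vote on the candidate pair only if the control was right."""
    if control_answer != control_truth:
        return False
    votes.setdefault(candidate_pair, []).append((user, candidate_answer))
    return True


def consensus(votes, candidate_pair, min_votes=3):
    """Return True/False once enough trusted votes all agree, else None."""
    answers = [answer for _, answer in votes.get(candidate_pair, [])]
    if len(answers) < min_votes:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count == len(answers) else None


votes = {}
pair = ("info:author/1", "info:author/2")
for user in ("alice", "bob", "carol"):
    record_response(votes, user, True, True, pair, True)
# This user fails the control question, so their vote is ignored.
ignored = record_response(votes, "mallory", False, True, pair, False)
print(consensus(votes, pair), ignored)
```

Keeping the user name with each vote also gives the raw material for the rank table mentioned above.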

Critical mass

We hope that by providing almost 30 million records to the community under CC0 license, and with the support of all the providers that made this possible, we will achieve a critical mass of data, and an exemplar for future open access to such data. This should provide the go-to list of such information, and inspire others to contribute and maintain. However, such community assistance will only continue for as long as there appears to be reasonable maintenance of the corpus and software we have already developed – if this slips into disrepair, community engagement is far less likely.

Maintaining services

The bibliographica service that we currently run already requires significant hardware to run. Once we add in Medline data, we will require very large indexes, requiring a great deal of RAM and fast disks. There is therefore a long term maintenance requirement implicit in running any such central service of open bibliographic data on this scale. We will present a case for ongoing funding requirements and seek sources for financial support both for technical maintenance and for ongoing software maintenance and community engagement.

Business cases

In order to ensure future engagement with groups and business entities, we must make clear examples of the benefits of open bibliographic data. We have already done some work on visualising the underlying data, which we will develop further for higher impact. We will identify key figures in the data that we can feed into such representations to act as exemplars. Additionally, we will continue to develop mashups using the datasets, to show the serendipitous benefit that increases exposure but is only possible with unambiguously open access to useful data.

Events and announcements

We will continue to promote our work and the efforts of our partners, and advocate further for open bibliography, by publicising our successes so far. We will co-ordinate this with JISC, BL, OKF and other interested groups, to ensure the impact of announcements by all groups is enhanced. We will present our work at further events throughout the year, such as attendance and sessions at OKCon, OR11 and other conferences, and by arranging further hackdays.

Follow-up to serialising RDF in JSON

- May 5, 2011 in BibServer, Data, inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, OKFN Openbiblio, ontology, progress, rdf, Semantic Web

Following on from Richard’s post yesterday, we now have a JSON-LD serialiser for RDFlib. This is still a work in progress, and there may be things that it is serialising incorrectly. So, please give us feedback on this, and tell us where we have misinterpreted the structure. Here you will find a sample JSON-LD output file, which was generated from this Bibliographica record. The particular area of concern is how the JSON-LD spec describes serialising disjoint graphs into JSON-LD (section 8.2). How does this differ from serialising joined graphs? We are presuming that our output file is an example of a joined graph, and that additional disjoint graphs would be added by appending additional @:[] sections.

Comparative Serialisation of RDF in JSON

- May 4, 2011 in BibServer, Data, inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, model, OKFN Openbiblio, ontology, outputs, progress, progressPosts, rdf, Semantic Web

This is a comparison of RDF-JSON and JSON-LD for serialising bibliographic RDF data. Given that we are also working with BibServer we have taken a BibJSON document as our source data for comparison. The objective was to both understand these two JSON serialisations of RDF and also to look at the BibJSON profile to see how it fits into such a framework. Due to limitations of the display of large plain-text code snippets on the site, we have placed the actual content in this text file which you should refer to as we go along. We used a BibJSON document, which comes from the examples on the BibJSON homepage. When converting this into the two RDF serialisations we invent a namespace
This namespace provisionally holds all predicates/keys that are used by BibJSON and are not immediately clearly available in another ontology. These terms should not under any circumstances be considered definitive or final, only indicative. Now consider the RDF-JSON serialisation. Some key things to note about this serialisation:
  • There is no explicit shortening of URIs for predicates into CURIEs; all URIs are instead presented in full.
  • The object of each predicate is a JSON object with up to 4 keys (value, type, datatype, lang). This means that it is not easy for the human eye to pick out the value of a particular predicate.
  • Of the two RDF serialisations, this is by far the more verbose
  • It is relatively difficult for a human to read and write
Compare this with the equivalent JSON-LD serialisation. Some things to note about this serialisation:
  • It has a clear treatment of namespaces
  • It may be slightly inaccurate, as there are some parts of its specification which are ambiguous – feedback welcome
  • The object values cannot be taken as the value of the predicate, as they may contain datatype and/or language information in them, or may be surrounded by angled brackets.
  • It is relatively easy for a human to read and write
Both serialisations are capable of representing the same data, although JSON-LD is far more terse and therefore easier to read and write. It is not, however, possible to reliably treat JSON-LD as a pure list of key-value pairs in non-RDF aware environments, as it includes RDF type and language semantics in the literal values of objects. RDF-JSON does not suffer from this issue within the object literals, but in return its notation is more complex. A serious shortcoming of RDF-JSON is the lack of explicit handling of CURIEs and namespaces, and it could benefit from adopting the conventions laid out in JSON-LD – this may bring the choice of which serialisation to use down to preference rather than any significant technical difference. Each of the formats also comfortably represents BibJSON, and with the extensive lists of predicates provided in that specification it would be straightforward enough to do a full and proper treatment of BibJSON through one of these routes.
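To illustrate what adopting a namespace convention could look like for RDF-JSON, here is a minimal sketch that shortens full predicate URIs into CURIEs using a prefix map. It is not part of either specification or of our serialisers; the prefix map, the function names and the sample document are assumptions for illustration.

```python
# Prefix map for the sketch: prefix -> namespace URI.
PREFIXES = {
    "dc": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def to_curie(uri, prefixes=PREFIXES):
    """Shorten a URI to prefix:local using the longest matching namespace."""
    best = None
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            if best is None or len(namespace) > len(best[1]):
                best = (prefix, namespace)
    if best is None:
        return uri  # no known namespace, leave the URI in full
    return best[0] + ":" + uri[len(best[1]):]


def compact_predicates(rdf_json, prefixes=PREFIXES):
    """Rewrite the predicate keys of an RDF-JSON style dict as CURIEs."""
    return {
        subject: {to_curie(p, prefixes): objects
                  for p, objects in predicates.items()}
        for subject, predicates in rdf_json.items()
    }


doc = {
    "http://example.org/book/1": {
        "http://purl.org/dc/terms/title": [
            {"value": "An Example Book", "type": "literal"}
        ],
        "http://example.org/vocab/shelfmark": [
            {"value": "X.1", "type": "literal"}
        ],
    }
}
compacted = compact_predicates(doc)
print(sorted(compacted["http://example.org/book/1"]))
```

The object descriptions are left untouched, so the type and language information that RDF-JSON keeps separate from the literal value is preserved; only the predicate keys become shorter.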

"Bundling" instances of author names together without using owl:sameas

- November 17, 2010 in inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, ORDF, progress, projectMethodology, rdf, Semantic Web

Bundling? It’s a verb I’ve taken from “Glaser, H., Millard, I., Jaffri, A., Lewy, T. and Dowling, B. (2008) On Coreference and The Semantic Web”, where the core idea is that you have a number of URIs that mean or reference the same real thing, and the technique they describe as bundling is to aggregate all those references together. The approach they describe is built on a sound basis in logic, and is related to (if not the same as) a congruence closure.
The notion of bundling I am using is not as rooted in mathematical logic, because I need to convey an assertion that one URI is meant to represent the same thing that another URI represents in a given context and for a given reason. This is a different assertion, if only subtly different, from the one ‘owl:sameas’ makes, but the difference is key for me.
It is best to think through an example of where I am using this – curating bibliographic records and linking authors together. It’s an obvious desire – given a book or article, to find all the other works by an author of that work. Technologically, with RDF this is a very simple proposition BUT the data needs to be there. This is the point where we come unstuck. We don’t really have the quality of data that firmly establishes that one author is the same as a number of others. String matching is not enough!
So, how do we clean up this data (converted to RDF) so that we can try to stitch together the authors and other entities in them? See this previous post on augmenting British Library metadata so that the authors, publishers and so on are externally reference-able once they are given unique URIs. This really is the key step. Any other work that can be done to make any of the data about the authors and so on more semantically reference-able will be a boon to the process of connecting the dots, as I have done for authors with birth and/or death dates.
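As a rough illustration of why dates help where string matching alone does not, the sketch below only proposes author URIs as candidates for the same bundle when a normalised name and the birth/death dates all agree. This is not the code I ran over the British Library data; the record layout, the URIs and the function names are assumptions for illustration.

```python
import unicodedata
from collections import defaultdict


def normalise_name(name):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.lower().replace(",", " ").replace(".", " ").split())


def candidate_groups(authors):
    """Group author URIs whose normalised name and dates all match.

    Authors with no dates at all are left out: a bare name match is
    exactly the kind of evidence that is not enough on its own.
    """
    groups = defaultdict(list)
    for author in authors:
        if author.get("birth") or author.get("death"):
            key = (normalise_name(author["name"]),
                   author.get("birth"), author.get("death"))
            groups[key].append(author["uri"])
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


# Illustrative records with hypothetical URIs.
AUTHORS = [
    {"uri": "info:author/1", "name": "Dickens, Charles", "birth": 1812, "death": 1870},
    {"uri": "info:author/2", "name": "dickens, charles.", "birth": 1812, "death": 1870},
    {"uri": "info:author/3", "name": "Dickens, Charles", "birth": None, "death": None},
]

groups = candidate_groups(AUTHORS)
print(groups)
```

The third record has the same name string but no dates, so it is deliberately not grouped with the other two.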
The fundamental aspect to realise is that we are dealing with datasets which have missing data, misrepresented data (typos), misinterpreted fields (ISBNs of £2.50 for example) and other non-uniform and irregular problems. Connecting authors together in datasets with these characteristics will rely on us, and the code that we write, making educated guesses and probabilistic assertions based on how confident we are that things match. We cannot say for sure that something is a cast-iron match, only that we are above a certain limit of confidence that this is so. We also have to have a good reason.
Something else to take on board is that what I would consider to be a good match might not be good for someone else. So there needs to be a way to state a connection and to say why, by whom and how this match was made, as well as a need to keep this data, made up of assertions, away from our source data.
I’ve adopted the following model for encoding this assertion in RDF, in a form that sits outside of the source data, as a form of overlay data, and you can find the bundle ontology I’ve used at (pay no attention to where it is currently living):
[Figure: Bundle of URIs, showing use of OPMV]
The URIs shown to be ‘opmv:used’ in this diagram are not meant to be exhaustive. It is likely that a bundle may depend on a look-up or resolution service, external datasheets, authority files, csv lists, dictionary lists and so on. Note that the ‘Reason’ class has few, if any, mandatory properties aside from its connection to a given Bundle and opmv:Process. Assessing whether you trust a Bundle at this moment is very much based on the source and the agent that made the assertion. As things get more mature, more information will regularly find its place attached to a ‘Reason’ instance.
There are currently two subtypes of Reason: AlgorithmicReason and AgentReason. Straightforwardly, this is the difference between a machine-made match and a human-made match, and use of these should aid the assessment of a given match.
Creating a bundle using python: I have added a few classes to Will Waites’ excellent ‘ordf’ library, and you can find my version here. To create a virtualenv to work within, do as follows. You will need mercurial and virtualenv already installed. At a command line – eg ‘[@localhost] $’, enter the following:
hg clone
virtualenv myenv
. ./myenv/bin/activate
(myenv) $ pip install ordf
So, creating a bundle of some URIs – “info:foo” and “info:bar”, due to a human choice of “They look the same to me :) ”:
In python:
from ordf.vocab.bundle import Bundle, Reason, AlgorithmicReason, AgentReason
from ordf.vocab.opmv import Agent
from ordf.namespace import RDF, BUNDLE, OPMV, DC  # you are likely to use these yourself
from ordf.term import Literal, URIRef  # when adding arbitrary triples

b = Bundle()
# or if you don't want a bnode for the Bundle URI: b = Bundle(identifier="")
# NB this also instantiates empty bundle.Reason and opmv.Process instances too,
# in b.reason and b.process, which are used to create the final combined graph at the end
b.encapsulate( URIRef("info:foo"), URIRef("info:bar") )

# we don't want the default plain Reason, we want a human reason:
r = AgentReason()
# again, pass a identifier="" kw to set the URI if you wish
r.comment("They look the same to me :) ")

# Let them know who made the assertion:
a = Agent()
a.nick("benosteen")
a.homepage("")

# Add this agent as the controller of the process:
b.process.agent(a)

g = b.bundle_graph()  # this creates an in-memory graph of all the triples required to assert this bundle

# easiest way to get it out is to "serialize" it:
print g.serialize()
==============
Output:
<?xml version="1.0" encoding="UTF-8"?>
  <rdf:Description rdf:nodeID="PZCNCkfJ2">
    <rdfs:label> on monster (18787)</rdfs:label>
    <ordf:pid rdf:datatype="">18787</ordf:pid>
    <opmv:wasControlledBy rdf:nodeID="PZCNCkfJ9"/>
    <ordf:version rdf:nodeID="PZCNCkfJ4"/>
    <rdf:type rdf:resource=""/>
  <rdf:Description rdf:nodeID="PZCNCkfJ0">
    <bundle:encapsulates rdf:resource="info:bar"/>
    <bundle:encapsulates rdf:resource="info:foo"/>
    <bundle:justifiedby rdf:nodeID="PZCNCkfJ5"/>
    <opmv:wasGeneratedBy rdf:nodeID="PZCNCkfJ2"/>
    <rdf:type rdf:resource=""/>
  <rdf:Description rdf:nodeID="PZCNCkfJ5">
    <rdf:type rdf:resource=""/>
    <opmv:wasGeneratedBy rdf:nodeID="PZCNCkfJ2"/>
  <rdf:Description rdf:nodeID="PZCNCkfJ9">
    <rdf:type rdf:resource=""/>
    <foaf:homepage rdf:resource=""/>
  <rdf:Description rdf:nodeID="PZCNCkfJ4">

Given a triplestore with these bundles, you can query for ‘same as’ URIs via which Bundles a given URI appears in.
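The lookup described above can be sketched without a triplestore, using a plain list of triples. This is only an illustration of the query logic, not ordf or SPARQL code; the bundle identifiers, the "bundle:encapsulates" string standing in for the real predicate URI, and the function names are all assumptions.

```python
# Stand-in for the real predicate URI from the bundle ontology.
ENCAPSULATES = "bundle:encapsulates"

# Hypothetical triples: two bundles that overlap on info:bar.
TRIPLES = [
    ("_:bundle1", ENCAPSULATES, "info:foo"),
    ("_:bundle1", ENCAPSULATES, "info:bar"),
    ("_:bundle2", ENCAPSULATES, "info:bar"),
    ("_:bundle2", ENCAPSULATES, "info:baz"),
]


def bundles_containing(triples, uri):
    """Which bundles does this URI appear in?"""
    return sorted(s for s, p, o in triples if p == ENCAPSULATES and o == uri)


def same_as(triples, uri):
    """Other URIs that share at least one bundle with the given URI."""
    bundles = set(bundles_containing(triples, uri))
    return sorted({o for s, p, o in triples
                   if p == ENCAPSULATES and s in bundles and o != uri})


print(bundles_containing(TRIPLES, "info:bar"))
print(same_as(TRIPLES, "info:foo"))
print(same_as(TRIPLES, "info:bar"))
```

Note that the lookup is deliberately not transitive: info:foo and info:baz never share a bundle, so neither is returned for the other, which is one of the ways a bundle differs from a chain of ‘owl:sameas’ statements.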