Final Product Post: Open Bibliography

- June 30, 2011 in advertisement, FinalProductPost, FinalProjectPost, inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, linkeddata, products, ProgressPost, ProjectProgressPost, prototypes, semanticweb

Bibliographic data has long been understood to contain important information about the large-scale structure of scientific disciplines, and about the influence and impact of various authors and journals. Instead of a relatively small number of privileged data owners being able to manage and control large bibliographic data stores, we want to enable an individual researcher to browse millions of records, view collaboration graphs, submit complex queries, make selections and analyses of data – all on their laptop while commuting to work. The software tools for such easy processing are not yet adequately developed, so for the last year we have been working to improve that: primarily by acquiring open datasets upon which the community can operate, and secondarily by demonstrating what can be done with these open datasets.

Our primary product is Open Bibliographic data

Open Bibliography is a combination of Open Source tools, Open specifications and Open bibliographic data. Bibliographic data is subject to a process of continual creation and replication. The elements of bibliographic data are facts, which in most jurisdictions cannot be copyrighted; there are few technical and legal obstacles to widespread replication of bibliographic records on a massive scale – but there are social limitations: whether individuals and organisations are adequately motivated and able to create and maintain open bibliographic resources.

Open bibliographic datasets

Source | Description | Availability
Cambridge University Library | This dataset consists of MARC 21 output in a single file, comprising around 180000 records. More info… | get the data
British Library | The British National Bibliography contains about 3 million records – covering every book published in the UK since 1950. More info… | get the data, query the data
International Union of Crystallography | Crystallographic research journal publications metadata from Acta Cryst E. More info… | get the data, query the data, view the data
PubMed | The PubMed Medline dataset contains about 19 million records, representing roughly 98% of PubMed publications. More info… | get the data, view the data

Open bibliographic principles

In working towards acquiring these open bibliographic datasets, we have clarified the key principles of open bibliographic data and set them out for others to reference and endorse. We have already collected over 100 endorsements, and we continue to promote these principles within the community. Anyone battling with issues surrounding access to bibliographic data can use these principles and the endorsements supporting them to leverage arguments in favour of open access to such metadata.

Products demonstrating the value of Open Bibliography

OpenBiblio / Bibliographica

Bibliographica is an open catalogue of books with integrated bibliography tools that, for example, allow you to create your own collections and work with Wikipedia. Search our instance to find metadata about anything in the British National Bibliography. More information is available about the collections tool and the Wikipedia tool. Bibliographica runs on the open source openbiblio software, which is designed for others to use – so you can deploy your own bibliography service and create open collections. Other significant features include native RDF linked data support, queryable SOLR indexing and a variety of data output formats.

Visualising bibliographic data

Traditionally, bibliographic records have been seen as a management tool for physical and electronic collections, whether institutional or personal. In bulk, however, they are much richer than that because they can be linked, without violation of rights, to a variety of other information. The primary objective axes are:
  • Authors. As well as using individual authors as nodes in a bibliographic map, we can create co-occurrence of authors (collaborations).
  • Authors’ affiliation. Most bibliographic references will now allow direct or indirect identification of the authors’ affiliation, especially the employing institution. We can use heuristics to determine where the bulk of the work might have been done (e.g. first authorship, commonality of themes in related papers etc.). Disambiguation of institutions is generally much easier than for authors, as there is a smaller number and there are also high-quality sites on the web (e.g. Wikipedia for universities). In general therefore, we can geo-locate all the components of a bibliographic record.
  • Time. The time of publication is well-recorded and although this may not always indicate when the work was done, the pressure of modern science indicates that in many cases bibliography provides a fairly accurate snapshot of current research (i.e. with a delay of perhaps one year).
  • Subject. Although we cannot rely on access to abstracts (most are closed), the title is Open and in many subjects gives high precision and recall. Currently, our best examples are in infectious diseases, where terms such as malaria, plasmodium etc. are regularly and consistently used.
With these components, it is possible to create a living map of scholarship, and we show three examples carried out with our bibliographic sets.

This is a geo-temporal bibliography from the full Medline dataset. Bibliographic records have been extracted by year and geo-spatial co-ordinates located on a grid. The frequency of publications in each grid square is represented by vertical bars. (Note: Only a proportion of the entries in the full dataset have been used and readers should not draw serious conclusions from this prototype). (A demonstration screencast is available at http://vimeo.com/benosteen/medline; the full interactive resource is accessible with Firefox 4 or Google Chrome, at http://benosteen.com/globe.)

This example shows a citation map of papers recursively referencing Wakefield’s paper on the adverse effects of MMR vaccination. A full analysis requires not just the act of citation but the sentiment, and initial inspection shows that the immediate papers had a negative sentiment i.e. were critical of the paper. Wakefield’s paper was eventually withdrawn but the other papers in the map still exist. It should be noted that recursive citation can often build a false sense of value for a distantly-cited object.

This is a geo-temporal bibliographic map for crystallography. The IUCr’s Open Access articles are an excellent resource as their bibliography is well-defined and the authors and affiliations well-identified. The records are plotted here on an interactive map where a slider determines the current timeslice and plots each week’s publications on a map of the world. Each publication is linked back to the original article. (The full interactive resource is available at http://benosteen.com/timemap/index.)

These visualisations show independent publications, but when the semantic facets on the data have been extracted it will be straightforward to aggregate by region, by date and to create linkages between locations.
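
As a rough illustration of the grid-based counting described for the Medline globe, the following Python sketch bins records by year and by grid square. It is not the project's code: the record fields (`year`, `lat`, `lon`), the 5-degree cell size and the sample coordinates are all assumptions made for the example.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=5.0):
    """Return the (row, col) index of the grid square containing a point.

    Rows count upwards from latitude -90, columns eastwards from longitude -180.
    The 5-degree default is an arbitrary choice for this sketch.
    """
    row = math.floor((lat + 90.0) / cell_deg)
    col = math.floor((lon + 180.0) / cell_deg)
    return row, col


def bin_records(records, cell_deg=5.0):
    """Count publications per (year, grid cell); each count is one bar height."""
    counts = Counter()
    for rec in records:
        cell = grid_cell(rec["lat"], rec["lon"], cell_deg)
        counts[(rec["year"], cell)] += 1
    return counts


# Hypothetical, already geo-located records (approximate city coordinates).
SAMPLE_RECORDS = [
    {"year": 2009, "lat": 52.20, "lon": 0.12},    # Cambridge
    {"year": 2009, "lat": 52.21, "lon": 0.09},    # Cambridge
    {"year": 2009, "lat": 51.50, "lon": -0.13},   # London
    {"year": 2010, "lat": 55.95, "lon": -3.19},   # Edinburgh
]

if __name__ == "__main__":
    for (year, cell), n in sorted(bin_records(SAMPLE_RECORDS).items()):
        print(year, cell, n)
```

Aggregating by region or by date, as mentioned above, would then amount to summing these counts over a set of cells or a range of years.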

Open bibliography for Science, Technology and Medicine

We have made further efforts to advocate for open bibliographic data by writing a paper on the subject of Open Bibliography for Science, Technology and Medicine. In addition to submitting for publication to a journal, we have made the paper available as a prototype of the tools we are now developing. Although somewhat subsequent to the main development of this project, these examples show where this work is taking us – with large collections available, and agreement on what to expect in terms of open bibliographic data, we can now support the individual user in new ways.

Uses in the wider community

Demonstrating further applications of our main product, we have identified other projects making use of the data we have made available. These act as demonstrations for how others could make use of open bibliographic data and the tools we (or others) have developed on top of them.

Public Domain Works is an open registry of artistic works that are in the public domain. It was originally created with a focus on sound recordings (and their underlying compositions) because a term extension for sound recordings was being considered in the EU. However, it now aims to cover all types of cultural works, and the British National Bibliography data queryable via http://bibliographica.org provides an exemplar for books. The Public Domain Works team have built on our project output to create another useful resource for the community – which could not exist without both the open bibliographic data and the software to make use of it.

The Bruce at Brunel project was also able to make use of the output of the JISC Open Bibliography project; in their work to develop faceted browse for reporting, they required large quality datasets to operate on, and we were able to provide the open Medline dataset for this purpose. This is a clear advantage for having such open data, in that it informs further developments elsewhere. Additionally, in sharing these datasets we can receive feedback on the usefulness of the conversions we provide.

A further example involves the OKF Open Data in Science working group; Jenny Molloy is organising a hackathon as part of the SWAT4LS conference in December 2011, with the aim of generating open research reports using bibliographic data from PubMedCentral, focussing on malaria research. It is designed to demonstrate what can be done with open data, and this example highlights the concept of targeted bibliographic collections: essentially, reading lists of all the relevant publications on a particular topic. With open access to the bibliographic metadata, we can create and share these easily, and as required.

Additionally, with easy access to such useful datasets comes serendipitous development of useful tools. For example, one of our project team developed a simple tool over the course of a weekend for displaying relevant reading lists for events at the Edinburgh International Science Festival. This again demonstrates what can be done if only the key ingredient – the data – is openly available, discoverable and searchable.

Benefits of Open Bibliography products

Anyone with a vested interest in research and publication can benefit from these open data and open software products – academic researchers from students through to professors, as well as academic administrators and software developers, are better served by having open access to the metadata that helps describe and map the environments in which they operate. The key reasons and use cases which motivate our commitment to open bibliography are:
  1. Access to Information. Open Bibliography empowers and encourages individuals and organisations of various sizes to contribute, edit, improve, link to and enhance the value of public domain bibliographic records.
  2. Error detection and correction. A community supporting the practice of Open Bibliography will rapidly add means of checking and validating the quality of open bibliographic data.
  3. Publication of small bibliographic datasets. It is common for individuals, departments and organisations to provide definitive lists of bibliographic records.
  4. Merging bibliographic collections. With open data, we can enable referencing and linking of records between collections.
  5. A bibliographic node in the Linked Open Data cloud. Communities can add their own linked and annotated bibliographic material to an open LOD cloud.
  6. Collaboration with other bibliographic organisations. Open bibliographic data supports collaboration with reference manager and identifier systems such as Zotero, Mendeley and CrossRef, and with academic libraries and library organisations.
  7. Mapping scholarly research and activity. Open Bibliography can provide definitive records against which publication assessments can be collated, and by which collaborations can be identified.
  8. An Open catalogue of Open scholarship. Since the bibliographic record for an article is Open, it can be annotated to show the Openness of the article itself, thus bibliographic data can be openly enhanced to show to what extent a paper is open and freely available.
  9. Cataloguing diverse materials related to bibliographic records. We see the opportunity to list databases, websites, review articles and other information which the community may find valuable, and to associate such lists with open bibliographic records.
  10. Use and development of machine learning methods for bibliographic data processing. Widespread availability of open bibliographic data in machine-readable formats should rapidly promote the use and development of machine-learning algorithms.
  11. Promotion of community information services. Widespread availability of open bibliographic web services will make it easier for those interested in promoting the development of scientific communities to develop and maintain subject-specific community information.

Sustaining Open Bibliography

Using these products

The products of this project add strength to an ecosystem of ongoing efforts towards large scale open bibliographic (and other) collections. We encourage others to use tools such as the OpenBiblio software, and to take our visualisations as examples for further application. We will maintain our exemplars for at least one year from publication of this post, whilst the software and content remain openly available to the community in perpetuity. We would be happy to hear from members of the community interested in using our products.

Further collaborations and future work

We intend to continue to build on the output of this project; after the success of liberating large bibliographic collections and clarifying open bibliographic principles, the focus is now on managing personal / small collections. Collaborative efforts with the Bibliographic Knowledge network project have begun, and continuing development will make the aforementioned releases of large scale open bibliographic datasets directly relevant and beneficial to people in the academic community, by providing a way for individuals – or departments or research groups – to easily manage, present, and search their own bibliographic collections. Via collaboration with the Scholarly HTML community we intend to follow conventions for embedding bibliographic metadata within HTML documents whilst also enabling collection of such embedded records into BibJSON, thus allowing embedded metadata whilst also providing additional functionality similar to that demonstrated already, such as search and visualisation. We are also working towards ensuring compatibility between ScHTML and Schema.org, affording greater relevance and usability of ScHTML data. Success in these ongoing efforts will enable us to support large scale open bibliographic data, providing a strong basis for open scholarship in the future. We hope to attract further support and collaboration from groups that realise the importance of Open Source code, Open Data and Open Knowledge to the future of scholarship.

Project TOC

All the posts about our project can be viewed in chronological order on our site via the jiscopenbib tag. Posts fall into three main types, and the key posts are listed below. The three types reflect the core strands of this project – documenting our progress whilst adjusting objectives for the best outcome, detailing technical development to increase awareness and for future reference, and announcing data releases. Whilst project progress and technical reports may be more common, it is very important to us to ensure also that the open dataset commitments are understood to be key events in themselves; these are the events that set the example for other groups in the publishing community, and should demonstrate open releases as the “best practice” for the community.

Project progress

Technical reports

Data releases

Further information

Project particulars

This project started on 14th June 2010 and finished successfully and on time on 30th June 2011, with a total cost of £77050. This project was funded by JISC under the jiscEXPO stream of the INF11 programme. The PIMS URL for this project is https://pims.jisc.ac.uk/projects/view/1867.

Software and documentation links and licenses

The main software output of this project is the further developed OpenBiblio software, which is available with installation documentation at http://bitbucket.org/okfn/openbiblio. However, there were other developments done as further demonstrations over the course of the project, and each is detailed on the project blog. See the Project TOC Technical reports list for further information. All the data that was released during this project fell under OKD compliant licenses such as PDDL or CC0, depending on that chosen by the publisher, and detailed in the aforementioned announcement posts. The content of this site is licensed under a Creative Commons Attribution 3.0 License (all jurisdictions).

Project team

  • Peter Murray-Rust – Principal Investigator
  • Rufus Pollock – Project manager
  • Ben O’Steen – Technical lead
  • Mark MacGillivray – Project management
  • Will Waites – Software developer
  • Richard Jones – Additional software development
  • Tatiana De La O – Additional software development
And with thanks to David Flanders – JISC Program Manager

Project partners

Collections in Bibliographica: unsorted information is not information

- June 12, 2011 in inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, News, OKFN Openbiblio, progressPosts, Semantic Web, WIN

Collections are the first feature at Bibliographica aimed at user participation. They are lists of books that users can create and share with others, and they are one of the basic features of Bibliographica, as Jonathan Gray has already pointed out:
lists of publications are an absolutely critical part of scholarship. They articulate the contours of a body of knowledge, and define the scope and focus of scholarly enquiry in a given domain. Furthermore such lists are always changing.
Details of use

Collections are accessible via the collections link on the top menu of the website. To create collections you must be logged in. You can log in at http://bibliographica.org/account/login with an OpenID.

Once logged in, every time you open a book page (e.g. http://bnb.bibliographica.org/entry/GB6502067) you will see the Collections menu on the right, where you can choose between creating a new collection with that work or adding the work to an existing collection.

Any collections you have created are always accessible through the menu, and they also appear on your account page.

To remove a book from a collection, click remove in the collection listing in the sidebar.

Collections screencast

Bibliographica gadget in Wikipedia

- June 6, 2011 in Bibliographic, inf11, jisc, JISC OpenBib, jiscEXPO, jiscLMS, jiscopenbib, News, OKFN Openbiblio, progress, progressPosts, Semantic Web, software

What is a Wikipedia gadget?

Thinking of ways to show the possibilities of linked data, we have made a Wikipedia gadget, making use of a great resource the Wikimedia developers give to the community. Wikipedia gadgets are small pieces of code you can add to your Wikipedia user templates; they add functionality and render more information when you browse Wikipedia pages.

In our case, we wanted to retrieve information from our Bibliographica site and render it in Wikipedia. Because the pages are rendered with specific markup, we can use the ISBNs present in Wikipedia articles to query the Bibliographica database, in a way similar to what Mark has done with the Edinburgh International Science Festival.

Bibliographica.org offers an ISBN search endpoint at http://bibliographica.org/isbn/, so if we ask for the page http://bibliographica.org/isbn/0241105161 we receive:

[{"issued": "1981-01-01T00:00:00Z", "publisher": {"name": "Hamilton"}, "uri": "http://bnb.bibliographica.org/entry/GB8102507", "contributors": [{"name": "Boyd, William, 1952-"}], "title": "A good man in Africa"}]

I can use this information to make a window pop up with more information about works when we hover over their ISBNs on Wikipedia pages. If my user template includes the Bibliographica gadget, every time I open a wiki page the script will ask our database for information about all the ISBNs on the page. If something is found, it will render a frame around the ISBNs, and if I hover over them, I see a window with information about the book.

Get the widget

If you want to have this widget, first you need to create a Wikipedia account, and then change your default template to add the JavaScript snippet. Once you do this (instructions here) you will be able to get the information available in Bibliographica about the books.

Next steps

For now, the interaction goes in just one direction. Later on, we will be able to feed that information back to Bibliographica.
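
To make the shape of the ISBN endpoint response more concrete, here is a small Python sketch that builds a lookup URL and summarises a response of the form quoted in this post. The gadget itself is a JavaScript snippet; this is only an illustration, and the helper names (`isbn_url`, `summarise`, `lookup`) are made up for the example. The endpoint may not respond as it did when the post was written, so the parsing is shown against the sample response rather than a live call.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Endpoint quoted in the post.
ISBN_ENDPOINT = "http://bibliographica.org/isbn/"

# Sample response copied from the post.
SAMPLE_RESPONSE = (
    '[{"issued": "1981-01-01T00:00:00Z", "publisher": {"name": "Hamilton"}, '
    '"uri": "http://bnb.bibliographica.org/entry/GB8102507", '
    '"contributors": [{"name": "Boyd, William, 1952-"}], '
    '"title": "A good man in Africa"}]'
)


def isbn_url(isbn):
    """Build the lookup URL, dropping hyphens and spaces from the ISBN."""
    cleaned = "".join(ch for ch in isbn if ch.isdigit() or ch in "xX")
    return ISBN_ENDPOINT + quote(cleaned)


def summarise(response_text):
    """Turn the JSON list returned by the endpoint into short summaries."""
    summaries = []
    for work in json.loads(response_text):
        summaries.append({
            "title": work.get("title"),
            "authors": [c.get("name", "") for c in work.get("contributors", [])],
            "publisher": work.get("publisher", {}).get("name"),
            "year": (work.get("issued") or "")[:4] or None,
            "uri": work.get("uri"),
        })
    return summaries


def lookup(isbn):
    """Fetch and summarise a live response (network access; not run below)."""
    with urlopen(isbn_url(isbn)) as resp:
        return summarise(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(isbn_url("0-241-10516-1"))
    print(summarise(SAMPLE_RESPONSE))
```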

Medline dataset

- May 23, 2011 in announcement, Bibliographic, communityBenefits, Data, inf11, institutionalBenefits, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, News, OKFN Openbiblio, progress, progressPosts, Semantic Web, WIN

Announcing the CC0 Medline dataset

We are happy to report that we now have a full, clean public domain (CC0) version of the Medline dataset available for use by the community.

What is the Medline dataset?

The Medline dataset is a subset of bibliographic metadata covering approximately 98% of all PubMed publications. The dataset comes as a package of approximately 653 XML files, chronologically listing records in terms of the date the record was created. There are approximately 19 million publication records. Medline is a maintained dataset, and updates chronologically append to the current dataset. Read our explanation of the different PubMed datasets for further information.

Where to get it

The raw dataset can be downloaded from CKAN : http://ckan.net/package/medline

What is in a record

Most records contain useful non-copyrightable bibliographic metadata such as author, title, journal, PubMed record ID. Many also have DOIs. We have stripped out any potentially copyrightable material such as abstracts. Read our technical description of a record for further information.
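
As a hedged illustration of what stripping potentially copyrightable material from a record might look like, the Python sketch below removes the abstract from a single, heavily simplified record and extracts the remaining metadata. The sample record is invented, the element names follow our reading of the Medline DTD, and this is not the cleaning code used to produce the released dataset.

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified record for illustration only.
SAMPLE_XML = """<MedlineCitation>
  <PMID>12345</PMID>
  <Article>
    <Journal><Title>Example Journal</Title></Journal>
    <ArticleTitle>An example title</ArticleTitle>
    <Abstract><AbstractText>Potentially copyrightable text.</AbstractText></Abstract>
    <AuthorList>
      <Author><LastName>Smith</LastName><ForeName>Jane</ForeName></Author>
    </AuthorList>
  </Article>
</MedlineCitation>"""


def strip_elements(root, tags=("Abstract",)):
    """Remove every element whose tag is in `tags`; return how many were removed."""
    removed = 0
    # Snapshot the tree first so that removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in tags:
                parent.remove(child)
                removed += 1
    return removed


def extract_metadata(root):
    """Collect the bibliographic fields mentioned above from one record."""
    authors = [
        "{}, {}".format(a.findtext("LastName", ""), a.findtext("ForeName", ""))
        for a in root.findall("Article/AuthorList/Author")
    ]
    return {
        "pmid": root.findtext("PMID"),
        "title": root.findtext("Article/ArticleTitle"),
        "journal": root.findtext("Article/Journal/Title"),
        "authors": authors,
    }


if __name__ == "__main__":
    record = ET.fromstring(SAMPLE_XML)
    strip_elements(record)
    print(extract_metadata(record))
```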

Sample usage

We have made an online visualisation of a sample of the Medline dataset – however the visualisation relies on WebGL, which is not yet widely supported by browsers. It should work in Chrome and probably Firefox 4. This is just one example, but it shows what great things we can build and learn from when we have open access to the necessary data.

OpenBiblio workshop report

- May 9, 2011 in Bibliographic, BibServer, communityBenefits, Data, event, inf11, jisc, JISC OpenBib, jiscEXPO, jiscLMS, jiscopenbib, OKFN Openbiblio, progress, progressPosts, rdf, Semantic Web, WIN

#openbiblio #jiscopenbib The OpenBiblio workshop took place on 6th May 2011, at London Knowledge Lab

Participants

  • Peter Murray-Rust (Open Bibliography project, University of Cambridge, IUCr)
  • Mark MacGillivray (Open Bibliography project, University of Edinburgh, OKF, Cottage Labs)
  • William Waites (Open Bibliography project, University of Edinburgh, OKF)
  • Ben O’Steen (Open Bibliography project, Cottage Labs)
  • Alex Dutton (Open Citation project, University of Oxford)
  • Owen Stephens (Open Bibliographic Data guide project, Open University)
  • Neil Wilson (British Library)
  • Richard Jones (Cottage Labs)
  • David Flanders (JISC)
  • Jim Pitman (Bibserver project, UCB) (remote)
  • Adrian Pohl (OKF bibliographic working group) (remote)
During the workshop we covered some key areas where we have seen some success already in the project, and discussed how we could continue further.

Open bibliographic data formats

In order to ensure successful sharing of bibliographic data, we require agreement on a suitable yet simple format via which to disseminate records. Whilst representing linked data is valuable, it also adds complexity; however, simplicity is key for ensuring uptake and for enabling easy front end system development. Whilst data is available as RDF/XML, JSON is now a very popular format for data transfer, particularly where front end systems are concerned.

We considered various JSON linked data formats, and have implemented two for further evaluation. In order to make sure this development work is as widely applicable as possible, we wrote parsers and serialisers for JSON-LD and RDF/JSON as plugins for the popular RDFlib.

The RDF/JSON format is, of course, RDF; therefore, it requires no further change to enable it to handle our data, and our RDF/JSON parser and serialiser are already complete. However, it is not very JSON-like, as data takes the subject(predicate(object)) form rather than the general key:value form. This is where JSON-LD can improve the situation – it provides for listing information in a more key:value-like format, making it easier to use for front end developers not interested in the RDF relations. But this leads to additional complexity in the spec and parsing requirements, so we have some further work to complete:
  • remove angle brackets from blank nodes
  • use type coercion to move types out of main code
  • use language coercion to omit languages
Our code is currently available in our repository, and we will request that our parsers and serialisers get added to RDFlib or to RDFextras once they are complete (they are still in development at present).

To further assist in representing bibliographic information in JSON, we also intend to implement BibJSON within JSON-LD; this should provide linked data functionality where necessary via JSON-LD support, whilst also enabling simpler representation of bibliographic data via key:value pairs where that is all that is required. By making these options available to our users, we will be able to gauge the most popular representation format.

Regardless of format used, a critical consideration is that of stable references to data. Without this, maintaining datasets will be very hard. To date, the British Library data for example does not have suitable identifiers. However, the BL are moving forward with applying identifiers and will be issuing a new version of their dataset soon, which we will take as a new starting point. We have provided a list of records that we have identified as non-unique, and in turn the BL will share the tools they use to manage and convert data where possible, to enable better community collaboration.
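
The difference between the key:value form and the subject(predicate(object)) form can be sketched in a few lines of Python. This is not the RDFlib plugin code mentioned above. The function name, the predicate mapping to Dublin Core terms and the rule that strings starting with `http://` are treated as URIs are all assumptions for this example. The nested output follows the RDF/JSON layout as we understand it: subject, then predicate, then a list of objects each carrying a `type` and a `value`.

```python
import json

DC = "http://purl.org/dc/terms/"

# Hypothetical mapping from simple keys to predicate URIs.
PREDICATES = {
    "title": DC + "title",
    "issued": DC + "issued",
}

# A flat key:value record, using values from the sample response quoted earlier.
FLAT_RECORD = {
    "title": "A good man in Africa",
    "issued": "1981-01-01T00:00:00Z",
}

SUBJECT = "http://bnb.bibliographica.org/entry/GB8102507"


def flat_to_rdf_json(subject_uri, record, predicate_map):
    """Convert a flat key:value record into an RDF/JSON-style nested structure."""
    predicates = {}
    for key, value in record.items():
        values = value if isinstance(value, list) else [value]
        objects = []
        for v in values:
            # Simplifying heuristic: http(s) strings are URIs, the rest literals.
            if isinstance(v, str) and v.startswith(("http://", "https://")):
                objects.append({"type": "uri", "value": v})
            else:
                objects.append({"type": "literal", "value": str(v)})
        predicates[predicate_map[key]] = objects
    return {subject_uri: predicates}


if __name__ == "__main__":
    print(json.dumps(flat_to_rdf_json(SUBJECT, FLAT_RECORD, PREDICATES), indent=2))
```

The flat record is what a front end developer would usually prefer to consume; the nested form carries the same facts but requires knowing the predicate URIs to read them.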

Getting more open datasets

We are building on the success of the BL data release by continuing work on our CUL and IUCr data, and also by getting more datasets. The latest is the Medline dataset; there were some initial issues with properly identifying this dataset, so we have a previous blog post and a link to further information, the Medline DTD and specifications of the PubMed data elements to help.

The Medline dataset

We are very excited to have the Medline dataset; we are currently working on cleaning so that we can provide access to all the non-copyrightable material it contains, which should represent a listing of about 98% of all articles published in PubMed. The Medline dataset comes as a package of approximately 653 XML files, chronologically listing records in terms of the date the record was created. This also means that further updates will be trackable as they will append to the current dataset. We have found that most records contain useful non-copyrightable bibliographic metadata such as author, title, journal, PubMed record ID, and that some contain further metadata such as citations, which we will remove. Once this is done, and we have checked that there are unique IDs (e.g. that the PubMed IDs are unique) we will make the raw CC0 collection available, then attempt to get it into our Bibliographica instance. We will then also be able to generate visualisations on our total dataset, which we hope will be approaching 30 million records by the end of the JISC Open Bibliography project.
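
The uniqueness check mentioned above could be as simple as the following Python sketch, which lists any identifier that appears more than once. The `pmid` field name and the sample records are assumptions for illustration.

```python
from collections import Counter


def find_duplicate_ids(records, id_field="pmid"):
    """Return the sorted identifiers that occur in more than one record."""
    counts = Counter(rec[id_field] for rec in records)
    return sorted(ident for ident, n in counts.items() if n > 1)


# Hypothetical records: "1002" appears twice.
SAMPLE_IDS = [{"pmid": "1001"}, {"pmid": "1002"}, {"pmid": "1002"}, {"pmid": "1003"}]

if __name__ == "__main__":
    print(find_duplicate_ids(SAMPLE_IDS))
```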

Displaying bibliographic records

Whilst Bibliographica allows for display of individual bibliographic records and enables building collections of such records, it does not yet provide a means of neatly displaying lists of bibliographic records. We have partnered with Jim Pitman of UC Berkeley to develop his BibServer to fit this requirement, and also to bring further functionality such as search and faceted browse. This also provides further development direction for the output of the project beyond the July end date of the JISC Open Bibliography project.

Searching bibliographic records

Given the collaboration between Bibliographica and BibServer on collection and display of bibliographic records, we are also considering ways to enable search across non-copyrightable bibliographic metadata relating to any published article. We believe this may be achievable by building a collection of DOIs with relevant metadata, and enabling crowdsourcing of updates and comments. This effort is separate from the main development of the projects, but it would make a very good addition both to the functionality of the developed software and to the community. It would also tie in with any future functionality that enables author identification and information retrieval, such as ORCID, and would allow us to build on the work done at sites such as BIBKN.

Disambiguation without deduplication

There have been a number of experiments recently highlighting the fact that a simple Lucene search index over datasets tends to give better matches than more complex methods of identifying duplicates. Ben O’Steen and Alex Dutton both provided examples of this, from their work with the Open Citation project. This is also supported by a recent paper from Geoffrey Bilder entitled “Disambiguation without Deduplication” (not publicly available). The main point here is that instead of deduplicating objects we can simply do machine disambiguation and make sameAs-ness assertions between multiple objects; this would enable changes to still be applied to different versions of an object by disparate groups (e.g. where each group has a different spelling or identifier, perhaps, for some key part of the record) whilst still maintaining a relationship between the two objects. We could build on this sort of functionality by applying expertise from the library community if necessary, although deduplication/merging should only be contemplated if there is a new dataset being formed which some agent is taking responsibility to curate. If not, it is better to just cluster the data by sameAs assertions, and keep track of who is making those assertions, to assess their reliability.

We suggest a concept for increasing collaboration on this sort of work – a ReCaptcha of identities. Upon login, perhaps to Bibliographica or another relevant system, a user could be presented with two questions, one of which we know the answer to, and the other being a request to match identical objects. This, in combination with decent open source software tools enabling bibliographic data management (building on tools such as Google Refine and Needlebase), would allow for simple verifiable disambiguation across large datasets.
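
One way to picture clustering by sameAs assertions without merging records is the Python sketch below. It keeps every assertion together with who made it, and derives clusters on demand with a union-find structure, leaving the underlying records untouched. The class and method names are invented for this example and do not correspond to any existing tool.

```python
from collections import defaultdict


class SameAsStore:
    """Keep sameAs assertions with their provenance and derive clusters from them."""

    def __init__(self):
        self._parent = {}        # union-find forest over record identifiers
        self.assertions = []     # (record_a, record_b, asserted_by)

    def _find(self, x):
        """Return the representative of x's cluster (with path halving)."""
        self._parent.setdefault(x, x)
        while self._parent[x] != x:
            self._parent[x] = self._parent[self._parent[x]]
            x = self._parent[x]
        return x

    def assert_same(self, a, b, asserted_by):
        """Record that `asserted_by` considers a and b to be the same thing."""
        self.assertions.append((a, b, asserted_by))
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def clusters(self):
        """Return clusters of two or more identifiers, each sorted."""
        groups = defaultdict(set)
        for x in list(self._parent):
            groups[self._find(x)].add(x)
        return [sorted(g) for g in groups.values() if len(g) > 1]

    def assertions_by(self, agent):
        """List the assertions made by one agent, e.g. to review their reliability."""
        return [(a, b) for a, b, who in self.assertions if who == agent]


if __name__ == "__main__":
    store = SameAsStore()
    store.assert_same("bl:1", "cul:7", asserted_by="alice")
    store.assert_same("cul:7", "medline:42", asserted_by="bob")
    print(store.clusters())
    print(store.assertions_by("alice"))
```

Because the assertions are stored rather than applied destructively, those from an agent later judged unreliable could be dropped and the clusters rebuilt from the remainder.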

Sustaining open bibliographic data

Having had success in getting open bibliographic datasets and prototyping their availability, we must consider how to maintain long term open access. There are three key issues:

Continuing community engagement

We must continue to work with the community, and to provide explanatory information to those needing to make decisions about bibliographic data, such as the OpenBiblio Principles and the Open Bibliographic Data guide. We must also ensure we improve resource discovery by supporting the requirement for generating collections and searching content. Additionally, quality bibliographic data should be hosted at some key sites – there are a variety of options such as Freebase, CKAN, bibliographica – but we must also ensure that community members can be crowdsourced both for managing records within these central options and also for providing access to smaller distributed nodes, where data can be owned and maintained at the local level whilst being discoverable globally.

Maintaining datasets

Dataset maintenance is critical to ongoing success – stale data is of little use to people, and disregard for content maintenance will put off new users. We must co-ordinate with source providers such as the BL by accepting changesets from them and incorporating those changes into other versions. This is already possible with the Medline data, for example, and will very soon be the case with BL updates too. We should advocate for this method of dataset updates during any future open data negotiations. This will allow us to keep our datasets fresh and relevant, and to properly represent growing datasets. We must continue to promote open access to non-copyrightable datasets, and ensure that there is a location for open data providers to easily make their raw datasets available – such as CKAN. We will ensure that all the software we have developed during the course of the project – and in future – will remain open source and publicly available, so that it will be possible for anyone to perform the transforms and services that we can perform.
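To make the changeset idea concrete, here is a minimal Python sketch of applying a provider update to a local copy of a dataset. The changeset layout (an "upsert" mapping plus a "delete" list) and all record identifiers are assumptions made for illustration; real providers such as the BL or Medline will publish updates in their own formats.

```python
def apply_changeset(records, changeset):
    """Return a new dict of records with a changeset applied.

    `records` maps a record identifier to a dict of fields. `changeset` is a
    hypothetical structure with optional "upsert" (id -> changed fields) and
    "delete" (list of ids) entries.
    """
    # Copy so that the caller's version of the dataset is left unchanged.
    updated = {rid: dict(fields) for rid, fields in records.items()}
    for rid, fields in changeset.get("upsert", {}).items():
        updated.setdefault(rid, {}).update(fields)
    for rid in changeset.get("delete", []):
        updated.pop(rid, None)
    return updated


# Invented example data, not taken from any real dataset.
local_copy = {
    "rec-1": {"title": "An example title", "year": "1999"},
    "rec-2": {"title": "A withdrawn record"},
}
changeset = {
    "upsert": {"rec-1": {"year": "2000"}, "rec-3": {"title": "A new record"}},
    "delete": ["rec-2"],
}
refreshed = apply_changeset(local_copy, changeset)
```

Returning a new dictionary rather than editing in place means an earlier version stays available for comparison, which is one simple way to support change tracking.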

Community involvement with dataset maintenance

We should support community members that wish to take responsibility for overseeing the updating of datasets. Such people are critical for long term sustainability, but hard to find. They need to be recruited and provided with simple tools which will empower them to easily maintain and share datasets they care about with a minimal time commitment. Thus we must make sure that our software and tools are not only open source, but usable by non-team members. We will work on developing tools such as ReCaptcha for disambiguation, and on building game / rank table functionality for those wishing to participate in entity disambiguation (in addition to machine disambiguation).
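One possible shape for a ReCaptcha of identities is sketched below in Python: each user answers a control pair whose answer is already known plus an unknown pair, and the unknown answer is only counted when the control answer is correct. The class, its methods and the `min_votes` threshold are hypothetical choices for illustration, not an existing tool.

```python
from collections import Counter


class IdentityCaptcha:
    """Collect human judgements on whether two identifiers refer to the same thing."""

    def __init__(self, known_answers):
        # known_answers maps (id_a, id_b) to True/False for pairs already verified.
        self.known_answers = dict(known_answers)
        self.votes = {}  # (id_a, id_b) -> Counter of True/False replies

    def submit(self, known_pair, known_reply, unknown_pair, unknown_reply):
        """Count the reply to the unknown pair only if the control pair was answered correctly."""
        if self.known_answers[known_pair] != known_reply:
            return False
        self.votes.setdefault(unknown_pair, Counter())[unknown_reply] += 1
        return True

    def consensus(self, pair, min_votes=3):
        """Return the leading answer once it has at least min_votes, otherwise None."""
        counts = self.votes.get(pair)
        if not counts:
            return None
        answer, top = counts.most_common(1)[0]
        return answer if top >= min_votes else None
```

A pair that reaches consensus could then be recorded as a sameAs assertion attributed to the crowd, and added to the pool of known answers for later control questions.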

Critical mass

We hope that by providing almost 30 million records to the community under a CC0 license, and with the support of all the providers that made this possible, we will achieve a critical mass of data, and an exemplar for future open access to such data. This should provide the go-to list of such information, and inspire others to contribute and maintain. However, such community assistance will only continue for as long as there appears to be reasonable maintenance of the corpus and software we have already developed – if these fall into disrepair, community engagement is far less likely.

Maintaining services

The bibliographica service that we currently run already requires significant hardware to run. Once we add in Medline data, we will require very large indexes, requiring a great deal of RAM and fast disks. There is therefore a long term maintenance requirement implicit in running any such central service of open bibliographic data on this scale. We will present a case for ongoing funding requirements and seek sources for financial support both for technical maintenance and for ongoing software maintenance and community engagement.

Business cases

In order to ensure future engagement with groups and business entities, we must make clear examples of the benefits of open bibliographic data. We have already done some work on visualising the underlying data, which we will develop further for higher impact. We will identify key figures in the data that we can feed into such representations to act as exemplars. Additionally, we will continue to develop mashups using the datasets, to show the serendipitous benefit that increases exposure but is only possible with unambiguously open access to useful data.

Events and announcements

We will continue to promote our work and the efforts of our partners, and advocate further for open bibliography, by publicising our successes so far. We will co-ordinate this with JISC, BL, OKF and other interested groups, to ensure that the impact of announcements by all groups is enhanced. We will present our work at further events throughout the year, by attending and running sessions at OKCon, OR11 and other conferences, and by arranging further hackdays.

Follow-up to serialising RDF in JSON

- May 5, 2011 in BibServer, Data, inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, OKFN Openbiblio, ontology, progress, rdf, Semantic Web

Following on from Richard’s post yesterday, we now have a JSON-LD serialiser for RDFlib. This is still a work in progress, and there may be things that it is serialising incorrectly. So, please give us feedback on this, and tell us where we have misinterpreted the structure. Here you will find a sample JSON-LD output file, which was generated from this Bibliographica record. The particular area of concern surrounds how the JSON-LD spec describes serialising disjoint graphs into JSON-LD (section 8.2). How does this differ from serialising joined graphs? We are presuming that our output file is an example of a joined graph, and that additional disjoint graphs would be added by appending additional @:[] sections.

Comparative Serialisation of RDF in JSON

- May 4, 2011 in BibServer, Data, inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, model, OKFN Openbiblio, ontology, outputs, progress, progressPosts, rdf, Semantic Web

This is a comparison of RDF-JSON and JSON-LD for serialising bibliographic RDF data. Given that we are also working with BibServer we have taken a BibJSON document as our source data for comparison. The objective was to both understand these two JSON serialisations of RDF and also to look at the BibJSON profile to see how it fits into such a framework. Due to limitations of the display of large plain-text code snippets on the site, we have placed the actual content in this text file which you should refer to as we go along. We used a BibJSON document, which comes from the examples on the BibJSON homepage. When converting this into the two RDF serialisations we invent a namespace
http://www.bibkn.org/bibjson/terms/
This namespace provisionally holds all predicates/keys that are used by BibJSON and are not immediately clearly available in another ontology. These terms should not under any circumstances be considered definitive or final, only indicative. Now consider the RDF-JSON serialisation. Some key things to note about this serialisation:
  • There is no explicit shortening of URIs for predicates into CURIEs, all URIs are instead presented in full.
  • The object of each predicate is a JSON object with up to 4 keys (value, type, datatype, lang). This means that it is not easy for the human eye to pick out the value of a particular predicate.
  • Of the two RDF serialisations, this is by far the more verbose.
  • It is relatively difficult for a human to read and write
Compare this with the equivalent JSON-LD serialisation. Some things to note about this serialisation:
  • It has a clear treatment of namespaces
  • It may be slightly inaccurate, as there are some parts of its specification which are ambiguous – feedback welcome
  • The object values cannot be taken as the value of the predicate, as they may contain datatype and/or language information in them, or may be surrounded by angled brackets.
  • It is relatively easy for a human to read and write
Both serialisations are capable of representing the same data, although JSON-LD is far more terse and therefore easier to read and write. It is not, however, possible to reliably treat JSON-LD as a pure list of key-value pairs in non-RDF aware environments, as it includes RDF type and language semantics in the literal values of objects. RDF-JSON does not suffer from this same issue within the object literals, but in return its notation is more complex. A serious shortcoming of RDF-JSON is its lack of explicit handling of CURIEs and namespaces, and it could benefit from adopting the conventions laid out in JSON-LD – this may bring the choice of which serialisation to use down to preference rather than any significant technical difference. Each of the formats also comfortably represents BibJSON, and with the extensive lists of predicates provided in that specification it would be straightforward enough to do a full and proper treatment of BibJSON through one of these routes.
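The verbosity of RDF-JSON can be seen in a small Python sketch that wraps plain key/value pairs in the subject, predicate, list-of-objects shape, and then unwraps them again. The subject URI, the record contents and the use of "title" and "year" as terms under the provisional namespace are illustrative assumptions; only the "value" and "type" keys are emitted here, and "lang" and "datatype" are left out for brevity.

```python
import json

# The provisional namespace mentioned above; the terms placed under it below are hypothetical.
BIBJSON_NS = "http://www.bibkn.org/bibjson/terms/"


def to_rdf_json(subject, fields, namespace=BIBJSON_NS):
    """Wrap plain key/value pairs in the RDF-JSON subject -> predicate -> [object] shape."""
    predicates = {}
    for key, value in fields.items():
        # Full predicate URI, no CURIE shortening; every value becomes a typed object.
        predicates[namespace + key] = [{"value": value, "type": "literal"}]
    return {subject: predicates}


def to_plain(rdf_json, namespace=BIBJSON_NS):
    """Recover simple key/value pairs by stripping the namespace and the object wrappers."""
    plain = {}
    for predicates in rdf_json.values():
        for predicate, objects in predicates.items():
            key = predicate[len(namespace):] if predicate.startswith(namespace) else predicate
            values = [obj["value"] for obj in objects]
            plain[key] = values[0] if len(values) == 1 else values
    return plain


# Invented record, not one of the BibJSON homepage examples.
record = {"title": "An example title", "year": "2011"}
verbose = to_rdf_json("http://example.org/record/1", record)
print(json.dumps(verbose, indent=2))
```

Printing `verbose` shows each value buried inside a one-element list of objects under a full predicate URI, which is the readability cost noted above, while `to_plain(verbose)` gives back the original two-key dictionary.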

Getting open bibliographic data from (UK)PMC / PubMed

- May 3, 2011 in Bibliographic, Data, inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, OKFN Openbiblio, progress, progressPosts

For some time now, the JISC Open Bibliography project team has been attempting to get open bibliographic data from (UK)PMC / PubMed. Everyone involved (Robert Kiley – Wellcome, Ben O’Steen, Peter Murray-Rust – JISC OpenBib, Jeff Beck – NIH/NLM/NCBI, Johanna McEntyre) has worked hard to achieve this, but attempts have been hampered by ambiguities and technical restrictions. The purpose of this post is to clarify and highlight these issues as examples of stumbling blocks on any path to linked open data, to specify what it is we are trying to achieve at present, and learn how to improve this process.

WHAT WE ARE TRYING TO DO

Closed access to bibliography is dangerous – it actually holds back the scientific discovery process. We therefore believe it is important to have an authoritative Open collection of bibliographic records. This acts as a primary resource for the community which they can use for normalisation, discovery, annotation, etc. We seek confirmation that we can have programmatic access to the approximately twenty million records in PubMed. NCBI for example should be able to say: “these are the articles which we have in PubMed” without breaking any laws or contracts. These articles would be identified by their core bibliographic data.

PROBLEMS

  • We received an original email last year stating that we could have such access to PubMed, but it has become unclear what PubMed is.
  • Identifying the correct content is not straightforward – are we talking about PMC / UKPMC / PubMed / Open Access subset?
  • What licenses are involved and on which subsets do open licenses such as CC0 apply?
  • These datasets are very large, so incremental and recordset-by-recordset requests to servers have resulted in roadblocks such as timeouts and errors.

WHAT DATASET ARE WE TALKING ABOUT

  • The 2 million articles in PMC are NOT all open access. There are 251,129 articles (approx 12% of PMC) that are in the open access subset.
  • Although there are 2 million or so articles in PMC which anyone can look at, print out etc, only 251k of these have an OA licence which allows people to re-use the content, including creating derivative works.
  • PMC and UKPMC have approximately the same full-text content. There is a small minority of journals which refused to allow their content to be mirrored to UKPMC.
  • The distinction between “public access” content and “open access” articles (i.e. 0.25m articles) is irrelevant, as we are only interested in the bibliographic record, not the content.
  • For current purposes PMC and UKPMC can be used interchangeably.
  • PMC is only a subset of PubMed – which contains about twenty million records, the totality of content in NIH / NLM / NCBI.
  • The MEDLINE dataset is a subset of about 98% of PubMed.
  • However we believe, as per previous discussions, that the legal situation applies equally to PubMed as to the PMC.
  • So we are looking for every bibliographic record in PubMed (or MEDLINE if that is easier to acquire).

WHAT DO WE MEAN BY BIBLIOGRAPHIC RECORD

  • “Bibliography” is sometimes used as synonymous with “a given collection of bibliographic records”. Consider “the bibliographic data for Pubmed”; what we are interested in is enumerating individual bibliographic records.
  • “Citation” often refers to the reference within the fulltext to another publication (via its bibliographic record). The list of citations is not in general Open except in Open Access journals.
  • For the purposes of Open Bibliography we are restricting our discussion to what we call core bibliographic data (described in the open bibliographic data principles)
  • We regard the core bibliographic data as uncopyrightable, and generally acknowledged to be necessarily Open.
  • This core bibliographic data is what we mean by the bibliographic record.
  • Such records are unoriginal and inevitable, being the only way of actually identifying a work.
  • Although collections of bibliographic data are copyrightable (at least in Europe) because they are the result of the creative act of assembling a set of records, the individual records are not.
  • There is no creative act in compiling the list of bibliographic records held by NCBI/Pubmed as it is an exhaustive enumeration.
  • We believe that there is no moral case and probably no legal case for regarding these as the property of the publisher.

WHAT DO WE NOT MEAN BY BIBLIOGRAPHIC RECORD

  • As abstracts appear to be copyrightable we do not include abstracts, or annotations.
  • If it is not in the open bibliographic principles, we do not consider it to be in the bibliographic record.

WHAT WE HOPE TO GET NOW

  • Due to issues with programmatic access to the PMC / PubMed dataset (restrictions on requests to the servers that contain them), we request a dump of the MEDLINE dataset.
  • This represents about 98% of PubMed which we believe is or should be available as CC0.
  • As MEDLINE also has incremental updates, we request ongoing access to those, to allow change tracking and synchronisation.
  • We have filled in the automatic leasing form for the MEDLINE set a few times since February (the most recent attempt was at the end of April).
  • We hope that the position is now clearly stated in this post, and await confirmation.
  • Upon agreement we look forward to receiving the XML files containing the MEDLINE dataset, from which we will extract the aforementioned unoriginal and re-usable bibliographic data.
We look forward to resolving this, to receiving the data, and to helping to make it openly available.
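The extraction step mentioned in the list above might look something like the following Python sketch, which reads citation XML and keeps only identifying fields while skipping the abstract. The sample document is invented and heavily simplified, and the choice of fields is our assumption about what counts as core bibliographic data; the real MEDLINE files carry many more elements.

```python
import xml.etree.ElementTree as ET

# Invented, simplified sample; not a real MEDLINE record.
SAMPLE = """
<MedlineCitationSet>
  <MedlineCitation>
    <PMID>00000001</PMID>
    <Article>
      <Journal><Title>Journal of Examples</Title></Journal>
      <ArticleTitle>An invented article title.</ArticleTitle>
      <Abstract><AbstractText>Abstract text that is deliberately not extracted.</AbstractText></Abstract>
      <AuthorList>
        <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</MedlineCitationSet>
"""


def core_records(xml_text):
    """Yield one dict of identifying fields per citation, ignoring abstracts and annotations."""
    root = ET.fromstring(xml_text.strip())
    for citation in root.iter("MedlineCitation"):
        article = citation.find("Article")
        authors = [
            " ".join(filter(None, [a.findtext("ForeName"), a.findtext("LastName")]))
            for a in article.iterfind("AuthorList/Author")
        ]
        yield {
            "pmid": citation.findtext("PMID"),
            "title": article.findtext("ArticleTitle"),
            "journal": article.findtext("Journal/Title"),
            "authors": authors,
        }


records = list(core_records(SAMPLE))
```

Because the function only copies the fields it names, anything outside that list, such as the abstract, never reaches the output.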

Bibliographica and Edinburgh International Science Festival

- April 11, 2011 in Data, event, inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, OKFN Openbiblio, progressPosts, WIN

This weekend I was trying to build a useful search tool to help my wife find interesting events on at the Edinburgh International Science Festival. One problem was that the dataset was poor, and the descriptions did not always give a lot of detail. I attempted to rectify this by hooking up the events to bibliographica. Now, you can filter events then select “more” to see further details and a list of relevant publications based on the event speakers and the event theme; this can give a slightly better idea of what might be going on, as you can review the published work of those involved. http://eisf.cottagelabs.com Unfortunately, the data does still have quite a few errors, and I have not ensured that names tie up properly, so the results are not always perfect. But still, it is quite a good demonstration. It would be even better with journal articles to search across.

open theses at EURODOC

- April 7, 2011 in Bibliographic, communityBenefits, inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, progress, progressPosts, WIN

#jiscopenbib #opentheses On Friday 1st April 2011, Mark MacGillivray, Peter Murray-Rust and Ben O’Steen remotely attended the EURODOC conference in Vilnius, Lithuania in order to take part in an Open Theses workshop locally hosted by Daniel Mietchen and Alfredo Ferreira (funded by the JISC Open Bib project to attend in person). During the workshop we began laying the foundations for open theses in Europe, discussing with current and recently finished postgraduate students and collecting data from those present and from anyone else interested. As described by Peter prior to the event:
As part of our JISCOpenBIB project we are running a workshop on Open Theses at EURODOC 2011. “We” is an extended community of volunteers centered round the main JISC project. In that project we have developed an approach to the representation of Open Bibliographic metadata, and now we are extending this to theses.

Why theses? Because, surprisingly, many theses are not easily discoverable outside their universities. So we are running the workshop to see how much metadata we can collect on European theses. Things like name, university, subject, date, title – standard metadata.

We have the beginnings of a dataset at: https://spreadsheets.google.com/ccc?key=0AnCtSdb7ZFJ3dHFTNDhJU0xfdGhIT01WeTBMMDZWOGc&hl=en_GB&authkey=CJuy4owB The content of this datasheet will hopefully be used to populate an open theses collection in bibliographica, and in addition it is powering a mashup that will allow us to view at a glance the theses that have been published across the world, and where possible a link to the work itself: http://benosteen.com/eurodoc.html We also have a survey to fill in, to collect opinion around copyright issues for current / soon to be published theses, based at: http://openbiblio.net/opentheses-survey/ The data collected by this survey is available at: https://spreadsheets.google.com/ccc?key=0AnCtSdb7ZFJ3dDN1cHQ3TDJpYWRaWmkxWlFDS2lMWXc&hl=en_GB&authkey=CMKN-O8I#gid=0