
Open source development – how we are doing

- May 29, 2012 in BibServer, JISC OpenBib, jiscopenbib2, licensing, progress, progressPosts, projectMethodology, projectPlan, riskAnalysis, software, WIN, wp10, wp2, wp3, wp6, wp9

Whilst at Open Source Junction earlier this year, I talked to Sander van der Waal and Rowan Wilson about the problems of doing open source development. Sander and Rowan work at OSS Watch, whose aim is to make sure that open source software development delivers on its potential for UK higher education institutions and research; so I thought it would be good to get their feedback on how our project is doing, and on whether there is anything we are getting wrong or could improve. It struck me that, as other JISC projects such as ours are required to make their outputs similarly publicly available, this discussion may be of benefit to others; after all, not everyone knows what open source software is, let alone the complexities that can arise from trying to create it. Whilst we cannot avoid all such complexities, we can at least detail what we have found helpful to date, and how OSS Watch view our efforts.

I provided Sander and Rowan with a review of our project, and Rowan gave some feedback confirming that overall we are doing a good job, although we lack a listing of the other open source software our project relies on, and of its licences. Whilst such information can be discerned from the dependencies of the project, this is not clear enough; I will add a written list of dependencies to the README. The response we received is provided below, followed by the overview I initially provided, which briefly describes how we have managed our open source development efforts.

Rowan Wilson, OSS Watch, responds:

Your work on this project is extremely impressive. You have the systems in place that we recommend for open development and for creation of community around software, and you are using them. As an outsider I am able to see quickly that your project is active, and the mailing list and roadmap present information about ways in which I could participate. One thing I could not find, although this may be my fault, is a list of third party software within the distribution. This may well be because there is none, but it is something I would generally be keen to see for the purposes of auditing licence compatibility. Overall, though, I commend you on how tangible and visible the development work on this project is, and on the focus on user-base expansion that is evident on the mailing list.

Mark MacGillivray wrote:

Background – May 2011, OKF / AIM BibServer project

Open Knowledge Foundation contracted with the American Institute of Mathematics, under the direction of Jim Pitman of the Mathematics and Statistics departments at UC Berkeley. The purpose of the project was to create an open source software repository named BibServer, and to develop a software tool that could be deployed by anyone requiring an easy way to publish and share bibliographic records online. A repository was created at http://github.com/okfn/bibserver, and it performs the usual logging of commits and other activities expected of a modern DVCS. This work was completed in September 2011, and the repository has been available since the start of that project with a GNU Affero GPL v3 licence attached.

October 2011 – JISC Open Biblio 2 project

The JISC Open Biblio 2 project chose to build on the open source software tool BibServer. As there was no support from AIM for maintaining the BibServer repository, the project took on maintenance of the repository and all further development work, with no change to the previous licence conditions. We made this choice because we perceive open source licensing as a benefit rather than a threat; it fitted very well with the requirements of JISC and with the desires of the developers involved in the project. At worst, an owner may change the licence attached to some software, but even in such a situation we could continue our work by forking from the last available open source version (presuming that licence conditions cannot be altered retrospectively). The code continues to display the licence under which it is available, and remains publicly downloadable at http://github.com/okfn/bibserver. Should this hosting resource become publicly unavailable, an alternative public host would be sought. Development work and discussion have been managed publicly, via a combination of the project website at http://openbiblio.net/p/jiscopenbib2, the issue tracker at http://github.com/okfn/bibserver/issues, a project wiki at http://wiki.okfn.org/Projects/openbibliography, and a mailing list at openbiblio-dev@lists.okfn.org.

February 2012 – JISC Open Biblio 2 offers bibsoup.net beta service

In February the JISC Open Biblio 2 project announced a beta service, available online for free public use at http://bibsoup.net. The website runs an instance of BibServer, and highlights that the code is open source and available (linking to the repository) to anyone who wishes to use it.

Current status

We believe that we have made sensible decisions in choosing open source software for our project, and have made every effort to promote the fact that the code is freely and publicly available. We have found the open source development paradigm to be highly beneficial: it has enabled us to share publicly all the work we have done on the project, increasing engagement with potential users and collaborators, and it has allowed us to take advantage of other open source software during the project, incorporating it into our work to enable faster development and improved outcomes. We continue to develop code for the benefit of people wishing to publish and share their bibliographies online, and all our outputs will continue to be publicly available beyond the end of the current project.

Collections in Bibliographica: unsorted information is not information

- June 12, 2011 in inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, News, OKFN Openbiblio, progressPosts, Semantic Web, WIN

Collections are the first feature of Bibliographica aimed at user participation. Collections are lists of books that users can create and share with others, and they are one of the basic features of Bibliographica, as Jonathan Gray has already pointed out:
lists of publications are an absolutely critical part of scholarship. They articulate the contours of a body of knowledge, and define the scope and focus of scholarly enquiry in a given domain. Furthermore such lists are always changing.
Details of use

Collections are accessible via the collections link in the top menu of the website. To create collections you must be logged in; you can log in at http://bibliographica.org/account/login with an OpenID. Once logged in, every time you open a book page (e.g. http://bnb.bibliographica.org/entry/GB6502067) you will see the Collections menu on the right, where you can choose between creating a new collection containing that work, or adding the work to an existing collection. Any collections you have created can always be accessed through the menu, and they will also appear on your account page. To remove a book from a collection, click remove in the collection listing in the sidebar.

Collections screencast

Medline dataset

- May 23, 2011 in announcement, Bibliographic, communityBenefits, Data, inf11, institutionalBenefits, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, News, OKFN Openbiblio, progress, progressPosts, Semantic Web, WIN

Announcing the CC0 Medline dataset

We are happy to report that we now have a full, clean public domain (CC0) version of the Medline dataset available for use by the community.

What is the Medline dataset?

The Medline dataset is a subset of bibliographic metadata covering approximately 98% of all PubMed publications. The dataset comes as a package of approximately 653 XML files, listing records chronologically by the date each record was created. There are approximately 19 million publication records. Medline is a maintained dataset, and updates are appended chronologically to the current dataset. Read our explanation of the different PubMed datasets for further information.

Where to get it

The raw dataset can be downloaded from CKAN: http://ckan.net/package/medline

What is in a record

Most records contain useful non-copyrightable bibliographic metadata such as author, title, journal, and PubMed record ID. Many also have DOIs. We have stripped out any potentially copyrightable material, such as abstracts. Read our technical description of a record for further information.
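To illustrate the kind of cleaning described above, here is a rough sketch of extracting only the non-copyrightable fields from a Medline-style citation. The XML fragment and values are invented for illustration, and this is not the project's actual cleaning code; real Medline files are far larger and richer than this.

```python
# Sketch: keep bibliographic metadata (PMID, title, journal), and deliberately
# do not extract the abstract, which may be copyrightable.
import xml.etree.ElementTree as ET

sample = """
<MedlineCitation>
  <PMID>12345678</PMID>
  <Article>
    <ArticleTitle>An example title</ArticleTitle>
    <Abstract><AbstractText>Copyrightable text, to be dropped.</AbstractText></Abstract>
    <Journal><Title>Example Journal</Title></Journal>
  </Article>
</MedlineCitation>
"""

def clean(xml_text):
    cite = ET.fromstring(xml_text)
    return {
        "pmid": cite.findtext("PMID"),
        "title": cite.findtext("Article/ArticleTitle"),
        "journal": cite.findtext("Article/Journal/Title"),
        # the abstract is intentionally never read into the output
    }

rec = clean(sample)
print(rec)
```

In a real run this function would be applied to every citation element across all the XML files in the package.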

Sample usage

We have made an online visualisation of a sample of the Medline dataset. Note, however, that the visualisation relies on WebGL, which is not yet supported by all browsers; it should work in Chrome and probably Firefox 4. This is just one example, but it shows what great things we can build and learn from when we have open access to the necessary data.

OpenBiblio workshop report

- May 9, 2011 in Bibliographic, BibServer, communityBenefits, Data, event, inf11, jisc, JISC OpenBib, jiscEXPO, jiscLMS, jiscopenbib, OKFN Openbiblio, progress, progressPosts, rdf, Semantic Web, WIN

#openbiblio #jiscopenbib The OpenBiblio workshop took place on 6th May 2011 at the London Knowledge Lab.

Participants

  • Peter Murray-Rust (Open Bibliography project, University of Cambridge, IUCr)
  • Mark MacGillivray (Open Bibliography project, University of Edinburgh, OKF, Cottage Labs)
  • William Waites (Open Bibliography project, University of Edinburgh, OKF)
  • Ben O’Steen (Open Bibliography project, Cottage Labs)
  • Alex Dutton (Open Citation project, University of Oxford)
  • Owen Stephens (Open Bibliographic Data guide project, Open University)
  • Neil Wilson (British Library)
  • Richard Jones (Cottage Labs)
  • David Flanders (JISC)
  • Jim Pitman (Bibserver project, UCB) (remote)
  • Adrian Pohl (OKF bibliographic working group) (remote)
During the workshop we covered some key areas where we have seen some success already in the project, and discussed how we could continue further.

Open bibliographic data formats

In order to ensure successful sharing of bibliographic data, we require agreement on a suitable yet simple format via which to disseminate records. Whilst representing linked data is valuable, it also adds complexity; simplicity, however, is key to ensuring uptake and enabling easy front-end development. Whilst the data is available as RDF/XML, JSON is now a very popular format for data transfer, particularly where front-end systems are concerned. We considered various JSON linked data formats, and have implemented two for further evaluation. To make this development work as widely applicable as possible, we wrote parsers and serialisers for JSON-LD and RDF/JSON as plugins for the popular RDFlib library.

The RDF/JSON format is, of course, RDF; it therefore requires no further change to handle our data, and our RDF/JSON parser and serialiser are already complete. However, it is not very JSON-like, as data takes the subject(predicate(object)) form rather than the more common key:value form. This is where JSON-LD can improve the situation: it lists information in a more key:value-like format, making it easier to use for front-end developers not interested in the RDF relations. But this comes at the cost of additional complexity in the spec and in parsing requirements, so we have some further work to complete:

  • remove angle brackets from blank nodes
  • use type coercion to move types out of the main code
  • use language coercion to omit languages

Our code is currently available in our repository, and once the parsers and serialisers are complete (they are still in development at present) we will request that they be added to RDFlib or RDFextras.
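To make the contrast between the two formats concrete, here is a rough sketch of how a single bibliographic statement looks in each; the URIs and values are illustrative, not the project's actual output.

```python
# The same statement ("book 1 has title 'Open Bibliography'") in two JSON forms.

# RDF/JSON keeps the subject -> predicate -> object nesting of RDF:
rdf_json = {
    "http://example.org/book/1": {
        "http://purl.org/dc/terms/title": [
            {"type": "literal", "value": "Open Bibliography"}
        ]
    }
}

# JSON-LD flattens the same statement into key:value pairs, with the RDF
# mapping moved into a @context that front-end code can simply ignore:
json_ld = {
    "@context": {"title": "http://purl.org/dc/terms/title"},
    "@id": "http://example.org/book/1",
    "title": "Open Bibliography",
}

# A developer uninterested in the RDF relations reads the JSON-LD form directly:
print(json_ld["title"])
```

The RDF/JSON form round-trips to triples trivially, while the JSON-LD form is what a typical front-end developer would prefer to consume.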
To further assist in representing bibliographic information in JSON, we also intend to implement BibJSON within JSON-LD; this should provide the necessary linked data functionality, where needed, via JSON-LD support, whilst also enabling simpler representation of bibliographic data via key:value pairs where that is all that is required. By making these options available to our users, we will be able to gauge the most popular representation format.

Regardless of the format used, a critical consideration is that of stable references to data; without them, maintaining datasets will be very hard. The British Library data, for example, does not yet have suitable identifiers. However, the BL are moving forward with applying identifiers and will be issuing a new version of their dataset soon, which we will take as a new starting point. We have provided a list of records that we have identified as non-unique, and in turn the BL will share the tools they use to manage and convert data where possible, to enable better community collaboration.
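As an illustration of the simpler key:value style that BibJSON aims for, here is a minimal record sketch; the field values are hypothetical, and the exact field set is an assumption rather than a fixed schema.

```python
# A minimal BibJSON-style record: plain key:value pairs for the common
# fields, with identifiers kept as a typed list. Example values are invented.
record = {
    "title": "An example article",
    "author": [{"name": "A. Author"}, {"name": "B. Author"}],
    "journal": {"name": "Journal of Examples"},
    "year": "2011",
    "identifier": [{"type": "doi", "id": "10.1234/example"}],
}

# Front-end code can read fields directly, without any RDF machinery:
doi = next(i["id"] for i in record["identifier"] if i["type"] == "doi")
print(doi)
```

A JSON-LD @context could later be attached to such a record to map these keys onto linked data vocabularies, without changing how simple consumers read it.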

Getting more open datasets

We are building on the success of the BL data release by continuing work on our CUL and IUCr data, and also by obtaining more datasets. The latest is the Medline dataset; there were some initial issues with properly identifying this dataset, so we have published a previous blog post, along with links to further information: the Medline DTD and the specifications of the PubMed data elements.

The Medline dataset

We are very excited to have the Medline dataset; we are currently working on cleaning it so that we can provide access to all the non-copyrightable material it contains, which should represent a listing of about 98% of all articles published in PubMed. The Medline dataset comes as a package of approximately 653 XML files, listing records chronologically by the date each record was created. This also means that further updates will be trackable, as they will be appended to the current dataset. We have found that most records contain useful non-copyrightable bibliographic metadata such as author, title, journal, and PubMed record ID, and that some contain further metadata, such as citations, which we will remove. Once this is done, and we have checked that the IDs are unique (e.g. that the PubMed IDs are unique), we will make the raw CC0 collection available, then attempt to get it into our Bibliographica instance. We will then also be able to generate visualisations of our total dataset, which we hope will be approaching 30 million records by the end of the JISC Open Bibliography project.
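The uniqueness check mentioned above can be sketched very simply; the records here are invented, and in practice the check would run over PubMed IDs gathered from all the XML files.

```python
# Sketch: confirm no PubMed ID appears twice across the cleaned records.
from collections import Counter

records = [
    {"pmid": "1", "title": "First"},
    {"pmid": "2", "title": "Second"},
    {"pmid": "1", "title": "Accidental duplicate of the first"},
]

counts = Counter(r["pmid"] for r in records)
duplicates = sorted(pmid for pmid, n in counts.items() if n > 1)
print(duplicates)
```

Any IDs reported here would need resolving (or listing as known non-unique records, as was done with the BL data) before the collection is published.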

Displaying bibliographic records

Whilst Bibliographica allows for display of individual bibliographic records and enables building collections of such records, it does not yet provide a means of neatly displaying lists of bibliographic records. We have partnered with Jim Pitman of UC Berkeley to develop his BibServer to fit this requirement, and also to bring further functionality such as search and faceted browse. This also provides a development direction for the output of the project beyond the July end date of the JISC Open Bibliography project.

Searching bibliographic records

Given the collaboration between Bibliographica and BibServer on the collection and display of bibliographic records, we are also considering ways to enable search across the non-copyrightable bibliographic metadata relating to any published article. We believe this may be achievable by building a collection of DOIs with relevant metadata, and enabling crowdsourcing of updates and comments. This effort is separate from the main development of the project; however, it would make a very good addition both to the functionality of the developed software and to the community. It would also tie in with any future functionality that enables author identification and information retrieval, such as ORCID, and allow us to build on the work done at sites such as BIBKN.

Disambiguation without deduplication

There have been a number of experiments recently highlighting the fact that a simple Lucene search index over datasets tends to give better matches than more complex methods of identifying duplicates. Ben O’Steen and Alex Dutton both provided examples of this from their work with the Open Citation project. This is also supported by a recent paper by Jeff Bilder entitled “Disambiguation without Deduplication” (not publicly available).

The main point here is that instead of deduplicating objects we can simply perform machine disambiguation and make sameAs-ness assertions between multiple objects; this enables changes to still be applied to different versions of an object by disparate groups (e.g. where each group has a different spelling or identifier for some key part of the record) whilst still maintaining a relationship between the objects. We could build on this sort of functionality by applying expertise from the library community if necessary, although deduplication/merging should only be contemplated where a new dataset is being formed that some agent is taking responsibility for curating. If not, it is better simply to cluster the data via sameAs assertions, and to keep track of who is making those assertions, in order to assess their reliability.

We suggest a concept for increasing collaboration on this sort of work: a ReCaptcha of identities. Upon login, perhaps to Bibliographica or another relevant system, a user could be presented with two questions, one of which we know the answer to, the other being a request to match identical objects. This, in combination with decent open source software tools for bibliographic data management (building on tools such as Google Refine and Needlebase), would allow for simple, verifiable disambiguation across large datasets.
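The "cluster rather than merge" idea above can be sketched in a few lines: records are never rewritten, sameAs assertions link them, and each assertion records who made it so its reliability can be assessed later. The identifiers and data structures here are illustrative, not project code.

```python
# Two records for (plausibly) the same work, held by different groups,
# each keeping its own spelling; neither is ever merged or altered.
records = {
    "bl:001": {"title": "On Growth and Form"},
    "cul:9": {"title": "On growth & form"},
}

same_as = []  # list of (id_a, id_b, asserted_by)

def assert_same(a, b, who):
    """Record a sameAs assertion, keeping track of who made it."""
    same_as.append((a, b, who))

def cluster(record_id):
    """All record ids linked to record_id by sameAs assertions."""
    found = {record_id}
    todo = [record_id]
    while todo:
        cur = todo.pop()
        for a, b, _ in same_as:
            if cur in (a, b):
                other = b if cur == a else a
                if other not in found:
                    found.add(other)
                    todo.append(other)
    return found

assert_same("bl:001", "cul:9", "match-bot")
print(cluster("bl:001"))
```

Because the asserter is stored alongside each link, unreliable sources can later be discounted by filtering `same_as` before clustering, without touching the underlying records.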

Sustaining open bibliographic data

Having had success in getting open bibliographic datasets and prototyping their availability, we must consider how to maintain long term open access. There are three key issues:

Continuing community engagement

We must continue to work with the community, and to provide explanatory information to those needing to make decisions about bibliographic data, such as the OpenBiblio Principles and the Open Bibliographic Data guide. We must also ensure we improve resource discovery by supporting the requirement for generating collections and searching content. Additionally, quality bibliographic data should be hosted at some key sites – there are a variety of options, such as Freebase, CKAN and Bibliographica – but we must also ensure that community members can be engaged both to manage records within these central options and to provide access to smaller distributed nodes, where data can be owned and maintained at the local level whilst being discoverable globally.

Maintaining datasets

Dataset maintenance is critical to ongoing success – stale data is of little use to anyone, and disregard for content maintenance will put off new users. We must co-ordinate with source providers such as the BL by accepting changesets from them and incorporating them into other versions. This is already possible with the Medline data, for example, and will very soon be the case with BL updates too. We should advocate this method of dataset updates during any future open data negotiations; it will allow us to keep our datasets fresh and relevant, and to properly represent growing datasets. We must continue to promote open access to non-copyrightable datasets, and ensure that there is a location, such as CKAN, where open data providers can easily make their raw datasets available. We will ensure that all the software we have developed during the course of the project – and in future – remains open source and publicly available, so that anyone can perform the transforms and services that we can perform.
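The changeset model described above amounts to keying records by a stable identifier and letting provider updates either amend an existing record or append a new one, as with the chronologically-appended Medline updates. A minimal sketch, with invented record IDs:

```python
# Sketch: incorporate a provider changeset into a local copy of a dataset.
dataset = {
    "pmid:1": {"title": "Old title"},
    "pmid:2": {"title": "Unchanged"},
}

changeset = {
    "pmid:1": {"title": "Corrected title"},  # amends an existing record
    "pmid:3": {"title": "New record"},       # appends a new record
}

def apply_changeset(data, changes):
    """New records are appended; existing ones are replaced by their update."""
    for record_id, record in changes.items():
        data[record_id] = record
    return data

apply_changeset(dataset, changeset)
print(sorted(dataset))
```

This only works if identifiers are stable across releases, which is exactly why the earlier discussion of BL identifiers matters.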

Community involvement with dataset maintenance

We should support community members who wish to take responsibility for overseeing the updating of datasets. Such people are critical for long-term sustainability, but hard to find; they need to be recruited and provided with simple tools which empower them to easily maintain and share the datasets they care about, with a minimal time commitment. We must therefore make sure that our software and tools are not only open source, but usable by non-team members. We will work on developing tools such as the ReCaptcha-style disambiguation described above, and on building game / rank table functionality for those wishing to participate in entity disambiguation (in addition to machine disambiguation).

Critical mass

We hope that by providing almost 30 million records to the community under a CC0 licence, with the support of all the providers that made this possible, we will achieve a critical mass of data and provide an exemplar for future open access to such data. This should become the go-to list for such information, and inspire others to contribute and maintain it. However, such community assistance will only continue for as long as there appears to be reasonable maintenance of the corpus and software we have already developed – if this slips into disrepair, community engagement is far less likely.

Maintaining services

The Bibliographica service we currently run already requires significant hardware. Once we add the Medline data, we will require very large indexes, demanding a great deal of RAM and fast disks. There is therefore a long-term maintenance requirement implicit in running any such central service of open bibliographic data at this scale. We will present a case for ongoing funding requirements and seek sources of financial support, both for technical maintenance and for ongoing software maintenance and community engagement.

Business cases

In order to ensure future engagement with groups and business entities, we must provide clear examples of the benefits of open bibliographic data. We have already done some work on visualising the underlying data, which we will develop further for higher impact. We will identify key figures in the data that we can feed into such representations to act as exemplars. Additionally, we will continue to develop mashups using the datasets, to show the serendipitous benefit that increases exposure but is only possible with unambiguously open access to useful data.

Events and announcements

We will continue to promote our work and the efforts of our partners, and to advocate further for open bibliography, by publicising our successes so far. We will co-ordinate this with JISC, the BL, OKF and other interested groups, to ensure that the impact of announcements by all groups is enhanced. We will present our work at further events throughout the year, with attendance and sessions at OKCon, OR11 and other conferences, and by arranging further hackdays.

Bibliographica and Edinburgh International Science Festival

- April 11, 2011 in Data, event, inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, OKFN Openbiblio, progressPosts, WIN

This weekend I was trying to build a useful search tool to help my wife find interesting events at the Edinburgh International Science Festival. One problem was that the dataset was poor, and the descriptions did not always give a lot of detail. I attempted to rectify this by hooking the events up to Bibliographica. Now, you can filter events and then select “more” to see further details and a list of relevant publications based on the event speakers and the event theme; this can give a slightly better idea of what might be going on, as you can review the published work of those involved: http://eisf.cottagelabs.com. Unfortunately, the data still has quite a few errors, and I have not ensured that names tie up properly, so the results are not always perfect. But still, it is quite a good demonstration – and it would be even better with journal articles to search across.

open theses at EURODOC

- April 7, 2011 in Bibliographic, communityBenefits, inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, progress, progressPosts, WIN

#jiscopenbib #opentheses On Friday 1st April 2011, Mark MacGillivray, Peter Murray-Rust and Ben O’Steen remotely attended the EURODOC conference in Vilnius, Lithuania, to take part in an Open Theses workshop hosted locally by Daniel Mietchen and Alfredo Ferreira (who were funded by the JISC Open Bib project to attend in person). During the workshop we began laying the foundations for open theses in Europe, discussing the topic with current and recently finished postgraduate students and collecting data from those present and from anyone else interested. As Peter described prior to the event:
As part of our JISCOpenBIB project we are running a workshop on Open Theses at EURODOC 2011. “We” is an extended community of volunteers centered round the main JISC project. In that project we have developed an approach to the representation of Open Bibliographic metadata, and now we are extending this to theses.

Why theses? Because, surprisingly, many theses are not easily discoverable outside their universities. So we are running the workshop to see how much metadata we can collect on European theses. Things like name, university, subject, date, title – standard metadata.

We have the beginnings of a dataset at https://spreadsheets.google.com/ccc?key=0AnCtSdb7ZFJ3dHFTNDhJU0xfdGhIT01WeTBMMDZWOGc&hl=en_GB&authkey=CJuy4owB. The content of this datasheet will hopefully be used to populate an open theses collection in Bibliographica; in addition, it is powering a mashup that allows us to view at a glance the theses that have been published across the world, with, where possible, a link to the work itself: http://benosteen.com/eurodoc.html

We also have a survey to fill in, collecting opinion on copyright issues for current and soon-to-be-published theses: http://openbiblio.net/opentheses-survey/ The data collected by this survey is available at https://spreadsheets.google.com/ccc?key=0AnCtSdb7ZFJ3dDN1cHQ3TDJpYWRaWmkxWlFDS2lMWXc&hl=en_GB&authkey=CMKN-O8I#gid=0

JISC OpenBibliography: British Library data release

- November 17, 2010 in inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, progressPosts, WIN

The JISC OpenBibliography project is excited to announce that the British Library is providing a set of bibliographic data under the CC0 Public Domain Dedication. We have initially received a dataset consisting of approximately 3 million records, which is now available as a CKAN package. This dataset consists of the entire British National Bibliography, describing new books published in the UK since 1950; this represents about 20% of the total BL catalogue, and we are working to add further releases. In addition, we are developing sample access methods onto the data, which we will post about later this week. Agreements such as this are crucial to our community, as developments in areas such as Linked Data are only beneficial when there is content on which to operate. We look forward to announcing further releases and developments, and to being part of a community dedicated to the future of open scholarship. The usage guide from the BL follows:
This usage guide is based on goodwill. It is not a legal contract. We ask that you respect it.

Use of Data: This data is being made available under a Creative Commons CC0 1.0 Universal Public Domain Dedication licence. This means that the British Library Board makes no copyright, related or neighbouring rights claims to the data and does not apply any restrictions on subsequent use and reuse of the data. The British Library accepts no liability for damages from any use of the supplied data. For more detail please see the terms of the licence.

Support: The British Library is committed to providing high quality services and accurate data. If you have any queries or identify any problems with the data please contact metadata@bl.uk.

Share knowledge: We are also very interested to hear the ways in which you have used this data so we can understand more fully the benefits of sharing it and improve our services. Please contact metadata@bl.uk if you wish to share your experiences with us and those that are using this service.

Give Credit Where Credit is Due: The British Library has a responsibility to maintain its bibliographic data on the nation’s behalf. Please credit all use of this data to the British Library and link back to www.bl.uk/bibliographic/datafree.html in order that this information can be shared and developed with today’s Internet users as well as future generations.

Link to British Library announcement