You are browsing the archive for Adrian Pohl.

German National Library publishes 11.5 Million MARC records from national bibliography

- July 1, 2013 in Data, national library

In January 2012 the German National Library (DNB) started publishing the national bibliography as linked data under a CC0 license. Today, the DNB announced that it is also publishing the national bibliography up to the year 2011 as MARC data. The full announcement reads as follows (my own quick translation):
“All of German National Library’s title data which are offered under a Creative Commons Zero (CC0) license for free use are now available gratis as MARC 21 records. In total, these are more than 11.5 Million title records. Currently title data up to bibliography year 2011 is offered under a Creative Commons Zero license (CC0). For using the data a registration free of charge is necessary. Title data of the current and the previous year are subject to charge. The CC0 data package will be expanded by one bibliography year each first quarter of a year. It is planned to provide free access under CC0 conditions to all data in all formats in mid-2015. The German National Library thus takes into account the growing need for freely available metadata.”
As the MARC data contains much more information than the linked data (because not all MARC fields are currently mapped to RDF), this is good news for anybody who is interested in getting all the information available in the national bibliography. As the DNB still makes money by selling the national bibliography to libraries and other interested parties, it won’t release all bibliographic data up to the present day into the public domain. It’s good to see that plans already exist to switch to a fully free model in 2015. See also Lars Svensson: Licensing Library and Authority Data Under CC0: The DNB Experience (pdf).

Discovery silos vs. the open web

- June 23, 2013 in vendors

Bibliographic data that is not openly available on the web is harmful. In this post I’d like to point to a recent incident that demonstrates this: a correspondence between the board of the Orbis Cascade Alliance (“a consortium of 37 academic libraries in Oregon, Washington, and Idaho serving faculty and the equivalent of more than 258,000 full time students” (source)), Ex Libris, and EBSCO. The issue in dispute is the provision of metadata describing content provided by EBSCO to Ex Libris’ discovery tool Primo. Thanks to the Orbis Cascade Alliance, the conversation is documented on the web. (I wish more institutions would transparently document their negotiations with vendors as well as the resulting contracts…)

1. What is a discovery tool, anyway?

But first, for those who aren’t familiar with “next-generation discovery tools”, here is a short explanation of what these services are all about: Such a discovery tool provides a single interface that enables discovery of (almost) any resource a library provides access to. These are resources from its physical and electronic collections as well as electronic resources it has licensed and, furthermore, resources from openly available collections. Discovery tools are based upon a unified customized index that comprises the library’s catalog data and metadata (+ sometimes full text) from publishers and bibliographic databases. In order to pre-index content metadata and/or the full text, providers of discovery tools enter into agreements with publishers and aggregators. Libraries spend quite some money on purchasing a discovery service. These services are very popular. As of today, Marshall Breeding’s lib-web-cats directory (library web sites and catalogs) lists a total of more than 1,250 libraries using one of the four leading discovery systems: Serials Solutions’ Summon, EBSCO Discovery Service (EDS), Ex Libris’ Primo and OCLC’s WorldCat Local.

2. An overview of the “EBSCO and Ex Libris slapfight”

So, what has been going on between Orbis Cascade Alliance, EBSCO and Ex Libris? In short (thanks to the summaries provided in this thread entitled “EBSCO and Ex Libris slapfight”):
EBSCO is offering both content and a discovery tool EDS. Ex Libris would like to include at least metadata for this content in its Primo discovery layer, so that users at libraries who subscribe to the EBSCO products can find it using the library’s Primo instance. EBSCO won’t provide any data to Ex Libris, only access to the EDS API so that their content is best/only accessed via EBSCO’s own discovery tool EDS.
Here’s a more detailed overview of what happened. (You may skip this if the summary above is enough for you and continue at paragraph 3.)

May 2, 2013, Letter from Orbis Cascade Alliance to Ex Libris and EBSCO

The board of the Orbis Cascade Alliance writes to Ex Libris and EBSCO, expressing disappointment over the companies’ “failure to make EBSCO academic library content seamlessly and fully available via Ex Libris discovery services”. The Orbis Cascade Alliance estimates its payments to both companies for the coming five years at 30 million dollars and says that – if this issue is not resolved – it “will be required to reconsider the shape and scope of future business with EBSCO and Ex Libris”.

May 6, 2013, Ex Libris response to Alliance Board

Ex Libris agrees that this problem is unacceptable and blames EBSCO for no longer providing metadata to Ex Libris. After EBSCO had agreed in 2009 to provide Ex Libris with “comprehensive metadata, including subject headings, for several of the key EBSCO databases”, it changed its policy in 2010 when it launched its own discovery service EDS. From then on, according to Ex Libris, they “made EDS Discovery a requirement for users who wanted to continue this type of access. They decided to no longer enable their content for indexing in Primo and instead required that Primo users access the content only via an API.” Ex Libris calls for an agreement with EBSCO “that would provide Primo customers with the content that EBSCO itself receives from external information providers – the content you and other libraries subscribe to, for which you should have access from your discovery platform of choice.” Ex Libris states that it has “in place many such agreements with other content providers”.

May 8, 2013, EBSCO response to Alliance Board

EBSCO mentions that – while there is no agreement with Ex Libris on providing data for Ex Libris’ discovery service – they have established such agreements with several other discovery service providers including OCLC and Serials Solutions. The existing agreements clarify the use of the EDS (EBSCO Discovery Service) API to make EBSCO content available via a discovery service. EBSCO’s view is that “an API solution is superior to a solution that relies strictly on metadata for several reasons, including the fact that we do not have the rights to provide (to Ex Libris or any third party) all of the content to that we feel is necessary for a quality user experience.” In fact, libraries don’t have the right to make the content they already licensed discoverable via Primo. The reasons named by EBSCO are that (a) Primo’s relevancy ranking will “not take advantage of the value added elements of their products” and (b) users wouldn’t have an incentive to use the original databases as they think all the content is available via Primo. From this, the “user experience” would suffer. Giving optimization of the user experience as the reason, EBSCO tries to have greatest possible exclusive control over content and metadata provided by them.

May 9, 2013, Alliance Board response to EBSCO and Ex Libris

The Orbis Cascade Alliance responds to the companies’ letters:
“While these letters illustrate the nature of this continuing impasse, they do nothing to address a remarkable and unacceptable disservice to your customers. (…) Ultimately we face a business decision. The Orbis Cascade Alliance is now actively investigating options and will make decisions that may move us away from your products in order to better serve our faculty, students, and researchers. Again, we urge EBSCO and Ex Libris to quickly resolve this issue.”

May 14, 2013, Ex Libris Open Letter to the Library Community

But obviously, EBSCO and Ex Libris are far away from “resolving this issue”. In the next step, Ex Libris responds to EBSCO’s response with a “point-by-point analysis” to refute the claims made by EBSCO. I had some problems with the terminology, so here are a few words about language usage: Ex Libris distinguishes between “index-based search”, i.e. search over one central index just like a “true next generation discovery service” does it, and “API-based search”. This is a bit confusing as APIs are often based on an index so that – strictly speaking – “API-based search” and “index-based search” don’t necessarily exclude each other. But it makes a difference if a service like Primo, which is based on one index, has to use external APIs so that the service actually becomes a “metasearch tool” instead of a “true next generation discovery service”. (Talking about terminology – as I have the feeling that not all people use it in the same way as I do: I distinguish between content and (meta)data. In short and in this context, content is the thing scholars produce and read while metadata is the data that describes the content.) However, in short, Ex Libris says EBSCO would choose “not to share the content that they do have the rights to share” although this is the content that the respective library has already paid for. Ex Libris accordingly accuses EBSCO of wishing “to control the ranking of its content, which is possible through the API to EDS they require”. Interestingly, after Ex Libris states that EBSCO wants to control the ranking of its content, the letter reads:
“EBSCO clearly believes that end-users … prefer to search database silos. This runs counter to what both end-users and libraries wish to achieve with a library-based discovery service.”
As discovery services are silos themselves – only bigger than the old silos – this is actually also an argument against Primo, EDS, Summon etc. At the end of this letter Ex Libris proclaims itself the libraries’ comrade-in-arms who is fighting for their interests:
“We stand with you and continue to believe that together we can bring change such that EBSCO databases, whose rights EBSCO determines, are available to any EBSCO customer through the discovery service of choice.”
Indeed, there are overlapping interests of libraries and Ex Libris. But obviously Ex Libris is primarily following its own interest, trying to get its own discovery service populated with the relevant data. Libraries should demand more than the ability to choose one of several commercial products. Rather, anybody interested and equipped with the necessary resources and technical capabilities should be able to get the metadata for academic publications that are accessible (with or without a toll) on the web in order to build their own discovery indexes. There are already a lot of open bibliographic data sets out there. But mostly, this data comes from libraries and related institutions; you won’t find much data from the publishers’ side. What we need is more and more publishers publishing their metadata, citation data etc. openly on the web, ideally in the way Nature Publishing Group is doing it.

3. Rejecting silos, encouraging open bibliographic data

The conflict between EBSCO and Ex Libris is just another indicator of how important it is to move away from closed content and discovery silos to web-integrated, openly available bibliographic data. At the very least, this would make it easier to get hold of the metadata for content provided via the web. Although bibliographic data for the majority of resources published in the past as print-only wouldn’t be covered, this constitutes a future-proof way of providing bibliographic data on the web. Rurik Greenall (who wrote about NTNU’s LOD activities on openbiblio.net some time ago and from whom I took the “future-proof” terminology) recently summed it up nicely in his alternative “for the hard of understanding” version of his talk “Making future-proof library content for the Web” at this year’s ELAG conference. Libraries and publishers (and anybody else using the web as a publication platform) should acknowledge how the web works and make their content and data available using persistent HTTP-URIs as identifiers and serving content and metadata using standards like HTML, PDF/A, TIFF+XMP, JPEG+XMP, JSON-LD, RDFa. In regard to discovery tools, Rurik provides the following conclusion of his talk packed as a rhetorical question:
“If you pay money to a content provider that is also a metadata provider and then buy a search index from them what motivation do they have to present their content in a findable way on the web?”
Accordingly, Rurik ends his talk by stating that librarians should ask themselves:
  • Do we deliver content to the web in the described way? (i.e. using persistent HTTP-URIs as identifiers and open standards like HTML, PDF/A, TIFF+XMP, JSON-LD etc.)
  • Do we subscribe to a service that does the exact opposite?
Unfortunately, many librarians – especially on the management level – are not aware of the importance of applying web standards and publishing open data. A lot of persuasion has to be done until this thinking becomes part of a broader mindset and non-open forms of publishing metadata and providing discovery tools won’t pay off anymore.
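To make the recommendation above a bit more tangible, here is a minimal sketch of what a bibliographic description with a persistent HTTP-URI as identifier could look like when serialized as JSON-LD. The use of the schema.org vocabulary, the example.org URI, and the sample title and author are all assumptions made for illustration only; they are not taken from Rurik’s talk.

```python
import json


def make_jsonld_record(uri, title, author, year):
    """Build a minimal JSON-LD description of a book.

    The schema.org vocabulary is an assumption chosen for this sketch;
    any other RDF vocabulary could be used in the @context instead.
    """
    return {
        "@context": "http://schema.org/",
        "@id": uri,  # the persistent HTTP-URI identifying the resource
        "@type": "Book",
        "name": title,
        "author": {"@type": "Person", "name": author},
        "datePublished": str(year),
    }


# Hypothetical sample values, for illustration only.
record = make_jsonld_record(
    "http://example.org/resource/123",
    "An Example Title",
    "Jane Doe",
    2013,
)

if __name__ == "__main__":
    print(json.dumps(record, indent=2))
```

The point of the sketch is only that the identifier is an ordinary HTTP-URI that can be looked up on the web, and that the description is plain JSON that any web client can consume.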

4. Guiding the way?

Carl Grant also published a blog post on the topic last week that is worth reading. I agree with him when he says “we need to define the guidelines under which we’ll buy products and services dealing with content, content enhancements, and discovery services.” The International Group of Ex Libris Users (Igelu) yesterday took a first step toward such guidelines by proposing a clause libraries should add to their contracts with content providers. Unfortunately, this proposal doesn’t go very far as it would only enable the indexing of “citation metadata (including without limitations subject headings and keywords), abstract and full-text, all as available” by “Discovery Service Providers”. Nowhere is it made clear who falls under this concept of “Discovery Service Provider”. For example, it isn’t clear at all whether a library consortium wanting to index rich metadata that its members have subscribed to is also regarded as a “Discovery Service Provider”. If you advocate open bibliographic data you should object to the notion that bibliographic data be made available only to the exclusive club of “Discovery Service Providers”. Instead, anybody interested in providing a service, running some analytics or doing whatever else with that data should be able to collect it. It’s up to the advocates of open bibliographic data to participate in the development of guidelines for licensing content and discovery services.

Minutes: 28th Virtual Meeting of the OKFN Working Group for Open Bibliographic Data

- February 6, 2013 in minutes, OKFN Openbiblio

Date: February 5th, 2013, 16:00 GMT
Channels: Meeting was held via Skype and Etherpad

Participants

  • Adrian Pohl
  • Karen Coyle
  • Tom Johnson
  • Tom Morris
  • On the Etherpad:
    • Peter Murray-Rust
    • Mark McGillivray

Agenda

  • As two new participants attended the meeting (who had already engaged in discussions on the mailing list, though), everybody introduced themselves. The “new” participants were:
    • Tom Morris: “Tom Morris is the top external data contributor to Freebase and has contributed more than 1.6 million facts. He’s been a member of the Freebase community for several years. When not hacking on Freebase, Tom is an independent software engineering and product management consultant.” (taken from here, shortened and updated)
    • Tom Johnson: “Thomas Johnson is Digital Applications Librarian at Oregon State University Libraries, where he works on digital curation, scholarly publication, and related metadata and software issues.”

Bibframe and data licensing

  • Adrian started a discussion on the bibframe list, see here.
  • Karen: It isn’t clear to me how BIBFRAME will be documented, and whether that documentation will be sufficient to process data. Note that RDA (the cataloging rules) is not freely available, therefore if BIBFRAME does develop for RDA there may be conflicts relating to text such as term definitions.
    • This addresses licensing of the Bibframe spec, not the bibliographic data, but may be a problem in the future if Bibframe re-uses content from the RDA spec.
  • Tom Morris: Licensing policy seems to be orthogonal to modelling process
  • Conclusion: We’ll wait as a working group and not push the LoC further towards open data.
  • Tom Morris: We should think about lobbying for making the process more open.
  • Tom Morris: German National Library and other early experimenters with Bibframe should put their code up on GitHub to bring the development forward

Bibliographic Extension for schema.org (schemabibex)

  • See minutes of last meeting for background information.
  • The work is moving forward to create more schema.org properties for bibliographic data — but so far not including journal articles
  • Library view point predominates at schemabibex group, scientists’ view point isn’t represented
  • Karen: Somebody from the scientific community should join schemabibex or start a separate effort. <– Maybe people from scholarlyhtml?

NISO Bibliographic Meeting

  • http://www.niso.org/topics/tl/BibliographicRoadmap/
  • NISO has a grant to hold a meeting of "interested parties" relating to bibliographic data.
  • Goes back to effort of Karen Coyle and another person to include other producers of bibliographic data than libraries (publishers, scientists etc.) in developments of future standards for bibliographic data (like Bibframe).
  • See also the thread on the openbiblio list. tfmorris: As much of the information as possible should be published online.
  • Meeting will be held in March or April in Washington D.C.
  • Interested parties can participate in the initial meeting but there's no/little funding. (See this email for the proposed dates of the meeting.)
  • "We are planning to have a live-stream of the event, presuming there is sufficient bandwidth at the meeting site."

BiblioHackfests

  • Peter Murray-Rust wrote before the meeting: "I'd like to run a hackfest (in AU) later this month and make Bib an important aspect. Can we pull together a "hacking kit" for such an event (e.g. examples of BibJSON, some converters, a simple BibSoup, etc.)"
    • Mark McGillivray responded: "yes: I will write a blog post that explains bibsoup a bit more, and we could use a google spreadsheet for simple collection of records."

BibJSON

  • Tom Morris had two questions regarding BibJSON, to which Mark provided some answers on the etherpad.
  • Q: What is being done to promote adoption?
    • Mark says: "I and others continue to use bibjson and promote it on our projects. it is now being used by the open citations project and there will be updates to bibjson.org soon with further recommendations – mostly around how to specify provenance in a bibjson record. Also we have agreed with crossref for them to output bibjson – it needs some fixes to be correct, but is just about there."
  • Q: What tool support is available? (Mendeley, Zotero, converters, etc)
    • Mark says: "The translators are currently unavailable – they will soon be put up at a separate url for translating files to bibjson which can then be used in bibsoup. Mendeley, Zotero etc can all output bib collections in formats that we can already convert, so there is support in that sense. Separating out the translators will also make it easier for people to implement their own."
  • Tom Morris: There's PR value in having BibJSON listed at https://github.com/zotero/translators
  • Ways of promoting BibJSON:
    • Articles: Tom Johnson published an article on BibJSON application in code4lib journal: http://journal.code4lib.org/articles/7949
    • Talks: e.g. at code4lib (Tom Johnson will be there and might give a lightning talk mentioning BibJSON).
    • Adoption: CrossRef would be a great addition. Need more services like Mendeley, Zotero, Open Library, BibSonomy etc. to support BibJSON (input/output)
  • Tom Johnson asks: What is the motivation to provide BibJSON output?
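For readers who have not seen BibJSON yet, the following is a small sketch of what a record might look like when built in Python. The field names (`title`, `author`, `year`, `identifier`, `link`) follow the conventions as I understand them from bibjson.org and should be checked against the documentation there; the helper function name and all sample values are hypothetical and were not part of the meeting.

```python
import json


def make_bibjson_record(title, authors, year, doi=None, url=None):
    """Build a dict shaped like a BibJSON record.

    Field names are assumed from the bibjson.org conventions;
    verify them there before relying on this shape.
    """
    record = {
        "type": "article",
        "title": title,
        "author": [{"name": name} for name in authors],
        "year": str(year),
    }
    if doi:
        record["identifier"] = [{"type": "doi", "id": doi}]
    if url:
        record["link"] = [{"url": url}]
    return record


# Hypothetical sample values, for illustration only.
example = make_bibjson_record(
    "An Example Article",
    ["Jane Doe", "John Roe"],
    2013,
    doi="10.1234/example",
    url="http://example.org/article/1",
)

if __name__ == "__main__":
    print(json.dumps(example, indent=2))
```

Since a record is just a JSON object, a simple collection for a hackfest could be a plain list of such dicts written to a file.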

Open Library

  • Speaking about BibJSON adoption, we came to talk about what will happen to the Open Library. Karen gave a short summary of the future plans for Open Library:
    • Open Library currently has no assigned staff resources. Open Library is being integrated into the whole Internet Archive system and may cease using the current infogami platform. It isn't clear if the same UI will be available, nor if there will be any further development in terms of features such as APIs.
    • No batches of records (LC books records or Amazon records) have been loaded since mid-2012.
    • Tom Morris is primarily interested in the data and the process to reconcile it etc. but he also emphasizes the value of the brand and the community.
    • Karen: infogami is interesting as a flexible development platform that sits on a triple store: http://infogami.org/
    • Tom Johnson: What can we do regarding Open Library?
      • Karen: Set up a mirror?
      • Make records for free ebooks available as MARC so that libraries can integrate these into their catalogue. <– Tom Morris would help with that.

Public Domain Books/authors

Minutes: 27th Virtual Meeting of the OKFN Working Group for Open Bibliographic Data

- January 10, 2013 in minutes, OKFN Openbiblio

Date: January 8th, 2013, 16:00 GMT
Channels: Meeting was held via Skype and Etherpad

Participants

  • Adrian Pohl
  • Peter Murray-Rust
  • Richard Wallis

Agenda

Schemabib Extension group update

  • Links:
  • W3C community and business group, started by Richard Wallis (OCLC) in September 2012
  • Conference meeting once a month
  • Idea: Get consensus across the bibliographic community about how to extend schema.org.
  • Lightweight approach, should not compete with MARC
  • Most people interested in bibliodata come from the library community. Richard tried to extend the group to other people (publishers, scholars etc.).
  • Background: OCLC is publishing Linked Data in worldcat.org using the schema.org vocabulary. schema.org was missing some properties
  • In the end: Publish extension proposal to the public-vocabs list
  • Peter comments on schema.org: schema.org is going to work because it’s built by people who know how the web works
  • Currently discussion about the concept of work and instances; FRBR comes up but such a model wouldn’t make it into schema.org
  • Richard: It makes sense to publish schema.org alongside BibFrame or RDA.
  • Peter: Talking to Mark McGillivray might make sense to find out how schema.org bibdata can relate to BibJSON and the accompanying tools.

Bibframe draft data model

GOKb (Global Open Knowledgebase)

Adrian heard about this project but could find only little information about it on the web: “Kuali OLE, one of the largest academic library software collaborations in the United States, and JISC, the UK’s expert on digital technologies for education and research, announce a collaboration that will make data about e-resources—such as publication and licensing information—more easily available. Together, Kuali OLE and JISC will develop an international open data repository that will give academic libraries a broader view of subscribed resources.
The effort, known as the Global Open Knowledgebase (GOKb) project, is funded in part by a $499,000 grant from The Andrew W. Mellon Foundation. North Carolina State University will serve as lead institution for the project.
GOKb will be an open, community-based, international data repository that will provide libraries with publication information about electronic resources. This information will support libraries in providing efficient and effective services to their users and ensure that critical electronic collections are available to their students and researchers.” from http://gokb.org/post/25021222983/gobkpressrelease
“GOKb is … focused on global-level metadata about e-resources with the goal of supporting management of those e-resources across the resource lifecycle. GOKb does not aspire to replace current vendor-provided KB products. But it does aspire to make good data available to everybody, including existing KBs, and to provide an open and low-barrier way for libraries to access this data. Our goal is that GOKb data is permeates the KB ecosystem so that all library systems, whether ILS, ERM, KB or discovery, will have better quality data about electronic collections than they do today.” From http://kualiole.tumblr.com/post/32942331929/bib-data-is-now-more-open-what-about-knowledge-base
  • The participants didn’t know much more about this initiative. Adrian will try to find out more for upcoming meetings.

Other

  • Peter briefly informed about some interesting developments:
    • Open citations: http://opencitations.wordpress.com/ (David Shotton, Oxford, UK)
    • Hargreaves report: UK government says it’s legal to mine content. See Peter’s post at http://blogs.ch.cam.ac.uk/pmr/2012/12/21/opencontentmining-massive-step-forward-come-and-join-us-in-the-uk/
    • Pubcrawler
    • Crossref biblio/citation data

Minutes: 26th Virtual Meeting of the OKFN Working Group for Open Bibliographic Data

- November 7, 2012 in minutes, OKFN Openbiblio

Date: November 6th, 2012, 16:00 GMT
Channels: Meeting was held via Skype and Etherpad

Participants

  • Adrian Pohl
  • Karen Coyle
  • Joris Pekel
  • Jim Pitman

Agenda

ORCID launched

“ORCID makes its code available under an open source license, and will post an annual public data file under a CCO waiver for free download.” (Source: http://about.orcid.org/about/what-is-orcid.)
Open Data
  • ORCID provides annual CC0 dump.
Open API
  • To try the open API, point your queries to pub.orcid.org (the documentation says something else).
  • Query biographies example:
    • curl -H 'Accept: application/orcid+xml' 'http://pub.orcid.org/search/orcid-bio?q=pohl'
  • Retrieve bio example:
    • curl -H "Accept: application/orcid+json" "http://pub.orcid.org/0000-0001-9083-7442/orcid-bio"
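The two curl calls above can also be sketched in Python using only the standard library. This is an illustration under assumptions: the host, paths and `Accept` header values are simply copied from the curl examples in these minutes, the ORCID API may well have changed since, and the function names are made up for this sketch.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Host as noted in the minutes; it may no longer be current.
BASE_URL = "http://pub.orcid.org"


def build_search_request(query):
    """Prepare a biography search request asking for the XML representation."""
    url = f"{BASE_URL}/search/orcid-bio?{urlencode({'q': query})}"
    return Request(url, headers={"Accept": "application/orcid+xml"})


def build_bio_request(orcid_id):
    """Prepare a request for a single biography, asking for JSON."""
    url = f"{BASE_URL}/{orcid_id}/orcid-bio"
    return Request(url, headers={"Accept": "application/orcid+json"})


def fetch(request, timeout=10):
    """Send a prepared request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only show what would be sent; call fetch(req) to actually query the API.
    req = build_search_request("pohl")
    print(req.full_url)
    print(req.get_header("Accept"))
```

Building the request separately from sending it makes it easy to inspect the URL and headers first, which is handy when the documentation and the actual endpoint disagree.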
Open source
Linked Open Data (Much information was taken from this twitter conversation.)
  • Karen: How can this be integrated with BibServer?
  • Jim: Could OKF pick up and post periodic dumps of ORCID data? And support a BibServer over those dumps?

HathiTrust Lawsuit

See Karen’s blog post on the topic: http://kcoyle.blogspot.de/2012/10/copyright-victories-part-ii.html.
  • Judge supports digitization for indexing as a fair use.
  • No decision on orphan works
  • Support for “just in case” digitization to serve sight impaired users
  • Support for digitization for preservation

OKFN labs for cultural activities

  • Background: Restructuring of OKF
  • Projects and tools are now pulled into OKFN labs, which will mainly focus on government and financial data: http://okfnlabs.org/
  • Rather than “orphan” the other projects, there is now another lab in development for those, including Bibserver.
  • Example projects/code and blog posts that would find their place at this “open culture lab”:
  • Joris, Sam and Etienne Posthumus working on this. Please propose projects to Joris and Sam and they can help.
  • Suggest: organize “code days” for bibliographic data

W3C working group on biblio extension to schema.org

Journal Article Tag Suite (JATS) Standard

MISC

  • Some developer lists, which are now scattered, may be merged into one. openbiblio-dev could be included in this.
  • We talked briefly about the ResourceSync effort to provide a standard for syncing web resources: http://www.niso.org/workrooms/resourcesync/

To Dos

  • Adrian will try to find time for a separate post on ORCID

Metadata for over 20 Million Cultural Objects released into the Public Domain by Europeana

- September 12, 2012 in Data, Europeana, lod-lam

Europeana today announced that its dataset comprising descriptions of more than 20 Million cultural objects is from now on openly licensed with Creative Commons’ public domain waiver CC0. From the announcement:
“Opportunities for apps developers, designers and other digital innovators will be boosted today as the digital portal Europeana opens up its dataset of over 20 million cultural objects for free re-use. The massive dataset is the descriptive information about Europe’s digitised treasures. For the first time, the metadata is released under the Creative Commons CC0 Public Domain Dedication, meaning that anyone can use the data for any purpose – creative, educational, commercial – with no restrictions. This release, which is by far the largest one-time dedication of cultural data to the public domain using CC0 offers a new boost to the digital economy, providing electronic entrepreneurs with opportunities to create innovative apps and games for tablets and smartphones and to create new web services and portals. Europeana’s move to CC0 is a step change in open data access. Releasing data from across the memory organisations of every EU country sets an important new international precedent, a decisive move away from the world of closed and controlled data.”
Thanks to all the people who made this possible! See also Jonathan Gray’s post at the Guardian’s Datablog.
Update 30 September 2012: Actually, it is not true to call this release “by far the largest one-time dedication of cultural data to the public domain using CC0”. In December 2011 two German library networks released their catalog b3kat under CC0 which by then held 22 million descriptions of bibliographic resources. See this post for more information.

Nature’s data platform strongly expanded

- July 20, 2012 in Data, News, Semantic Web

Nature has greatly expanded its Linked Open Data platform that was launched in April 2012. From today’s press release:
“As part of its wider commitment to open science, Nature Publishing Group’s (NPG) Linked Data Platform now hosts more than 270 million Resource Description Framework (RDF) statements. It has been expanded more than ten times, in a growing number of datasets. These datasets have been created under the Creative Commons Zero (CC0) waiver, which permits maximal use/reuse of this data. The data is now being updated in real-time and new triples are being dynamically added to the datasets as articles are published on nature.com. Available at http://data.nature.com, the platform now contains bibliographic metadata for all NPG titles, including Scientific American back to 1845, and NPG’s academic journals published on behalf of our society partners. NPG’s Linked Data Platform now includes citation metadata for all published article references. The NPG subject ontology is also significantly expanded. The new release expands the platform to include additional RDF statements of bibliographic, citation, data citation and ontology metadata, which are organised into 12 datasets – an increase from the 8 datasets previously available. Full snapshots of this data release are now available for download, either by individual dataset or as a complete package, for registered users at http://developers.nature.com.”
This is exciting. The commitment to real-time updates in particular is a great move and shows how serious Linked Open Data is becoming, both in general and in the realm of bibliographic data. Also, Nature now uses the Data Hub and has registered the data separated into several datasets.

Linked Data in worldcat.org

- June 23, 2012 in Data, lod-lam

This post was first published on Übertext: Blog. Two days ago OCLC announced that linked data has been added to worldcat.org. I took a quick look at it and just want to share some notes on this.

OCLC goes open, finally

I am very happy that OCLC, by using the ODC-BY license, finally managed to choose open licensing for WorldCat. Quite a change of attitude when you recall the attempt in 2008 to sneak in a restrictive viral copyright license as part of a WorldCat record policy (for more information see the code4lib wikipage on the policy change or my German article about it). Certainly, it was not least the blogging librarians and library tech people, the open access/open data proponents etc., who never stopped pushing OCLC towards openness, who made this possible. Thank you all!

Of course, this is only the beginning. For one thing, dumps of this WorldCat data aren’t available yet (see follow-up addendum here), which makes it necessary to crawl the whole of WorldCat to get hold of the data. For another, there probably is a whole lot of useful information in WorldCat that isn’t part of the linked data in worldcat.org yet.

schema.org in RDFa and microdata

What information is actually encoded as linked data in worldcat.org? And how did OCLC add RDF to worldcat.org? It used the schema.org vocabulary to add semantic markup to the HTML. This markup is added both as microdata, the native choice for the schema.org vocabulary, and as RDFa. schema.org lets people choose how to use the vocabulary; the schema.org blog recently said: “Our approach is ‘Microdata and more’. As implementations and services begin to consume RDFa 1.1, publishers with an interest in mixing schema.org with additional vocabularies, or who are using tools like Drupal 7, may find RDFa well worth exploring.”

Let’s take a look at a description of a bibliographic resource in worldcat.org, e.g. http://www.worldcat.org/title/linked-data-evolving-the-web-into-a-global-data-space/oclc/704257552. The part of the HTML source containing the semantic markup is marked as “Microdata Section” (although it also contains RDFa). As the HTML source isn’t really readable for humans, we first need to get hold of the RDF in a readable form to have a look at it. I prefer the turtle syntax for looking at RDF. One can extract the RDF contained in the HTML using the RDFa distiller provided by the W3C. More precisely, you have to use the distiller that supports RDFa 1.1, as schema.org supports RDFa 1.1 and, thus, worldcat.org is enriched according to the RDFa 1.1 standard.

Using the distiller on the example resource, I get back a turtle document that contains the following triples:

1:  @prefix library: <http://purl.org/library/> .
2: @prefix madsrdf: <http://www.loc.gov/mads/rdf/v1#> .
3: @prefix owl: <http://www.w3.org/2002/07/owl#> .
4: @prefix schema: <http://schema.org/> .
5: @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
6: <http://www.worldcat.org/oclc/707877350> a schema:Book;
7: library:holdingsCount "1"@en;
8: library:oclcnum "707877350"@en;
9: library:placeOfPublication [ a schema:Place;
10: schema:name "San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA) :"@en ];
11: schema:about [ a skos:Concept;
12: schema:name "Web site development."@en;
13: madsrdf:isIdentifiedByAuthority <http://id.loc.gov/authorities/subjects/sh98004795> ],
14: [ a skos:Concept;
15: schema:name "Semantic Web."@en;
16: madsrdf:isIdentifiedByAuthority <http://id.loc.gov/authorities/subjects/sh2002000569> ],
17: <http://dewey.info/class/025/e22/>,
18: <http://id.worldcat.org/fast/1112076>,
19: <http://id.worldcat.org/fast/1173243>;
20: schema:author <http://viaf.org/viaf/38278185>;
21: schema:bookFormat schema:EBook;
22: schema:contributor <http://viaf.org/viaf/171087834>;
23: schema:copyrightYear "2011"@en;
24: schema:description "1. Introduction -- The data deluge -- The rationale for linked data -- Structure enables sophisticated processing -- Hyperlinks connect distributed data -- From data islands to a global data space -- Introducing Big Lynx productions --"@en,
25: "The World Wide Web has enabled the creation of a global information space comprising linked documents. As the Web becomes ever more enmeshed with our daily lives, there is a growing desire for direct access to raw data not currently available on the Web or bound up in hypertext documents. Linked Data provides a publishing paradigm in which not only documents, but also data, can be a first class citizen of the Web, thereby enabling the extension of the Web with a global data space based on open standards - the Web of Data. In this Synthesis lecture we provide readers with a detailed technical introduction to Linked Data. We begin by outlining the basic principles of Linked Data, including coverage of relevant aspects of Web architecture. The remainder of the text is based around two main themes - the publication and consumption of Linked Data. Drawing on a practical Linked Data scenario, we provide guidance and best practices on: architectural approaches to publishing Linked Data; choosing URIs and vocabularies to identify and describe resources; deciding what data to return in a description of a resource on the Web; methods and frameworks for automated linking of data sets; and testing and debugging approaches for Linked Data deployments. We give an overview of existing Linked Data applications and then examine the architectures that are used to consume Linked Data from the Web, alongside existing tools and frameworks that enable these. Readers can expect to gain a rich technical understanding of Linked Data fundamentals, as the basis for application development, research or further study."@en;
26: schema:inLanguage "en"@en;
27: schema:isbn "1608454312"@en,
28: "9781608454310"@en;
29: schema:name "Linked data evolving the web into a global data space"@en;
30: schema:publisher [ a schema:Organization;
31: schema:name "Morgan & Claypool"@en ];
32: owl:sameAs <http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001> .

This looks quite nice to me. You can see how schema.org lets you easily convey the most relevant information, and the property names are well chosen to make it easy for humans to read the RDF (in contrast, e.g., to the ISBD vocabulary, which uses numbers in the property URIs following the library tradition :-/).

The example also shows the current shortcomings of schema.org and where the library community might put some effort into extending it, as OCLC has already been doing for this release with the experimental “library” extension vocabulary for use with Schema.org. E.g., there are no separate schema.org properties for a table of contents and an abstract, so both are put into strings using the same schema:description property.

Links to other linked data sources

There are links to several other data sources: LoC authorities (lines 13, 16, 41, 44), dewey.info (17), the linked data FAST headings (18, 19), viaf.org (20, 22) and an owl:sameAs link to the HTTP-DOI identifier (32). As most of these services are already run by OCLC, and as the connections probably all existed in the data already, creating these links wasn’t hard work, which of course doesn’t make them less useful.

Copyright information

What I found very interesting is the schema:copyrightYear property used in some descriptions in worldcat.org. I don’t know how many resources carry a copyright year or how accurate the data is, but this seems to me a useful source for projects like publicdomainworks.net.

Missing URIs

As with earlier publications of linked bibliographic data, some URIs are missing for things we might want to link to instead of only serving the name string of the respective entity: I am talking about places and publishers. Until now, AFAIK, URIs for publishers don’t exist; hopefully someone (OCLC perhaps?) is already working on a LOD registry for publishers. For places, we have geonames, but it is not that trivial to generate the right links. It’s not a great surprise that a lot of work remains to be done to build the global data space.

Bringing the Open German National Bibliography to a BibServer

- June 18, 2012 in BibServer, Data, event, Events, jiscopenbib2, national library, wp5

This blog post is written by Etienne Posthumus and Adrian Pohl. We are happy that the German National Library recently released the German National Bibliography as Linked Open Data (see the announcement). At the #bibliohack this week we worked on getting the data into a BibServer instance. Here, we want to share our experiences in trying to re-use this dataset.

Parsing large turtle files: problem and solution

The raw data file is 1.1 GB in a compressed format; unzipped it is a 6.8 GB turtle file. This file is unwieldy to work with: it cannot be read into memory or converted with tools like rapper (which only works for turtle files up to 2 GB, see this mail thread). Thus, it would be nice if the German National Library could either provide one big N-Triples file, which is better suited to streaming processing, or provide a number of smaller turtle files. Our solution for getting the file into a workable form was to write a small Python script that is Turtle syntax aware and splits the file into smaller pieces. You can’t use the standard UNIX split command, as each snippet of the split file also needs the prefix information at the top, and we do not want to split an entry in the middle, losing triples. See a sample converted N-Triples file from a turtle snippet.
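The splitting idea can be sketched roughly as follows. This is a minimal illustration and not the script from the hack day: the function name `split_turtle` and the chunk size are made up, and it assumes that every statement ends on a line whose last character is a dot and that the dump contains no multi-line (triple-quoted) literals.

```python
from typing import Iterable, Iterator, List


def split_turtle(lines: Iterable[str], statements_per_chunk: int) -> Iterator[str]:
    """Yield self-contained Turtle chunks that each repeat the prefix header.

    Heuristic (an assumption about the dump layout): a statement is finished
    when a line ends with "." after stripping whitespace.
    """
    prefixes: List[str] = []
    body: List[str] = []
    count = 0
    for line in lines:
        stripped = line.strip()
        # Collect prefix declarations so every chunk can start with them.
        if stripped.startswith("@prefix") or stripped.startswith("@base"):
            prefixes.append(line)
            continue
        # Skip blank lines at the start of a chunk.
        if not body and not stripped:
            continue
        body.append(line)
        if stripped.endswith("."):
            count += 1
            # Only cut between complete statements, never inside an entry.
            if count >= statements_per_chunk:
                yield "".join(prefixes + body)
                body, count = [], 0
    # Emit whatever is left over as the final chunk.
    if any(item.strip() for item in body):
        yield "".join(prefixes + body)
```

Because the function only iterates over lines, it can be fed an open file object directly, so the 6.8 GB file never has to fit into memory; each yielded chunk can then be written to its own file.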

Converting the N-Triples to BibJSON

After this, we started working on parsing an example N-Triples file to convert the data to BibJSON. We haven’t gotten that far, though. See https://gist.github.com/2928984#file_ntriple2bibjson.py for the resulting code (work in progress).
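To illustrate the general shape of such a conversion, here is a small sketch that is independent of the gist linked above. It groups N-Triples lines by subject and copies a few properties into BibJSON-like dictionaries. The predicate-to-key mapping in `FIELD_MAP` is purely an assumption for illustration, all values are collected into lists, and escape sequences other than `\"` and `\\` are not decoded.

```python
import re
from typing import Dict, Iterable, List

IRI = r"<[^>]*>"
BNODE = r"_:\S+"
LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?'
# One triple per line: subject, predicate, object, terminating dot.
TRIPLE = re.compile(
    rf"^\s*({IRI}|{BNODE})\s+<([^>]*)>\s+({IRI}|{BNODE}|{LITERAL})\s*\.\s*$"
)
LITERAL_VALUE = re.compile(r'^"((?:[^"\\]|\\.)*)"')

# Hypothetical mapping from predicate URIs to BibJSON-like keys.
FIELD_MAP = {
    "http://purl.org/dc/elements/1.1/title": "title",
    "http://purl.org/dc/terms/issued": "year",
    "http://purl.org/dc/elements/1.1/publisher": "publisher",
}


def object_value(term: str) -> str:
    """Return the lexical value of a literal, or the bare URI / blank node."""
    match = LITERAL_VALUE.match(term)
    if match:
        return match.group(1).replace('\\"', '"').replace("\\\\", "\\")
    return term.strip("<>")


def ntriples_to_records(lines: Iterable[str]) -> Dict[str, Dict[str, List[str]]]:
    """Group mapped triples by subject into one dictionary per record."""
    records: Dict[str, Dict[str, List[str]]] = {}
    for line in lines:
        match = TRIPLE.match(line)
        if not match:
            continue  # comments, blank lines, anything we cannot parse
        subject, predicate, obj = match.groups()
        key = FIELD_MAP.get(predicate)
        if key is None:
            continue  # property not covered by the mapping
        subject_id = subject.strip("<>")
        record = records.setdefault(subject_id, {"id": [subject_id]})
        record.setdefault(key, []).append(object_value(obj))
    return records
```

Since N-Triples puts exactly one triple on each line, this works line by line and would also fit a streaming setup, as long as all triples of one record end up in the same input chunk.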

Problems

We noted problems with some properties, which we would like to document here as feedback for the German National Library.

Heterogeneous use of dcterms:extent

The dcterms:extent property is used in many different ways, so we are considering omitting it in the conversion to BibJSON. Some example values of this property: “Mikrofiches”, “21 cm”, “CD-ROMs”, “Videokassetten”, “XVII, 330 S.”. It would probably be more appropriate to use dcterms:format for most of these and to limit the use of dcterms:extent to pagination information and duration.
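If one wanted to keep the usable values instead of dropping the property altogether, a rough heuristic along the lines suggested above might look like this. The patterns and the function name `classify_extent` are assumptions made up for illustration; they only cover a few German and English abbreviations and would need tuning against the real data.

```python
import re

# Pagination such as "330 S." (German "Seiten") or "12 pp." (assumed patterns).
PAGINATION = re.compile(r"\b\d+\s*(S\.|p\.|pp\.|Seiten|pages)", re.IGNORECASE)
# Durations such as "90 Min." or "2 Std" (assumed patterns).
DURATION = re.compile(r"\b\d+\s*(min|std|h)\b", re.IGNORECASE)


def classify_extent(value: str) -> str:
    """Return "extent" for pagination or duration, otherwise "format"."""
    if PAGINATION.search(value) or DURATION.search(value):
        return "extent"
    return "format"
```

With these patterns, “XVII, 330 S.” is treated as extent, while “Mikrofiches”, “21 cm”, “CD-ROMs” and “Videokassetten” fall into the format bucket.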

URIs that don’t resolve

We stumbled over some URIs that don’t resolve, whether you request RDF or HTML in the Accept header. Examples: http://d-nb.info/019673442, http://d-nb.info/019675585, http://d-nb.info/011077166. Also, DDC URIs that are connected to a resource with dcterms:subject don’t resolve, e.g. http://d-nb.info/ddc-sg/070.
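A check like this can be scripted with the Python standard library. The following is only a sketch: the function name `check_uri`, the two Accept values and the timeout are assumptions, and the `opener` parameter exists only so that the function can be tried out without network access.

```python
import urllib.error
import urllib.request

# Media types to request; the exact values are an assumption.
ACCEPT_TYPES = ("application/rdf+xml", "text/html")


def check_uri(uri, opener=urllib.request.urlopen, timeout=10):
    """Return {accept type: HTTP status code, or None if the request failed}."""
    results = {}
    for accept in ACCEPT_TYPES:
        request = urllib.request.Request(uri, headers={"Accept": accept})
        try:
            with opener(request, timeout=timeout) as response:
                results[accept] = response.status
        except urllib.error.HTTPError as err:
            # The server answered, but with an error status such as 404.
            results[accept] = err.code
        except (urllib.error.URLError, OSError):
            # DNS failure, refused connection, timeout and similar.
            results[accept] = None
    return results
```

Calling `check_uri("http://d-nb.info/019673442")` would then report, per requested media type, whether the URI resolves; redirects are followed automatically by `urlopen`.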

Footnote

At a previous BibServer hackday, we loaded the British National Bibliography data into BibServer. This was a similar problem, but as the data was in RDF/XML we could directly use the built-in Python XML streaming parser to convert the RDF data into BibJSON. See https://gist.github.com/1731588 for the source.
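For readers unfamiliar with the streaming approach, here is a minimal sketch using `xml.etree.ElementTree.iterparse` from the standard library. It is not the code from the gist: it assumes a flat file of `rdf:Description` elements and only picks up a `dcterms:title`, both of which are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
DCTERMS = "{http://purl.org/dc/terms/}"


def iter_records(source):
    """Yield one small dict per rdf:Description without loading the whole file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == RDF + "Description":
            record = {"id": elem.get(RDF + "about")}
            title = elem.findtext(DCTERMS + "title")
            if title is not None:
                record["title"] = title
            yield record
            # Drop the element's children so memory use stays low.
            elem.clear()
```

`source` can be a file name or a binary file object. Note that cleared elements still stay attached to the root as empty shells, so for very large files one might additionally remove them from the root element.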

Harvard Library releases 12M bibliographic records under CC0

- April 25, 2012 in Data, News

Harvard Library yesterday announced the release of 12 million bibliographic records into the public domain using CC0. From the announcement: “The Harvard Library announced it is making more than 12 million catalog records from Harvard’s 73 libraries publicly available. The records contain bibliographic information about books, videos, audio recordings, images, manuscripts, maps, and more. The Harvard Library is making these records available in accordance with its Open Metadata Policy and under a Creative Commons 0 (CC0) public domain license. In addition, the Harvard Library announced its open distribution of metadata from its Digital Access to Scholarship at Harvard (DASH) scholarly article repository under a similar CC0 license. ‘The Harvard Library is committed to collaboration and open access. We hope this contribution is one of many steps toward sharing the vital cultural knowledge held by libraries with all,’ said Mary Lee Kennedy, Senior Associate Provost for the Harvard Library. The catalog records are available for bulk download from Harvard, and are available for programmatic access by software applications via API’s at the Digital Public Library of America (DPLA). The records are in the standard MARC21 format.” That’s great news. There already is an entry for this dataset at the Data Hub. See also David Weinberger’s post on the data release.