
Open source development – how we are doing

- May 29, 2012 in BibServer, JISC OpenBib, jiscopenbib2, licensing, progress, progressPosts, projectMethodology, projectPlan, riskAnalysis, software, WIN, wp10, wp2, wp3, wp6, wp9

Whilst at Open Source Junction earlier this year, I talked to Sander van der Waal and Rowan Wilson about the problems of doing open source development. Sander and Rowan work at OSS Watch, whose aim is to make sure that open source software development delivers its potential to UK HEIs and research; so I thought it would be good to get their feedback on how our project is doing, and whether there is anything we are getting wrong or could improve on. It struck me that, as other JISC projects are required to make their outputs similarly publicly available, this discussion may be of benefit to others; after all, not everyone knows what open source software is, let alone the complexities that can arise from trying to create such software. Whilst we cannot avoid all such complexities, we can at least detail what we have found helpful to date, and how OSS Watch view our efforts.

I provided Sander and Rowan with a review of our project, and Rowan's feedback confirmed that overall we are doing a good job, although we lack a listing of the other open source software our project relies on, and their licences. Whilst such data can be discerned from the dependencies of the project, this is not clear enough; I will add a written list of dependencies to the README. The response we received is provided below, followed by the overview I initially provided, which briefly describes how we have managed our open source development efforts:

====

Rowan Wilson, OSS Watch, responds:

Your work on this project is extremely impressive. You have the systems in place that we recommend for open development and the creation of community around software, and you are using them. As an outsider I am able to see quickly that your project is active, and the mailing list and roadmap present information about ways in which I could participate. One thing I could not find, although this may be my fault, is a list of third-party software within the distribution. This may well be because there is none, but it’s something I would generally be keen to see for the purposes of auditing licence compatibility. Overall, though, I commend you on how tangible and visible the development work on this project is, and on the focus on user-base expansion that is evident on the mailing list.

====

Mark MacGillivray wrote:

Background – May 2011, OKF / AIM BibServer project

The Open Knowledge Foundation contracted with the American Institute of Mathematics under the direction of Jim Pitman in the dept. of Maths and Stats at UC Berkeley. The purpose of the project was to create an open source software repository named BibServer, and to develop a software tool that could be deployed by anyone requiring an easy way to put and share bibliographic records online. A repository was created at, and it performs the usual logging of commits and other activities expected of a modern DVCS. This work was completed in September 2011, and the repository has been available since the start of that project with a GNU Affero GPL v3 licence attached.

October 2011 – JISC Open Biblio 2 project

The JISC Open Biblio 2 project chose to build on the open source software tool named BibServer. As there was no support from AIM for maintaining the BibServer repository, the project took on maintenance of the repository and all further development work, with no change to the previous licence conditions. We made this choice as we perceive open source licensing as a benefit rather than a threat; it fits very well with the requirements of JISC and with the desires of the developers involved in the project. At worst, an owner may change the licence attached to some software, but even in such a situation we could continue our work by forking from the last available open source version (presuming that licence conditions cannot be altered retrospectively). The code continues to display the licence under which it is available, and remains publicly downloadable at. Should this hosting resource become publicly unavailable, an alternative public host would be sought. Development work and discussion has been managed publicly, via a combination of the project website at, the issue tracker at, a project wiki at, and a mailing list at.

February 2012 – JISC Open Biblio 2 offers beta service

In February the JISC Open Biblio 2 project announced a beta service available online for free public use at. The website runs an instance of BibServer, and highlights that the code is open source and available (linking to the repository) to anyone who wishes to use it.

Current status

We believe that we have made sensible decisions in choosing open source software for our project, and have made every effort to promote the fact that the code is freely and publicly available. We have found the open source development paradigm highly beneficial – it has enabled us to publicly share all the work we have done on the project, increasing engagement with potential users and with collaborators; we have also been able to take advantage of other open source software during the project, incorporating it into our work to enable faster development and improved outcomes. We continue to develop code for the benefit of people wishing to publicly put and share their bibliographies online, and all our outputs will continue to be publicly available beyond the end of the current project.
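The written list of dependencies and licences mentioned above could be generated from installed package metadata rather than maintained by hand. A minimal sketch, assuming a Python environment (BibServer is Python) and the standard library of Python 3.8+; this is an illustration, not the project's actual tooling:

```python
# Sketch: emit "- name version: licence" lines for every installed
# distribution, suitable for pasting into a README dependency listing.
# Packages whose metadata omits a field are reported as UNKNOWN.
from importlib import metadata

def dependency_lines():
    """Return sorted Markdown-style list lines for installed distributions."""
    lines = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"] or "UNKNOWN"
        licence = dist.metadata["License"] or "UNKNOWN"
        lines.append("- %s %s: %s" % (name, dist.version, licence))
    return sorted(lines)

if __name__ == "__main__":
    print("\n".join(dependency_lines()))
```

Running this in the project's virtualenv would list exactly the third-party software shipped with a deployment, which is the licence-compatibility audit trail Rowan asks for.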

"Bundling" instances of author names together without using owl:sameAs

- November 17, 2010 in inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, ORDF, progress, projectMethodology, rdf, Semantic Web

Bundling? It’s a verb I’ve taken from “Glaser, H., Millard, I., Jaffri, A., Lewy, T. and Dowling, B. (2008) On Coreference and The Semantic Web”, where the core idea is that you have a number of URIs that mean or reference the same real thing, and the technique they describe, bundling, aggregates all those references together. The technique is built on a sound basis in logic, and is related to (if not the same as) a congruent closure.

The notion of bundling I am using is not as rooted in mathematical logic, because I need to convey an assertion that one URI is meant to represent the same thing as another URI in a given context and for a given reason. This is a subtly different assertion from what owl:sameAs asserts, but the difference is key for me.

It is best to think through an example of where I am using this – curating bibliographic records and linking authors together. It’s an obvious desire: given a book or article, find all the other works by an author of that work. Technologically, with RDF this is a very simple proposition BUT the data needs to be there. This is the point where we come unstuck. We don’t really have data of the quality that firmly establishes that one author is the same as a number of others. String matching is not enough!

So, how do we clean up this data (converted to RDF) so that we can try to stitch together the authors and other entities in it? See this previous post on augmenting British Library metadata so that the authors, publishers and so on are externally reference-able once they are given unique URIs. This really is the key step. Any other work that can be done to make the data about the authors and so on more semantically reference-able will be a boon to the process of connecting the dots, as I have done for authors with birth and/or death dates.
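Plain string matching is not enough, but a scored comparison illustrates the kind of educated, probabilistic guess involved. A minimal sketch using the standard library; the names and the 0.9 threshold are illustrative, not part of the project's code:

```python
# Sketch: score how alike two author-name strings are, and only assert a
# probable match above a chosen confidence threshold, never a certain one.
from difflib import SequenceMatcher

def match_confidence(a, b):
    """Return a similarity score between 0.0 and 1.0 for two author names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.9  # illustrative cut-off for asserting a probable match

score = match_confidence("Glaser, H.", "Glaser, Hugh")
probable_match = score >= THRESHOLD
```

Even when the score clears the threshold, we would still want to record who or what made the assertion and why, which is exactly what the bundle model below is for.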
The fundamental aspect to realise is that we are dealing with datasets which have missing data, misrepresented data (typos), misinterpreted fields (ISBNs of £2.50, for example) and other non-uniform and irregular problems. Connecting authors together in datasets with these characteristics will rely on us, and the code we write, making educated guesses and probabilistic assertions based on how confident we are that things match. We cannot say for sure that something is a cast-iron match, only that we are above a certain threshold of confidence that this is so. We also have to have a good reason. Something else to take on board is that what I would consider a good match might not be good for someone else, so there needs to be a manner to state a connection and to record why, by whom and how the match was made, as well as a need to keep this data, made up of assertions, away from our source data.

I’ve adopted the following model for encoding this assertion in RDF, in a form that sits outside of the source data as a form of overlay data; you can find the bundle ontology I’ve used at (pay no attention to where it is currently living):

Bundle of URIs, showing use of OPMV

The URIs shown to be ‘opmv:used’ in this diagram are not meant to be exhaustive. It is likely that a bundle may depend on a look-up or resolution service, external datasheets, authority files, CSV lists, dictionary lists and so on. Note that the ‘Reason’ class has few, if any, mandatory properties aside from its connection to a given Bundle and opmv:Process. Assessing whether you trust a Bundle is, at the moment, very much based on the source and the agent that made the assertion. As things mature, more information will regularly find its place attached to a ‘Reason’ instance.

There are currently two subtypes of Reason: AlgorithmicReason and AgentReason. Straightforwardly, this is the difference between a machine-made match and a human-made match, and use of these should aid the assessment of a given match.

Creating a bundle using python: I have added a few classes to Will Waites’ excellent ‘ordf’ library, and you can find my version here. To create a virtualenv to work within, do as follows (you will need mercurial and virtualenv already installed). At a command line – eg ‘[@localhost] $’ – enter the following:
hg clone
virtualenv myenv
. ./myenv/bin/activate
(myenv) $ pip install ordf
So, creating a bundle of some URIs – “info:foo” and “info:bar” – due to a human choice of “They look the same to me :)”. In python:

from ordf.vocab.bundle import Bundle, Reason, AlgorithmicReason, AgentReason
from ordf.vocab.opmv import Agent
from ordf.namespace import RDF, BUNDLE, OPMV, DC  # you are likely to use these yourself
from ordf.term import Literal, URIRef             # when adding arbitrary triples

b = Bundle()
# or, if you don't want a bnode for the Bundle URI: b = Bundle(identifier="")
# NB this also instantiates empty bundle.Reason and opmv.Process instances in
# b.reason and b.process, which are used to create the final combined graph at the end

b.encapsulate(URIRef("info:foo"), URIRef("info:bar"))

# we don't want the default plain Reason, we want a human reason:
r = AgentReason()
# again, pass an identifier="" kw to set the URI if you wish
r.comment("They look the same to me :)")
# attach the human reason to the bundle in place of the default:
b.reason = r

# let them know who made the assertion:
a = Agent()
a.nick("benosteen")
a.homepage("")

# add this agent as the controller of the process:
b.process.agent(a)

# create an in-memory graph of all the triples required to assert this bundle:
g = b.bundle_graph()

# easiest way to get it out is to "serialize" it:
print g.serialize()

==============

Output:
<?xml version="1.0" encoding="UTF-8"?>
  <rdf:Description rdf:nodeID="PZCNCkfJ2">
    <rdfs:label> on monster (18787)</rdfs:label>
    <ordf:pid rdf:datatype="">18787</ordf:pid>
    <opmv:wasControlledBy rdf:nodeID="PZCNCkfJ9"/>
    <ordf:version rdf:nodeID="PZCNCkfJ4"/>
    <rdf:type rdf:resource=""/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="PZCNCkfJ0">
    <bundle:encapsulates rdf:resource="info:bar"/>
    <bundle:encapsulates rdf:resource="info:foo"/>
    <bundle:justifiedby rdf:nodeID="PZCNCkfJ5"/>
    <opmv:wasGeneratedBy rdf:nodeID="PZCNCkfJ2"/>
    <rdf:type rdf:resource=""/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="PZCNCkfJ5">
    <rdf:type rdf:resource=""/>
    <opmv:wasGeneratedBy rdf:nodeID="PZCNCkfJ2"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="PZCNCkfJ9">
    <rdf:type rdf:resource=""/>
    <foaf:homepage rdf:resource=""/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="PZCNCkfJ4">
  </rdf:Description>

Given a triplestore containing these bundles, you can query for ‘same as’ URIs by finding which Bundles a given URI appears in.
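That query can be sketched in SPARQL. This is an illustrative sketch, not code from the project, and it assumes a `bundle:` prefix bound to the bundle ontology namespace (which is elided above):

```sparql
# Find every URI bundled together with info:foo, i.e. its 'same as' candidates.
SELECT DISTINCT ?same
WHERE {
  ?b bundle:encapsulates <info:foo> .
  ?b bundle:encapsulates ?same .
  FILTER (?same != <info:foo>)
}
```

Because the assertion lives in the Bundle rather than in an owl:sameAs triple, a consumer can also follow `bundle:justifiedby` from `?b` to the Reason and decide for itself whether to trust each match.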