
“Full-text” search for openbiblio, using Apache Solr

- May 25, 2011 in JISC OpenBib

Overview: This provides a simple search interface for openbiblio, using a network-addressable Apache Solr instance to provide full-text search over the content. The indexer currently relies on the Entry model (in /model/entry.py) to provide an acceptable dictionary of terms to be fed to a Solr instance.

Configuration: In the main paster .ini, you need to set the parameter 'solr.server' to point to the Solr instance, for example 'http://localhost:8983/solr' or 'http://solr.okfn.org/solr/bibliographica.org'. If the instance requires authentication, set the 'solr.http_user' and 'solr.http_pass' parameters too. (Solr is often put behind a password-protected proxy, due to its lack of native authentication for updating the index.)

Basic usage: The search controller, solr_search.py (linked in config/routing.py to /search), provides HTML and JSON responses (via content negotiation, or .html/.json as desired) and interprets a limited but easily expandable subset of Solr parameters (see ALLOWED_TERMS in /controller/solr_search.py). The JSON response is the raw Solr response, as this is quite usable in javascript. The HTML response is styled in the same manner as the previous (Xapian-based?) search service, with each row formatted by the Genshi template function "solr_search_row" in templates/paginated_common.html. Unless specified, the search controller will get all the fields it can for the search terms, meaning that the list of results in c.solr.results contains dicts with much more information than is currently exposed. The potentially available fields are as follows:
    "uri"          # URI for the item - eg http://bibligraphica.org/entry/BB1000
    "title"        # Title of the item
    "type"         # URI type(s) of the item (eg http://.... bibo#Document)
    "description"
    "issued"       # Corresponds to the date issued, if given.
    "extent"
    "language"     # ISO formatted, 3 lettered - eg 'eng'
    "hasEditionStatement"
    "replaces"        # Free-text entry for the work that this item supercedes
    "isReplacedBy"    # Vice-versa above
    "contributor"           # Author, collaborator, co-author, etc
                            # Formatted as "John Smith b1920 "
                            # Use lib/helpers.py:extracturi method to add formatting.
                            # Give it a list of these sorts of strings, and it will return
                            # a list of tuples back, in the form ("John Smith b1920", "http...")
                            # or ("John Smith", "") if no -enclosed URI is found.
    "contributor_filtered"  # URIs removed
    "contributor_uris"      # Just the entity URIs alone
    "editor"                # editor and publisher are formatted as contributor
    "publisher"
    "publisher_uris"        # list of publisher entity URIs
    "placeofpublication"    # Place of publication - as defined in ISBD. Possible and likely to
                            # have multiple locations here
    "keyword"               # Keyword (eg not ascribed to a taxonomy)
    "ddc"                   # Dewey number (formatted as contributor, if accompanied by a URI scheme)
    "ddc_inscheme"          # Just the dewey scheme URIs
    "lcsh"                  # eg "Music "
    "lcsh_inscheme"         # lcsh URIs
    "subjects"              # Catch-all,with all the above subjects queriable in one field.
    "bnb_id"                # Identifiers, if found in the item
    "bl_id"
    "isbn"
    "issn"
    "eissn"
    "nlmid"                 # NLM-specific id, used in PubMed
    "seeAlso"               # URIs pertinent to this item
    "series_title"          # If part of a series: (again, formatted like contributor)
    "series_uris"
    "container_title"       # If it has some other container, like a Journal, or similar
    "container_type"
    "text"                  # Catch-all and default search field.
                            # Covers: title, contributor, description, publisher, and subjects
    "f_title"               # Fields indexed to be suitable for facetting
    "f_contributor"         # Contents as above
    "f_subjects
    "f_publisher"
    "f_placeofpublication"  # See http://wiki.apache.org/solr/SimpleFacetParameters for info
The query text is passed to the solr instance verbatim, so it is possible to do complex queries within the textbox, according to normal solr/lucene syntax. See http://wiki.apache.org/solr/SolrQuerySyntax for some generic documentation. The basics of the more advanced search are, however, as follows:
  • field:query — search only within a given field,
eg 'contributor:"Dickens, Charles"'. Note that query text within quotes is searched for as declared. The above search will not hit an author value of "Charles Dickens", for example, which is why this is not a good way to search generically.
  • Booleans, AND and OR — if left out, multiple queries will be OR'd,
eg 'contributor:Dickens contributor:Charles' == 'contributor:Dickens OR contributor:Charles'. The above will match contributors who are called 'Charles' OR 'Dickens' (non-exclusively), which is unlikely to be what is desired. 'Charles Smith' and 'Eliza Dickens' would be valid hits in this search. 'contributor:Dickens AND contributor:Charles' would be closer to what is intended.
  • URI matching — many fields include the URI, and these can be used to be specific about the match,
eg 'contributor:"http://bibliographica.org/entity/E200000"'. Given an entity URI, therefore, you can see which items are published/contributed/etc just by performing a search for the URI in that field.

Basic Solr updating: The 'solrpy' library is used to talk to a Solr instance, so seek that project out for library-specific documentation (version >= 0.9.4, as this includes basic auth). Fundamentally, to update the index, you need an Entry (model/entry.py) instance mapped to the item you wish to (re)index and a valid SolrConnection instance.
from solr import SolrConnection, SolrException
# Entry is the openbiblio model class from model/entry.py (the import path depends on your setup)
s = SolrConnection("http://host/solr", http_user="usernamehere", http_pass="passwordhere")
e = Entry.get_by_uri("Entry Graph URI")
Then, it's straightforward (catching two typical errors that might be thrown due to a bad or incorrectly configured Solr connection):
from socket import error as SocketError
try:
    s.add(e.to_solr_dict())
    # Uncomment the next line to commit updates (inadvisable to do after every small change of a bulk update):
    # s.commit()
except SocketError:
    print "Solr isn't responding or isn't there"
    # Do something here about it
except SolrException:
    print "Something wrong with the update that was sent. Make sure the solr instance has the correct schema in place and is working and that the Entry has something in it."
    # Do something here, like log the error, etc
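For completeness, the same connection object can also be used to search the index directly, rather than via the controller. This short sketch uses solrpy's query method (taken from solrpy's own documentation, not from this post); the field names are those in the list above:

response = s.query('contributor:Dickens AND contributor:Charles', rows=10)
for hit in response.results:
    print hit.get("title"), hit.get("uri")
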
Bulk Solr updating from nquads: There is a paster command for taking the nquads Bibliographica.org dataset, parsing it into mapped Entry objects and then performing the above.
    Usage: paster indexnquads [options] config.ini NQuadFile
Create Solr index from an NQuad input

Options:
  -h, --help            show this help message and exit
  -c CONFIG_FILE, --config=CONFIG_FILE
                        Configuration File
  -b BATCHSIZE, --batchsize=BATCHSIZE
                        Number of solr 'docs' to combine into a single update
                        request document
  -j TOJSON, --json=TOJSON
                        Do not update solr - entry's solr dicts will be
                        json.dumped to files for later solr updating
The --json option is particularly useful for production systems, as the time-consuming part of this is the parsing and mapping to Entry objects; you can offload that drain to any computer and upload the solrupdate*.json files it creates directly to the production system for rapid indexing. NOTE: this will start with solrupdate0.json and iterate up. It WON'T CHECK for the existence of previous solr updates, and they will be overwritten! [I used a batchsize of 10000 when using the JSON export method]

Bulk Solr updating from the aforementioned solrupdate*.json files:
    paster indexjson [options] config.ini solrupdate
    Create Solr index from a JSON serialised list of dicts

Options:
  -h, --help            show this help message and exit
  -c CONFIG_FILE, --config=CONFIG_FILE
                        Configuration File
  -C COMMIT, --commit=COMMIT
                        COMMIT the solr index after sending all the updates
  -o OPTIMISE, --optimise=OPTIMISE
                        Optimise the solr index after sending all the updates
                        and committing (forces a commit)
eg "paster indexjson development.ini --commit=True /path/to/solrupdate*"
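As a rough illustration of what that last step amounts to (a sketch only: it assumes each solrupdate*.json file holds a JSON-serialised list of Solr document dicts, as --json produces, and uses solrpy's add_many; the paster command above is the supported route):

import glob, json
from solr import SolrConnection

s = SolrConnection("http://localhost:8983/solr")
for path in sorted(glob.glob("/path/to/solrupdate*.json")):
    docs = json.load(open(path))
    s.add_many(docs)   # one update request per file's batch of docs
s.commit()             # commit once at the end, as --commit=True does
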

Academic Bibliography data available from Acta Cryst E

- January 12, 2011 in inf11, jisc, JISC OpenBib, jiscopenbib, progress

The bibliographic data from Acta Cryst E, a publication by the International Union of Crystallography (IUCr), has been extracted and made available with their consent. You can find a SPARQL endpoint for the data here and the full dataset here. I have also geocoded a number of the affiliations of the authors, plotting them on a timemap (visualising the time of publication against the location of the authors), and you can see this at this location. What you will find:
  • A SPARQL endpoint with limited output capabilities (limited content negotiation).
  • A ‘describe’ method, to display unstyled HTML pages about authors or the papers, based on the given URI.
  • Links from the data in the service to the original papers.
  • The data dump consists of zipped-up directories of rdf, which have most of the intermediary xml, html and other bits removed. Hopefully, this helps explain the odd layout!

Name matching strategy using bibliographic data

- December 1, 2010 in JISC OpenBib, Semantic Web

One of the aims of an RDF representation of bibliographic data should be to have authors represented by unique, reference-able points within the data (as URIs), rather than as free-text fields. What steps can we take to match up the text value representing an author's name to another example of their name in the data? It's not realistic to expect a match between, say, Mark Twain and Samuel Clemens without using some extra information typically not present in bibliographic datasets. What can be achieved, however, is 'fuzzy' matching of alternate forms of names, due to typos, mistakes, omitted initials and the like. It is important that these matches are understood to be fuzzy and not precise, based more on statistics than on definite assertion. How best to carry out the matching of a set of authors within a bibliographic dataset? The following is not the only way, but it is a useful method to make progress with:
  1. List – Gather a list of the things you wish to match, with unique identifiers for each, and map out a list of the pairs of names that need to be compared. (Note that this mapping will be greatly affected by the next step.)
  2. Filter – Remove the matches that aren't worth fully evaluating. An index of the text can give a qualitative view on which names are worth comparing and which are not.
  3. Compare – Run through the name-pairs and evaluate each match (likely using string metrics of some kind; see the sketch after this list). The accuracy of the match may be improved by using other data, with some sorts of data drastically improving your chances, such as author email, affiliation (and date) and birth and death dates.
  4. Binding – Bind the names together in whichever manner is required. I would recommend creating Bundles as a record of a successful match, and an index or SPARQL-able service to allow 'sameas'-style lookups in a live service.
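To make the Compare step concrete, here is a minimal sketch of a Jaro-style comparison with a simple date check folded in. This is my own illustration, not the project's actual code, and the thresholds are guesses you would tune against your data:

def jaro(s1, s2):
    # Jaro similarity: 1.0 is an exact match, 0.0 is no similarity at all.
    if not s1 or not s2:
        return 0.0
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    matched2 = [False] * len(s2)
    matches1 = []
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched2[j] = True
                matches1.append(ch)
                break
    if not matches1:
        return 0.0
    matches2 = [s2[j] for j in range(len(s2)) if matched2[j]]
    transpositions = sum(a != b for a, b in zip(matches1, matches2)) / 2.0
    m = float(len(matches1))
    return (m / len(s1) + m / len(s2) + (m - transpositions) / m) / 3.0

def likely_same_author(name1, name2, dates1="", dates2="", threshold=0.9):
    # Names that look alike, plus compatible birth/death dates when both are present.
    score = jaro(name1.lower(), name2.lower())
    if dates1 and dates2 and jaro(dates1, dates2) < 0.8:
        return False   # strong name match but very different dates: probably different people
    return score >= threshold

print likely_same_author("Dickens, Charles", "Dickens, Charls")   # a likely typo match
print likely_same_author("Dickens, Charles", "Eliza Dickens")     # should not match
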
In terms of the BL dataset within http://bnb.bibliographica.org then: List: We have had to apply a form of identifier for each instance of an author's name within the BL dataset. Currently, this is done via an 'owl:sameAs' property on the original blank node, linking to a URI of our own making, eg http://bibliographica.org/entity/735b0…12d033. It would be a lot better if the BL were to mint their own URIs for this, but in the meantime this gives us enough of a hook to begin with. One way you might gather the names and URIs is via SPARQL:
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?book ?authoruri ?name
WHERE {
    ?book a bibo:Book .
    ?book dc:contributor ?authorbn .
    ?authorbn skos:notation ?name .
    ?authorbn owl:sameAs ?authoruri .
}
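In a script, one way to run that query and work through it is to page with LIMIT/OFFSET. Here is a sketch using the SPARQLWrapper library; the library choice and the endpoint URL are assumptions on my part, as the post does not name a client:

from SPARQLWrapper import SPARQLWrapper, JSON

query = """
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?book ?authoruri ?name
WHERE {
    ?book a bibo:Book .
    ?book dc:contributor ?authorbn .
    ?authorbn skos:notation ?name .
    ?authorbn owl:sameAs ?authoruri .
}
"""

sparql = SPARQLWrapper("http://bnb.bibliographica.org/sparql")
sparql.setReturnFormat(JSON)
page, offset = 1000, 0
while True:
    sparql.setQuery(query + " LIMIT %d OFFSET %d" % (page, offset))
    rows = sparql.query().convert()["results"]["bindings"]
    if not rows:
        break
    for row in rows:
        print row["authoruri"]["value"], row["name"]["value"]
    offset += page
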
However, there will be very many results to page through, and it will put a lot of stress on the SPARQL engine if lots of users are doing this heavy querying at the same time! This is also a useful place to gather any extra data you will use at the compare stage (usually email, affiliation or, in this particular case, the potential birth and death dates).

Filter: This is the part that is quite difficult, as you have to work out a method for negating matches without the cost of a full match. If the filter method is slower than simply working through all the matches, then it is not worth doing. In matching names from the BL set, however, there are many millions of values, but from glancing over the data I only expect tens of matches or fewer on average for a given name. The method I am using is to make a simple stemming index of the names, with the birth/death dates as extra fields. This I have done in Solr (experimenting with different stemming methods), and I have come to the odd conclusion that default English stemming provides suitable groupings. I found this was backed up somewhat by this comparison of string metrics [PDF], which suggests that a simple index combined with a metric called 'Jaro' works well for names. So, in this case, I generate the candidate matches by running the names through an index of all the names and using the most relevant search results as the basis for the pairs of names to be compared. The pairs are combined into a set, ordered alphabetically – only the pairing is necessary, not the ordering of the pair. This is so that we don't end up matching the same names twice.

Compare: This is the most awkward step – it is hard to generate a 'golden set' of data by which you can rate the comparison without using other data sources. However, the matching algorithm I am using is the Jaro comparison, to get a figure indicating the similarity of the names. As the BL data is quite a good set of data (in that it is human-entered and care has been taken over the entry), this type of comparison works quite well – the difference between a positive and a negative match is quite high. Care must be taken not to miss matches due to omitted or included initials, middle names, proper forms and so on. The usefulness of the additional data is quite dependent on the match between the names. If the names match perfectly, but the birth dates are very different (different in value and distant in edit distance), then this is likely to be a different author. If the names match somewhat, but the dates match perfectly, then this is a possible match. If the dates match perfectly, but the names don't match at all (unlikely, due to the above filtering step), then this is not a match.

Binding: This step I have not made my mind up about, as the binding step is a compromise in how much you record. By bundling together all the names for a given author in a single bundle if they are below the threshold for a positive match, you get a bundle that requires fewer triples to describe it. However, you really should have a bundle for each pairing, but this dramatically increases the number of triples required to express it. Either way, the method for querying the resultant data is the same. For example: a set of bundles 'encapsulates' A, B, C, D, E, F, G – so, given B, you can find the others by a quick, if inelegant, SPARQL query:
PREFIX bundle: <http://purl.org/net/bundle#>
SELECT ?sameas
WHERE {
  ?bundle bundle:encapsulates <B> .   # <B> stands for the URI you already hold
  ?bundle bundle:encapsulates ?sameas .
}
Whether this data should be collapsed into a closure of some sort is up to the administrator – how much must you trust this match before you can use owl:sameAs and incorporate it into a congruent closure? I’m not sure the methods outlined above can give you a strong enough guarantee to do so at this point.

Characterising the British Library Bibliographic dataset

- November 18, 2010 in inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, OKFN Openbiblio, progressPosts

Having RDF data is good. Having linkable data is better, but having some idea of what sorts of properties you can expect to find within a triplestore or block of data can be crucial. That sort of broad-stroke information can be vital in letting you know when a dataset contains interesting data that makes the work to use it worthwhile. I ran the recently re-released BL RDF data (get it from here or here) (CC0) through a simple script that counted occurrences of various elements within the 17 files, as well as enumerating all the different sorts of property you can expect to find. Some interesting figures:
  • Over 100,000 records in each file, 2.9 million ‘records’ in total. Each record is a blank node.
  • Three main types of identifier – a ‘(Uk)123….’, ‘GB123…’ and (as a literal) ‘URN:ISBN:123…’, but not all records have ISBNs as some of them predate it.
  • Nearly 29 million blank nodes in total.
  • 11,187,804 uses of dcterms:subject, for an average of just under 4 per record (3.75…)
  • Uses properties from Dublin Core terms, OWL-Time, ISBD, and SKOS
  • The dcterms:subject values are all SKOS declarations, and include the Dewey decimal, LCSH and MeSH schemes. (Work to use id.loc.gov LCSH URIs instead of literals is underway.)
  • Includes rare and valuable information, stored in properties such as dcterms:isPartOf, isReferencedBy, isReplacedBy, replaces, requires and dcterms:relation.
Google spreadsheet of the tallies. Occurrence trends through the 17 data files (BNBrdfdc01.xml -> 17.xml). (The image is as Google spreadsheet exported it; click on the link above to go to the sheet itself to view it natively, without axis distortion.)

Literals and what to expect: I wrote another straightforward script that can mine sample sets of unique literals from the BNBrdfdc xml files. Usage for 'gather_test_literals.py':

    Usage: python gather_test_literals.py path/to/BNBrdfdcXX.xml ns:predicate number_to_retrieve [redis_set_to_populate]

For example, to retrieve 10 literal values from the bnodes within dcterms:publisher in BNBrdfdc01.xml:

python gather_test_literals.py BNBrdfdc01.xml "dcterms:publisher" 10
  • and to also push those values into a local Redis set 'publisherset01' if Redis is running and redis-py is installed:
python gather_test_literals.py BNBrdfdc01.xml "dcterms:publisher" 10 publisherset01
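For reference, the core of such a literal-gathering script might look roughly like this. It is a sketch only, assuming ElementTree, redis-py and a small prefix-to-namespace map; the real gather_test_literals.py may well differ:

import sys
import xml.etree.ElementTree as ET

NAMESPACES = {
    "dcterms": "http://purl.org/dc/terms/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

def gather_literals(xmlfile, predicate, limit):
    prefix, local = predicate.split(":")
    tag = "{%s}%s" % (NAMESPACES[prefix], local)
    found = set()
    # stream the file and stop as soon as enough unique values are found
    for _, elem in ET.iterparse(xmlfile):
        if elem.tag == tag:
            text = " ".join(t.strip() for t in elem.itertext() if t.strip())
            if text:
                found.add(text)
                if len(found) >= limit:
                    break
    return found

if __name__ == "__main__":
    literals = gather_literals(sys.argv[1], sys.argv[2], int(sys.argv[3]))
    for lit in literals:
        print lit
    if len(sys.argv) > 4:            # optional Redis set to populate
        import redis
        r = redis.Redis()
        for lit in literals:
            r.sadd(sys.argv[4], lit)
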
So, to find out what, at most, 10 of those intriguing ‘dcterms:isReferencedBy’ predicates contain in BNBrdfdc12.xml, you can run:
python gather_test_literals.py BNBrdfdc12.xml "dcterms:isReferencedBy" 10
(As long as gather_test_literals.py and the xml files are in the same directory, of course.) Result:

Chemical abstracts,
Soulsby no. 4061
Soulsby no. 3921
Soulsby no. 4018
Chemical abstracts

As the script gathers the literals into a set, it will only return when it has either reached the desired number of unique values or reached the end of the file. Hopefully, this will help other people explore this dataset and also pull information from it. I have also created a basic Solr configuration that has fields for all the elements found in the BNB dataset here.

"Bundling" instances of author names together without using owl:sameas

- November 17, 2010 in inf11, jisc, JISC OpenBib, jiscEXPO, jiscopenbib, ORDF, progress, projectMethodology, rdf, Semantic Web

Bundling? It's a verb I've taken from "Glaser, H., Millard, I., Jaffri, A., Lewy, T. and Dowling, B. (2008) On Coreference and The Semantic Web http://eprints.ecs.soton.ac.uk/15765/", where the core idea is that you have a number of URIs that mean or reference the same real thing, and the technique they describe as bundling is to aggregate all those references together. The manner they describe is built on a sound basis in logic, and is related to (if not the same as) a congruent closure.

The notion of bundling I am using is not as rooted in mathematical logic, because I need to convey an assertion that one URI is meant to represent the same thing that another URI represents, in a given context and for a given reason. This is a different assertion, if only subtly different, from what 'owl:sameAs' asserts, but the difference is key for me.

It is best to think this through with an example of where I am using it – curating bibliographic records and linking authors together. It's an obvious desire – given a book or article, to find all the other works by an author of that work. Technologically, with RDF this is a very simple proposition BUT the data needs to be there. This is the point where we come unstuck. We don't really have the quality of data that firmly establishes that one author is the same as a number of others. String matching is not enough!

So, how do we clean up this data (converted to RDF) so that we can try to stitch together the authors and other entities in them? See this previous post on augmenting British Library metadata so that the authors, publishers and so on are externally reference-able once they are given unique URIs. This really is the key step. Any other work that can be done to make any of the data about the authors and so on more semantically reference-able will be a boon to the process of connecting the dots, as I have done for authors with birth and/or death dates.

The fundamental aspect to realise is that we are dealing with datasets which have missing data, misrepresented data (typos), misinterpreted fields (ISBNs of £2.50, for example) and other non-uniform and irregular problems. Connecting authors together in datasets with these characteristics will rely on us, and the code that we write, making educated guesses and probabilistic assertions, based on how confident we are that things match. We cannot say for sure that something is a cast-iron match, only that we are above a certain level of confidence that this is so. We also have to have a good reason. Something else to take on board is that what I would consider to be a good match might not be good for someone else, so there needs to be a manner to state a connection and to say why, who and how this match was made, as well as a need to keep this data, made up of assertions, away from our source data.

I've adopted the following model for encoding this assertion in RDF, in a form that sits outside of the source data as a form of overlay data. You can find the bundle ontology I've used at http://purl.org/net/bundle.rdf (pay no attention to where it is currently living):

[Diagram (click to view in full, unsquished form): Bundle of URIs, showing use of OPMV]

The URIs shown to be 'opmv:used' in this diagram are not meant to be exhaustive. It is likely that a bundle may depend on a look-up or resolution service, external datasheets, authority files, csv lists, dictionary lists and so on.
Note that the 'Reason' class has few, if any, mandatory properties aside from its connection to a given Bundle and opmv:Process. Assessing whether you trust a Bundle is, at this moment, very much based on the source and the agent that made the assertion. As things get more mature, more information will regularly find its place attached to a 'Reason' instance. There are currently two subtypes of Reason: AlgorithmicReason and AgentReason. Straightforwardly, this is the difference between a machine-made match and a human-made match, and use of these should aid the assessment of a given match.

Creating a bundle using python: I have added a few classes to Will Waites' excellent 'ordf' library, and you can find my version here. To create a virtualenv to work within, do as follows. You will need mercurial and virtualenv already installed. At a command line – eg '[@localhost] $' – enter the following:
hg clone http://bitbucket.org/beno/ordf
virtualenv myenv
. ./myenv/bin/activate
(myenv) $ pip install ordf
So, creating a bundle of some URIs – "info:foo" and "info:bar" – due to a human choice of "They look the same to me :)". In python:

from ordf.vocab.bundle import Bundle, Reason, AlgorithmicReason, AgentReason
from ordf.vocab.opmv import Agent
from ordf.namespace import RDF, BUNDLE, OPMV, DC  # you are likely to use these yourself
from ordf.term import Literal, URIRef             # when adding arbitrary triples

b = Bundle()
# or, if you don't want a bnode for the Bundle URI:
#     b = Bundle(identifier="http://example.org/1")
# NB this also instantiates empty bundle.Reason and opmv.Process instances
# in b.reason and b.process, which are used to create the final combined graph at the end

b.encapsulate( URIRef("info:foo"), URIRef("info:bar") )

# we don't want the default plain Reason, we want a human reason:
r = AgentReason()   # again, pass an identifier="" kw to set the URI if you wish
r.comment("They look the same to me :) ")

# Let them know who made the assertion:
a = Agent()
a.nick("benosteen")
a.homepage("http://benosteen.com")

# Add this agent as the controller of the process:
b.process.agent(a)

g = b.bundle_graph()  # this creates an in-memory graph of all the triples required to assert this bundle

# easiest way to get it out is to "serialize" it:
print g.serialize()

==============
Output:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:bundle="http://purl.org/net/bundle#"
   xmlns:foaf="http://xmlns.com/foaf/0.1/"
   xmlns:opmv="http://purl.org/net/opmv/ns#"
   xmlns:ordf="http://purl.org/NET/ordf/"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
>
  <rdf:Description rdf:nodeID="PZCNCkfJ2">
    <rdfs:label> on monster (18787)</rdfs:label>
    <ordf:hostname>monster</ordf:hostname>
    <ordf:pid rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">18787</ordf:pid>
    <opmv:wasControlledBy rdf:nodeID="PZCNCkfJ9"/>
    <ordf:version rdf:nodeID="PZCNCkfJ4"/>
    <rdf:type rdf:resource="http://purl.org/net/opmv/ns#Process"/>
    <ordf:cmdline></ordf:cmdline>
  </rdf:Description>
  <rdf:Description rdf:nodeID="PZCNCkfJ0">
    <bundle:encapsulates rdf:resource="info:bar"/>
    <bundle:encapsulates rdf:resource="info:foo"/>
    <bundle:justifiedby rdf:nodeID="PZCNCkfJ5"/>
    <opmv:wasGeneratedBy rdf:nodeID="PZCNCkfJ2"/>
    <rdf:type rdf:resource="http://purl.org/net/bundle#Bundle"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="PZCNCkfJ5">
    <rdf:type rdf:resource="http://purl.org/net/bundle#Reason"/>
    <opmv:wasGeneratedBy rdf:nodeID="PZCNCkfJ2"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="PZCNCkfJ9">
    <foaf:nick>benosteen</foaf:nick>
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/>
    <foaf:homepage rdf:resource="http://benosteen.com"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="PZCNCkfJ4">
    <rdfs:label>ordf</rdfs:label>
    <rdf:value>0.26.391.901cf0a0995c</rdf:value>
  </rdf:Description>
</rdf:RDF>

Given a triplestore with these bundles, you can query for 'same as' URIs by looking at which Bundles a given URI appears in.
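As a concrete, if simplistic, sketch of that lookup, assuming the serialised bundle above has been saved to a file and using rdflib directly (which ordf builds upon) rather than a full triplestore:

from rdflib import Graph, Namespace, URIRef

BUNDLE = Namespace("http://purl.org/net/bundle#")

g = Graph()
g.parse("bundle.rdf")   # the RDF/XML output shown above, saved locally

target = URIRef("info:foo")
for bundle in g.subjects(BUNDLE["encapsulates"], target):
    for same in g.objects(bundle, BUNDLE["encapsulates"]):
        if same != target:
            print same   # every URI bundled with info:foo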