You are browsing the archive for Ideas.

From CMS to DMS: C is for Content, D is for Data

- March 9, 2012 in Featured, Ideas, Open Standards

This is a joint blog post by Francis Irving, CEO of ScraperWiki, and Rufus Pollock, Founder of the Open Knowledge Foundation. It’s being cross-posted to both blogs.

Content Management Systems, remember those?

It’s 1994. You haven’t heard of the World Wide Web yet. Your brother goes to a top university. He once overheard some geeks in the computer room making a ‘web site’ consisting of a photo tour of their shared house. He thought it was stupid: Usenet is so much better. The question – in 1994, did you understand what a Content Management System (CMS) was?

In the intervening years, CMSs have gone through ups and downs: building massive businesses, crashing in the .com collapse, then a glut in the early noughties as web design agencies all built their own. We’ve ended up with the situation now: a mature market, commoditised by open-source WordPress, where anyone can get a page on the web using Facebook. There’s still room for expensive, proprietary players – newspapers custom-make their own, and businesses have fancy intranets.

Data Management Systems, time to meet them!

DMSs are also called "data hubs".

It’s 2012. You’ve just about heard of Open Data. Your nephew researches the Internet at a top university. He says there’s no future in Open Data: no communities have formed round it, companies aren’t publishing much data yet, and governments publish the wrong data, reluctantly. The question – what is a Data Management System (DMS)?

There isn’t a very good one yet. We’re at roughly where CMSs were in the mid-1990s. Most people get by fine without them. Just as then we wrote HTML in text files by hand and uploaded it by FTP, now we analyse data on our laptops using Excel and share it with friends by emailing CSV files. But it reaches the point where using the filesystem and Outlook as your DMS stretches to breaking point. You’ll need a proper one. Nobody really knows what a proper one will look like yet – we’re all working on it. But we do know what it will enable.

What must a DMS do?

All the things people expect a DMS to do!

A mature DMS will let people do all of the following, whether as a proprietary monolith or by slick integration across the web:
  • Load and update data from any source (ETL)
  • Store datasets and index them for querying
  • View, analyse and update data in a tabular interface (spreadsheet)
  • Visualise data, for example with charts or maps
  • Analyse data, for example with statistics and machine learning
  • Organise many people to enter or correct data (crowd-sourcing)
  • Measure and ensure the quality of data, and its provenance
  • Permissions; data can be open, private or shared
  • Find datasets, and organise them to help others find them
  • Sell data, sharing processing costs between users
If it sounds like a fat list for a product, that’s because it is. But sometimes the need – the market – pulls you: something simple just won’t do. It has to do, or enable, everything above as best it can. (Compare it to the same list for CMSs.) In short, it’s what the elite data-wrangling teams inside places like Wolfram Alpha and Google’s Metaweb already do, but made easier and more visible using standardised tools and protocols.
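As a toy illustration of the first two items on that list – load data from a source, store it, and index it for querying – here is a minimal sketch using only the Python standard library. The dataset, table name and column names are invented for the example; a real DMS would add type inference, provenance tracking and incremental updates on top of something like this.

```python
import csv
import io
import sqlite3

def load_dataset(conn, name, csv_text):
    """Minimal ETL: parse a CSV and store it as a queryable SQL table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    cols = ", ".join('"%s" TEXT' % c for c in header)
    conn.execute('DROP TABLE IF EXISTS "%s"' % name)
    conn.execute('CREATE TABLE "%s" (%s)' % (name, cols))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany('INSERT INTO "%s" VALUES (%s)' % (name, placeholders), data)
    conn.commit()

# "Any source" here is just a string; in practice it would be a URL,
# an API response, or the output of a scraper.
conn = sqlite3.connect(":memory:")
load_dataset(conn, "spend", "dept,amount\nHealth,120\nTransport,80\n")
total = conn.execute("SELECT SUM(CAST(amount AS INTEGER)) FROM spend").fetchone()[0]
print(total)  # 200
```

Everything else on the list – permissions, provenance, crowd-sourcing, selling – layers on top of this loading-and-querying core.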

Who’s making a DMS?

More people than I realised – from the largest IT company to the tiniest startup. Here are some I know about; mention more in the comments:
  • Windows / OSX (+ Excel / LibreOffice / …) – the desktop serves as a (good enough so far) DMS
  • CKAN software – started as a data catalog, but has grown into more and powers the DataHub, a community data hub and market. Created by the Open Knowledge Foundation
  • ScraperWiki – coming from the viewpoint of a programmer, good at ETL
  • Infochimps/DataMarket – approaching it as a data marketplace
  • BuzzData – specialising in the social aspects
  • Tableau Public – specialising in visualisation
  • Google Spreadsheets – coming from the web spreadsheet direction
  • Microsoft Data Hub – corporate information management
  • PANDA – making a DMS for newsrooms
They’re all DMSs because they all naturally grow bad versions of each other’s features. Two examples: ScraperWiki is particularly good at complex ETL (loading data into a system), yet every DMS has to have a data ingestion interface, at minimum one for choosing CSV columns. CKAN has particularly good metadata, usage and provenance, yet every DMS has to have a way for people to find the data stored in it.

So will they be giant monolithic bits of software?

We standardised the shipping container, can we standardise data interoperation?

We hope not! That didn’t turn out great for CMSs, although there are some businesses providing that. CMSs only really came of age in the mid-noughties, when everyone realised that WordPress (open-source blogging software!) was a better CMS than most CMSs. It’s in everyone’s interest that users aren’t locked into one DMS. One of them might have a whizzy content-analysis tool that somebody with data in another DMS wants to use. They should be able to, and easily.

OKFN is about to launch a standards initiative to bring together such things. It’s called Data Protocols. So far the clearest needs are twofold and mirror each other – pulling and pushing data:

a) A data query protocol/format to allow realtime querying, for example for exploring data. Imagine a Google Refine instance live-querying a large dataset on OKFN’s Data Hub.

b) A data sync protocol/format akin to CouchDB’s replication protocol. It would let datasets be updated in real time across the web. Imagine a set of scrapers on ScraperWiki automatically updating a visualisation on Many Eyes as the data changed.

Later, even more imaginative things… I reckon Google’s Web Intents could be used to make the whole experience slick when using multiple DMSs at once. And hopefully somebody, somewhere is making a simplified version of SPARQL/RDF, just as XML simplified SGML and then really took off.

Enough of me! What do you think? Join in. Make standards. Write code. Leave a comment below, and join the data protocols list.
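To make the data query protocol idea a little more concrete, here is a purely hypothetical sketch of evaluating one query document against a dataset. The filters/limit format below is invented for illustration – it is not the actual Data Protocols proposal, just the shape such a thing might take.

```python
def run_query(records, query):
    """Evaluate a hypothetical DMS query document against a dataset.

    `query` is an invented JSON-style document: equality filters
    plus an optional limit. A real protocol would need much more
    (projections, ranges, paging, auth...).
    """
    out = [r for r in records
           if all(r.get(k) == v for k, v in query.get("filters", {}).items())]
    if "limit" in query:
        out = out[:query["limit"]]
    return {"total": len(out), "records": out}

dataset = [
    {"country": "UK", "year": 2011, "value": 10},
    {"country": "UK", "year": 2012, "value": 12},
    {"country": "FR", "year": 2011, "value": 9},
]
result = run_query(dataset, {"filters": {"country": "UK"}, "limit": 1})
print(result["total"], result["records"][0]["year"])  # 1 2011
```

The point of standardising the document, rather than the implementation, is that any DMS could answer it over HTTP, and any tool could issue it.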

Dreams of a Unified Text

- January 24, 2012 in Ideas, Musings, Open Knowledge, Public Domain

The following is a blog post by Rufus Pollock, co-founder of the Open Knowledge Foundation.

I have a dream, one which I’ve had for a while. In this dream I’m able to explore, seamlessly, online, every text ever written. With the click of a button I can go from Pynchon to Proust, from Musil to Machiavelli, from Homer to Hugo. And in this dream not only can I read, but I myself am able to contribute, to write upon these texts – to annotate, to anthologize, to interlink, to translate, to borrow – and to share what I do with others. I can see what others have shared, what notes they have added, what selections they have made. I can see the interweaving of these texts created by borrowing, by inspiration, by reference, all made concrete by the insight and efforts of myself and others and their ability to layer their insights freely upon those original texts – just as those writers built upon the works that had gone before them. And while each text can still stand alone – in all its greatness or mediocrity – we have something new: a single unified corpus woven together out of this multitude of separate texts – e pluribus unum. A whole that is a concrete instantiation, in an immaterial realm, of the cultural achievement of mankind as expressed in the written word.

Dream Meets Reality

Why is this dream not yet a reality? After all, don’t we have the tools and technology? One answer is legal, one is technological, and one is social.

The legal issue is copyright, at least in its current exclusive-rights form 1. Copyright means this vision is only really possible for works in the public domain – works, therefore, that are in most countries a hundred years or more old. This isn’t necessarily that big a problem, at least for texts: the public domain, though old, is already incredibly rich, so we already have more than enough material to be getting on with.

On the technology front we have the cost of digitization, processing and storage. Digitization costs are significant. This has meant either that digitization activities have been limited, or that the material created has not been released openly (for example, the material produced by Google’s efforts with its Books project, probably the largest effort to date, is not open). That said, efforts like Project Gutenberg and the Internet Archive have already made available tens of thousands of texts, and there are now several digitization projects underway that will result in even larger amounts of material freely and openly available.

Third, we have the social issue, or rather the question of how technology can support the social activities required for this dream of a unified text to become real. Specifically, to realize our dream we need to bring material – texts and the writing upon them – together in a single coherent experience. Yet the centralization (and ownership) that implies may be a significant obstacle to mass participation.2 Similarly, we need it to be possible for anyone with ‘net access to contribute to the weaving of the unified inter-text but, at the same time, to be able to select which contributions we want to see (if we are not to be overwhelmed by an avalanche of material, much of it possibly of dubious quality).


We have, then, within our grasp the realization of the dream of a unified text. By combining these texts with the technology we now have, we can create something truly extraordinary. Interested in making this happen? Come join us at the Textus Project.

  1. Let me be clear: I’m not saying that copyright per se is bad or that everything should be ‘free’. Time, energy and capital are required to create books, music and films, and that expenditure often needs to be recompensed. However, the current system of copyright is by no means the best way to achieve this. This is not something I wish to explore in detail here. More can be found on my personal website and in papers such as Forever Minus a Day: Theory and Empirics of Optimal Copyright.
  2. This tension between distributed collaboration and centralizing tendencies of coordination and scale is a common theme in many ‘net projects. 

Ideas for

- December 20, 2011 in Bibliographic, Free Culture, Ideas, Open Content, Open Data, Public Domain, WG Cultural Heritage, WG Humanities, WG Public Domain, Working Groups

The following post is from Jonathan Gray, Community Coordinator at the Open Knowledge Foundation. It is cross-posted from
For several years I’ve been meaning to start, which would be a collection of open resources related to philosophy for use in teaching and research. There would be a focus on the history of philosophy, particularly on primary texts that have entered the public domain, and on structured data about philosophical texts. The project could include:
  • A collection of public domain philosophical texts, in their original languages. This would include so called ‘minor’ figures as well as well known thinkers. The project would bring together texts from multiple online sources – from projects like Europeana, the Internet Archive, Project Gutenberg or Wikimedia Commons, to smaller online collections from libraries, archives, academic departments or individual scholars. Every edition would be rights cleared to check that it could be freely redistributed, and would be made available either under an open license, with a rights waiver or a public domain dedication.
  • Translations of public domain philosophical texts, including historical translations which have entered the public domain, and more recent translations which have been released under an open license.
  • Ability to lay out original texts and translations side by side – including the ability to create new translations, and to line up corresponding sections of the text.
  • Ability to annotate texts, including private annotations, annotations shared with specific users or groups of users, and public annotations. This could be done using the Annotator tool.
  • Ability to add and edit texts, e.g. by uploading or by importing via a URL for a text file (such as a URL from Project Gutenberg). Also ability to edit texts and track changes.
  • Ability to be notified of new texts that might be of interest to you – e.g. by subscribing to certain philosophers.
  • Stable URLs to cite texts and/or sections of texts – including guidance on how to do this (e.g. automatically generating citation text to copy and paste in a variety of common formats).
The project could also include a basic interface for exploring and editing structured data on philosophers and philosophical works:
  • Structured bibliographic data on public domain philosophical works – including title, year, publisher, publisher location, and so on. Ability to make lists of different works for different purposes, and to export bibliographic data in a variety of formats (building on existing work in this area – such as Bibliographica and related projects).
  • Structured data on secondary texts, such as articles, monographs, etc. This would enable users to browse secondary works about a given text. One could conceivably show which works discuss or allude to a given section of a primary text.
  • Structured data on the biographies of philosophers – including birth and death dates and other notable biographical and historical events. This could be combined with bibliographic data to give a basic sense of historical context to the texts.
Other things might include:
  • User profiles – to enable people to display their affiliation and interests, and to be able to get in touch with other users who are interested in similar topics.
  • Audio versions of philosophical texts – such as from Librivox.
  • Links to open access journal articles.
  • Images and other media related to philosophy.
  • Links to Wikipedia articles and other introductory material.
  • Educational resources and other material that could be useful in a teaching/learning context – e.g. lecture notes, slide decks or recordings of lectures.
While there are lots of (more or less ambitious!) ideas above, the key thing would be to develop the project in conjunction with end users in philosophy departments, including undergraduate students and researchers. Having something simple that could be easily used and adopted by people who are teaching, studying or researching philosophy or other humanities disciplines would be more important than something cutting-edge and experimental but less usable. Hence it would be really important to have a good, intuitive user interface and lots of ongoing feedback from users. What do you think? Interested in helping out? Know of existing work that we could build on (e.g. bits of code or collections of texts)? Please do leave a comment below, join the discussion on the open-humanities mailing list or send me an email!

ideas

- November 8, 2011 in Ideas, uk-gov

There are lots of cool ideas for how to reuse #opendata on - should have a look and make sure we are not missing anything!


A translation fund for public domain texts

- November 2, 2011 in Free Culture, Ideas, Public Domain, Public Domain Works, WG Public Domain

The following post is from Jonathan Gray, Community Coordinator at the Open Knowledge Foundation. It was originally posted on his blog.
If a text is widely known and was published more than a century and a half ago, chances are that it will be freely available on the web to read and download. Every person with an internet connection has access to a vast wealth of cultural and historical material: novels and poems, essays and manifestos, constitutions and scriptures. As well as accessing and sharing this material, the law says that anyone can translate and republish works which have entered the public domain. But translations constitute new creative works and are hence covered by copyright and related rights, which means that by default they cannot be shared online.

This is, of course, perfectly understandable. There is money to be made from producing new translations of classic works, which means publishers and translators are incentivised to assert their rights. Literary translation is a fine art: translators must unpick constellations of connotation and navigate between the Scylla and Charybdis of fidelity and perspicuity as they reconstitute the work they are translating into its target language. It is natural to reward translators in the same manner we reward authors of original texts – for translations often are new literary works. Things like Seamus Heaney’s rendering of Beowulf, Baudelaire’s Edgar Allan Poe, or Schlegel’s Shakespeare testify to this.

So if I want to read a work in a language that I do not understand, I must go to a bookshop and buy a new translation. Such is life. But wouldn’t it be nice if some new translations of public domain texts were freely available for people to read online? If the commercial translations were complemented by a stronger culture of translators sharing the fruits of their labour? One could imagine this being encouraged with a mixture of stronger norms and alternative incentives. For example, students could be encouraged to share translations made during the course of their studies.
There could be more avenues for scholars and professional translators to publish works which they are unlikely to get a contract to publish or derive income from, such as shorter or more obscure works. And there could be awards, stipends or bursary funds for outstanding translations of public domain works which were freely published on the web. At the Public Domain Review we’ve been thinking about how a literary translation fund for public domain texts might work. We’re currently thinking:
  • There could be an initial focus on short works (e.g. under 10,000 words), with a token stipend or cash prize to recognise outstanding translations.
  • It could be overseen by an advisory group of writers, scholars, translators, publishers and critics – who would help to give direction and focus to the fund, evaluate submissions and publicise it.
  • Translations would be published under a Creative Commons Attribution or Creative Commons Attribution Sharealike license and uploaded to the Internet Archive, Project Gutenberg or Wikisource.
  • It could be financially supported by a mixture of cultural and academic funding bodies and augmented with sponsorship from the private sector (publishers, literary publications, technology companies, etc).
We’d like to try and launch a small fund to do this to coincide with Public Domain Day 2012. Do you have thoughts about how this could work? Know of anything like it that already exists? Or know people or bodies who might be interested in supporting this? If you have any cunning ideas, please do send me a message or leave a comment below!

Dear Internet, we need better image archives

- October 3, 2011 in Guest post, Ideas, Public Domain, WG Public Domain

The following guest post is by Nina Paley, cartoonist and blogger. Nina is a member of the OKF’s Working Group on the Public Domain.

Dear Internet, You know what should be really easy to find online? Good quality, Public Domain vintage illustrations. You know, things like this: I found this on Flickr, where someone claims full copyright on it. That’s copyfraud, but understandable because Flickr’s default license is full copyright (all the more reason to ignore copyright notices!).

But copyfraud isn’t the main problem. The main problem is that images like this are painfully difficult to find online, especially at high resolutions (and this image is only available at medium resolution – up to 604 pixels high, which is barely usable for most purposes but higher than much of what you find online). The images are out there – and with zillions of antique books being scanned, their vintage illustrations are being scanned right along with them. But the images are buried in the text, and often the scan quality is poor. Images should be scanned at high quality, and tagged for searchability.

Are archives ignoring the value of images?

Take the American Memory archive of the Library of Congress. Lots and lots of historical documents here, but no way for me to find an image of, say, a horse. Most book-scanning projects focus on texts, not illustrations. Many interesting and useful illustrations are buried within these scans, uncatalogued and inaccessible. Scan quality is set for text, not illustrations, so even if one can find a choice illustration buried within, its quality is usually too low to use.

 is great (I love you,!) but does not have an image archive. Still images are not among their “Media Types” (which consist of Moving Images, Texts, Audio, Software, and Education). So I went spelunking through their texts, starting with “American Libraries,” and searched for something easy: “horse.” Surely I could find a nice usable etching of a horse in there somewhere. I eventually found “The Harness Horse” by Sir Walter Gilbey, from 1898. Nice illustrations! Can I use them?

Unfortunately, no. The book is downloadable as PDF and various e-publication formats, but when I try to extract the illustrations, I get a mess. Copied and pasted from Adobe Acrobat? WTF. The same image, inverted? Doesn’t work. “Save Image as…” from Acrobat worked, except where it didn’t: part of the image is simply missing. Clearly something is messed up here. Was it just that page? Alas, no: a sad image from another page has the same problem. The scans have some flaws that PDFs and Photoshop can’t cope with: what looks like a blur in the PDF renders the image unusable when extracted. These images are not useable, which is a pity because they are very nice illustrations. And they seem to be among the higher-quality scans, which again isn’t saying much. Let me add that it’s great these books are being scanned at all! That’s definitely better than losing them entirely.
But as an artist, it saddens me that we’re neglecting this wealth of visual art. I’d like to see our rich visual history properly archived. Our bias favoring text over pictures is especially ironic considering how much more efficiently information is communicated to humans through images; “A picture is worth a thousand words,” or more. That’s why I’m a cartoonist, after all.

I was able to extract one clean image from the book, on page 48. Unfortunately I can’t use this illustration for my purposes, but maybe someone else can. I’ve already gone through the trouble of finding it in a text, extracting it, and rotating it. If only there were some image archive I could upload it to at high resolution, so someone else could use it. I could tag it, to make it easier to find. I could include all kinds of useful metadata, like what book it was from and when it was published; but even if that was too bothersome, I could at least include tags like “horse,” “rider” and “engraving.” Wouldn’t it be nice if such an archive existed? Wikimedia Commons is close, although I dread uploading things there after having all my open-licensed comics deleted by an overzealous editor. But maybe they’re our best hope.

Continuing my searches on, I found this ostensibly Public Domain, vintage horse book with line illustrations. Unfortunately this is controlled by Google Books. It’s “free” to read online in Google’s reader, which doesn’t allow any image export. It also doesn’t allow me to zoom in. All those illustrations, trapped at low resolution, unusable (even if they were tagged/catalogued, which they aren’t). This is our “Public Domain.” Who exactly is benefiting from having these 18th Century illustrations inaccessible to today’s artists?

Then there’s Dover Books. I loved Dover books growing up – they introduced me to the idea of the Public Domain. Dover reproduces vintage illustrations in books for artists and designers. Their paper books were reasonably priced, and you could use the illustrations for anything, without restriction. Browsing was free, so I would flip through the pages in the book store, and if it had what I needed, I’d buy it. Dover is still selling books, but the prices are now relatively high, few are carried in bookstores, and they prohibit browsing online. You have to shell out $15 to find out if what you need is in the book, and how could you know? They seem to be clinging to an outdated copyright model, and rather than selling things of added value, they are simply blocking access to existing Public Domain works, in order to collect a toll.

What else has kept a good public archive of Public Domain images from existing? Some artists and archivists do make high quality scans of vintage illustrations – and keep them to themselves. I guess we could call this “image hoarding.” I assume the reasoning is, “I went through all the trouble to scan it, why should I share? Others can pay me if they want a copy.” Also there’s the “finders, keepers” reasoning: “anyone else is free to find the same illustration in another antique book, but I found this one, so it’s mine.” And so these images remain inaccessible, not part of any public archive.

Wikimedia Commons is the best public image archive I know of right now. A bit of searching led me to their “Engravings of Horses” category, which yielded some nice images. Unfortunately, many of these are not available at sufficiently high resolutions. The maximum size of this image is 800 × 608 pixels, which limits its use. Limited image sizes and limited selection have been the biggest obstacles to my relying more on Wikimedia Commons; but it can get better. Maybe it will. It would be nice if something became the public vintage image archive I and so many other artists need.

Data-Driven Journalism In A Box: what do you think needs to be in it?

- September 12, 2011 in Data Journalism, ddj, ejc, Events, Ideas, mozilla

The following post is from Liliana Bounegru (European Journalism Centre), Jonathan Gray (Open Knowledge Foundation), and Michelle Thorne (Mozilla), who are planning a Data-Driven Journalism in a Box session at the Mozilla Festival 2011, which we recently blogged about here. This is cross-posted at and on the Mozilla Festival Blog.

We’re currently organising a session on Data-Driven Journalism in a Box at the Mozilla Festival 2011, and we want your input! In particular:
  • What skills and tools are needed for data-driven journalism?
  • What is missing from existing tools and documentation?
If you’re interested in the idea, please come and say hello on our data-driven-journalism mailing list! Following is a brief outline of our plans so far…

What is it?

The last decade has seen an explosion of publicly available data sources – from government databases, to data from NGOs and companies, to large collections of newsworthy documents. There is increasing pressure on journalists to be equipped with the tools and skills to bring value from these data sources to the newsroom and to their readers.

But where can you start? How do you know what tools are available, and what those tools are capable of? How can you harness external expertise to help make sense of complex or esoteric data sources? How can you take data-driven journalism into your own hands and explore this promising, yet often daunting, new field?

A group of journalists, developers, and data geeks want to compile a Data-Driven Journalism In A Box: a user-friendly kit that includes the most essential tools and tips for data. What is needed to find, clean, sort, create, and visualize data – and ultimately produce a story out of data? There are many tools and resources already out there, but we want to bring them together into one easy-to-use, neatly packaged kit, specifically catered to the needs of journalists and news organisations. We also want to draw attention to missing pieces and encourage sprints to fill in the gaps, as well as tighten documentation.

What’s needed in the Box?

  • Introduction
    • What is data?
    • What is data-driven journalism?
    • Different approaches: Journalist coders vs. Teams of hacks & hackers vs. Geeks for hire
    • Investigative journalism vs. online eye candy
  • Understanding/interpreting data
    • Analysis: resources on statistics, university course material, etc. (OER)
    • Visualization tools & guidelines – Tufte 101, bubbles or graphs?
  • Acquiring data
    • Guide to data sources
    • Methods for collecting your own data
    • FOI / open data
    • Scraping
  • Working with data
    • Guide to tools for non-technical people
    • Cleaning
  • Publishing data
    • Rights clearance
    • How to publish data openly
    • Feedback loop on correcting, annotating, adding to data
    • How to integrate data story with existing content management systems
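To give a flavour of the “Cleaning” item above, here is a minimal sketch of the kind of tidy-up step the kit might document: trimming whitespace, normalising placeholder cells, and dropping duplicate rows. The set of values treated as blank is an assumption for the example, not a standard.

```python
def clean_rows(rows):
    """Basic cleaning: trim whitespace, turn common placeholder
    values into None, and drop exact duplicate rows."""
    blanks = {"", "n/a", "N/A", "-"}   # assumed placeholders, not a standard
    seen, out = set(), []
    for row in rows:
        cleaned = tuple(None if c is None or c.strip() in blanks else c.strip()
                        for c in row)
        if cleaned not in seen:          # keep first occurrence only
            seen.add(cleaned)
            out.append(cleaned)
    return out

raw = [["  Alice ", "n/a"], ["Alice", "n/a"], ["Bob", " 42 "]]
print(clean_rows(raw))  # [('Alice', None), ('Bob', '42')]
```

Real newsroom data needs far more (dates, encodings, fuzzy deduplication), which is exactly why a guide belongs in the box.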

What bits are already out there?

What bits are missing?

  • Tools that are shaped to newsroom use
  • Guide to browser plugins
  • Guide to web-based tools

Opportunities with Data-Driven Journalism:

  • Reduce costs and time by building on existing data sources, tools, and expertise.
  • Harness external expertise more effectively
  • Towards more trust in and accountability of journalistic outputs, by publishing supporting data with stories. Towards a “scientific journalism” approach that appreciates transparent, empirically-backed sources.
  • News outlets can find their own story leads rather than relying on press releases
  • Increased autonomy when journalists can produce their own datasets
  • Local media can better shape and inform media campaigns. Information can be tailored to local audiences (hyperlocal journalism)
  • Increase traffic by making sense of complex stories with visuals.
  • Interactive data visualizations allow users to see the big picture & zoom in to find information relevant to them
  • Improved literacy. Better understanding of statistics, datasets, how data is obtained & presented.
  • Towards employable skills.