
Exposing legacy project datasets in Digital Humanities: King’s Digital Lab experience

- July 22, 2020 in Labs, Open Research

This is a repost of a blog post published by Arianna Ciula on the King’s Digital Lab blog on July 7, 2020.
In this blogpost we share our experiences at King’s Digital Lab (KDL) of using the Open Knowledge Foundation’s open source data portal platform CKAN to catalogue and make visible some of our legacy projects’ data. While we can call the process a success overall (and you can read more about it in this article and in the summary of our current archiving and sustainability approach), the road has been bumpy and we stumbled across some interesting challenges along the way.
KDL adopted CKAN following an assessment of the institutional repository in place at the time, as well as comparisons of research data management platforms in the literature (e.g. on ‘data FAIRification’ see van Erp, J. A. et al. 2018). This solution might change over time (including data migration, or mapping to and aggregation in other repositories), but at the moment it is fit for purpose: it provides a metadata catalogue that stores, or points to, some of our legacy projects’ datasets and their associated contextual documentation. These were not accessible before, so the catalogue substantially expands the potential for data and resources to be discovered, re-used and critiqued.
First things first, a step back to what KDL is about and what the data we inherited and produced entail. KDL builds on a recent yet relatively long history – for the field of Digital Humanities – of creating tools and web resources in collaboration with researchers in the arts and humanities as well as the cultural heritage sector. While KDL started operation as a team of Research Software Engineers within the Faculty of the Arts and Humanities at King’s College London (UK) in 2015, some of the projects we inherited were developed 5, 10 or even 20 years before the Lab’s existence. Out of the ca. 100 legacy projects, some started in the late 1990s or early 2000s out of many collaborative projects led or co-led by the Department of Digital Humanities (DDH). The tools and resources KDL inherited span a wide spectrum from text analysis and annotation tools, digital corpora of texts, images and musical scores to digital editions, historical databases and layered maps.
The resources you will find in the KDL CKAN instance aren’t numerous, but our plan is to increase their number as further support is obtained for a project undertaken by KDL (in particular with the involvement of Samantha Callaghan, Paul Caton, Arianna Ciula, Neil Jakeman, Brian Maher, Pam Mellen, Miguel Vieira and Tim Watts) in collaboration with colleagues and students at the Department of Digital Humanities (Paul Spence, Kristen Shuster, Minmin Yu). This work was a continuation of the wider archiving and sustainability effort described in the Smithies et al. 2019 article and was possible through a seed fund grant offered by DDH, complemented by a student internship on the MA in Digital Humanities.
The datasets and resources we collected and catalogued range from summarised (so-called ‘calendared’) editions of medieval documents to collections of modern correspondence, and from ontologies adapted to express complex entities and relations in medieval documents to corresponding guides for encoding and data modelling. The default CKAN mask, mapped to international cataloguing standards, allows the capture of important dataset information such as creator and maintainer details, version etc. However, given that our legacy datasets are mainly project-based, we also decided to enhance the catalogue with project metadata (see related code at our GitHub repository), ranging from information about the collaborative teams (typically including academics, archivists, designers, software engineers and analysts) to details on funders and period of activity. This slight modification of the data ingest form was then re-used in a currently active project, MaDiH (مديح): Mapping Digital Cultural Heritage in Jordan, which is scoping the landscape of Jordanian cultural heritage datasets and also opted for a KDL-hosted CKAN instance as its core solution architecture (the code associated with other MaDiH-specific CKAN extensions, including detailed tagging for time periods and data types, is available at this other GitHub repository).

The CKAN mask for KDL instance (part 1)

The CKAN mask for KDL instance (part 2)
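To make the idea of project-level metadata concrete, here is a minimal sketch of how a catalogue entry with extra project fields could be assembled for CKAN's `package_create` action. It is an illustration under stated assumptions, not KDL's actual extension: the `project_` key prefix, the helper name `build_dataset_payload` and all field values are hypothetical, and the generic `extras` key/value list stands in for whatever schema the real ingest form uses.

```python
import json


def build_dataset_payload(name, title, project):
    """Build a JSON-serialisable body for CKAN's package_create action.

    Project-level fields are stored as generic key/value "extras".
    The "project_" prefix is an assumption made for this sketch.
    """
    extras = [
        {"key": f"project_{key}", "value": str(value)}
        for key, value in sorted(project.items())
    ]
    return {"name": name, "title": title, "extras": extras}


# Hypothetical example values, not taken from the KDL catalogue.
payload = build_dataset_payload(
    name="example-legacy-project",
    title="Example Legacy Project",
    project={
        "funder": "Example Funder",
        "period": "2005-2008",
        "team": "academics; archivists; software engineers",
    },
)

# This string is what would be sent in the body of an authenticated POST
# to <ckan-site>/api/3/action/package_create.
body = json.dumps(payload)
```

Keeping the project fields in one flat list makes them easy to map to another repository later, which matters if the catalogue is ever migrated or aggregated elsewhere.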
What follows is the workflow we implemented for our cataloguing project:
  1. Dataset and resource selection
  2. Preliminary data entries by analyst and/or supervised student
  3. Internal peer review
  4. Communications with partners providing project overview, outline of benefits, some technical details (information on CKAN, list of resources to be exposed, license for the data, preview details) and requesting consent
  5. Data publication (if consent obtained)
  6. Public comms and dissemination (e.g. on social media)
  7. Creation of Digital Object Identifier at dataset level via DataCite membership of King’s College London library
  8. Update of citation field when DOI obtained
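The two gates in this workflow (publication only with consent, and a citation update once a DOI exists) can be sketched in a few lines. This is only an illustration: the class and function names are hypothetical, the DOI is a placeholder, and the rule that a DOI is recorded only for published entries is our reading of the step order rather than a documented KDL policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogueEntry:
    title: str
    reviewed: bool = False      # step 3: internal peer review
    consent: bool = False       # step 4: partner consent
    published: bool = False     # step 5: data publication
    doi: Optional[str] = None   # step 7: DOI at dataset level
    citation: str = ""          # step 8: citation field


def publish(entry):
    """Publish only after peer review and partner consent."""
    if not (entry.reviewed and entry.consent):
        return False
    entry.published = True
    return True


def assign_doi(entry, doi):
    """Record the DOI and refresh the citation field."""
    if not entry.published:
        # Assumption for this sketch: DOIs follow publication.
        raise ValueError("entry must be published before a DOI is recorded")
    entry.doi = doi
    entry.citation = f"{entry.title}. https://doi.org/{doi}"


entry = CatalogueEntry("Example Dataset")
blocked = publish(entry)            # no review or consent yet
entry.reviewed = entry.consent = True
released = publish(entry)
assign_doi(entry, "10.1234/example")  # placeholder DOI
```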
With respect to points 1 and 2, the cataloguing information provided for each project (equivalent to a dataset in CKAN parlance) is rather high level. Even at this minimally functional level, however, digging into legacy documentation is more often than not far from trivial: it requires making tacit knowledge within the lab explicit, or contacting partners to elicit further context, information and rationale for resource selection and ingestion. For example, despite KDL legacy projects being informed by best practices in digital humanities, such as use of standards and general openness to data re-use, licences were not always agreed at the time of data creation, in some cases leaving room for interpretation or substantial discussions regarding data ownership and exposure. In addition, in academic research, even when projects are long completed and unfunded, collected and created data often continue to be manipulated and analysed to inform further research and new arguments. While we had agreed to expose data which were considered ‘complete’, multiple versions of the ‘same’ resource often co-exist (adequately time-stamped or contextualised in narrative form) to showcase the constructed nature of this material and their workflows.
Data exposure and publication has now become a key element in King’s Digital Lab’s approach to project development as well as to our archiving and sustainability model. Dataset deposit within the Lab, as part of institutional technical systems, or in external repositories is an option assessed at several stages of a project lifecycle, from initial conversations with project partners when discussing a new project idea to the post-funding phase and maintenance of legacy projects (see more on our approach in this guidance to research data management). At the moment, data publication on the KDL CKAN instance mainly addresses the issue of hidden datasets for our legacy projects; however, cataloguing project metadata and exposing project datasets via CKAN is also one of the options KDL currently offers to new project partners.
Not only can shifting from systems to data ease the maintenance burden of many long-running projects, but it also opens up possibilities for data re-use, verification and integration beyond siloed resources. Data exposure is, however, not enough to ensure access, and should not mask the need for attention to standards, workflows, systems and services (see the recent ALLEA report on “Sustainable and FAIR Data Sharing in the Humanities”). Tailoring project solutions to research questions and domains, while at the same time attempting to align to existing community standards within the Linked Open Data paradigm, continues to be a challenging yet fruitful area of research and ongoing activity at KDL. For example, our research software engineers are currently working towards integrating the web framework most used in KDL’s technical stack, Django, with relevant APIs to align to specific standards (e.g. bibliographic RDF data models; Linked Open Data resources for people and location entities) or to extend them as needed, with project code published on relevant software repositories under an open licence. Sleeves up, as there is a lot of work still to be done…

An approach to building open databases

- August 10, 2017 in Labs, Open Data

This post has been co-authored by Adam Kariv, Vitor Baptista, and Paul Walsh.
Open Knowledge International (OKI) recently coordinated a two-day work sprint as a way to touch base with partners in the Open Data for Tax Justice project. Our initial writeup of the sprint can be found here.
Phase I of the project ended in February 2017 with the publication of What Do They Pay?, a white paper that outlines the need for a public database on the tax contributions and economic activities of multinational companies. The overarching goal of the sprint was to start some work towards such a database, by replicating data collection processes we’ve used in other projects, and to provide a space for domain expert partners to potentially use this data for some exploratory investigative work. We had limited time and a limited budget, and we are pleased with the discussions and ideas that came out of the sprint.
One attendee, Tim Davies, criticised the approach we took in the technical stream of the sprint. The problem with the criticism is the extrapolation of one stream of activity during a two-day event to posit an entire approach to a project. We think exploration and prototyping should be part of any healthy project, and that is exactly what we did with our technical work in the two-day sprint.
Reflecting on the discussion presents a good opportunity to look more generally at how we, as an organisation, bring technical capacity to projects such as Open Data for Tax Justice. Of course, we often bring much more than technical capacity to a project, and Open Data for Tax Justice is no different in that regard, being mostly a research project to date. In particular, we’ll take a look at the technical approach we used for the two-day sprint. While this is not the only approach towards technical projects we employ at OKI, it has proven useful on projects driven by the creation of new databases.

An approach

Almost all projects that OKI either leads on, or participates in, have multiple partners. OKI generally participates in one of three capacities (sometimes, all three):
  • Technical design and implementation of open data platforms and apps.
  • Research and thought leadership on openness and data.
  • Dissemination and facilitating participation, often by bringing the “open data community” to interact with domain specific actors.
Only the first capacity is strictly technical, but each capacity does, more often than not, touch on technical issues around open data. Some projects have an important component around the creation of new databases targeting a particular domain. Open Data for Tax Justice is one such project, as are OpenTrials, and the Subsidy Stories project, which itself is a part of OpenSpending. While most projects have partners, usually domain experts, it does not mean that collaboration is consistent or equally distributed over the project life cycle. There are many reasons for this to be the case, such as the strengths and weaknesses of our team and those of our partners, priorities identified in the field, and, of course, project scope and funding. With this as the backdrop for projects we engage in generally, we’ll focus for the rest of this post on aspects when we bring technical capacity to a project. As a team (the Product Team at OKI), we are currently iterating on an approach in such projects, based on the following concepts:
  • Replication and reuse
  • Data provenance and reproducibility
  • Centralise data, decentralise views
  • Data wrangling before data standards
While not applicable to all projects, we’ve found this approach useful when contributing to projects that involve building a database to, ultimately, unlock the potential to use data towards social change.

Replication and reuse

We highly value the replication of processes and the reuse of tooling across projects. Replication and reuse enable us to reduce technical costs, focus more on the domain at hand, and share knowledge on common patterns across open data projects. In terms of technical capacity, the Product Team is becoming quite effective at this, with a strong body of processes and tooling ready for use. This also means that each project enables us to iterate on such processes and tooling, integrating new learnings. Many of these learnings come from interactions with partners and users, and others come from working with data.
In the recent Open Data for Tax Justice sprint, we invited various partners to share experiences working in this field and try a prototype we built to extract data from country-by-country reports to a central database. It was developed in about a week, thanks to the reuse of processes and tools from other projects and contexts. When our partners started looking into this database, they had questions that could only be answered by looking back to the original reports. They needed to check the footnotes and other context around the data, which weren’t available in the database yet. We’ve encountered similar use cases in other projects, including OpenTrials, so we can build upon these experiences to iterate towards a reusable solution for the Open Data for Tax Justice project. By doing this enough times in different contexts, we’re able to solve common issues quickly, freeing more time to focus on the unique challenges each project brings.

Data provenance and reproducibility

We think that data provenance, and reproducibility of views on data, is absolutely essential to building databases with a long and useful future. What exactly is data provenance? A useful definition from wikipedia is “… (d)ata provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins”. Depending on the way provenance is implemented in a project, it can also be a powerful tool for reproducibility of the data. Most work around open data at present does not consider data provenance and reproducibility as an essential aspect of working with open data. We think this is to the detriment of the ecosystem’s broader goals of seeing open data drive social change: the credible use of data from projects with no provenance or reproducibility built in to the creation of databases is significantly diminished in our “post truth” era. Our current approach builds data provenance and reproducibility right into the heart of building a database. There is a clear, documented record of every action performed on data, from the extraction of source data, through to normalisation processes, and right to the creation of records in a database. The connection between source data and processed data is not lost, and, importantly, the entire data pipeline can be reproduced by others. We acknowledge that a clear constraint of this approach, in its current form, is that it is necessarily more technical than, say, ad hoc extraction and manipulation with spreadsheets and other consumer tools used in manual data extraction processes. However, as such approaches make data provenance and reproducibility harder because there is no history of the changes made or where the data comes from, we are willing to accept this more technical approach and iterate on ways to reduce technical barriers. 
We hope to see more actors in the open data ecosystem integrating provenance and reproducibility right into their data work. Without doing so, we greatly reduce the ability for open data to be used in an investigative capacity, and likewise, we diminish the possibility of using the outputs of open data projects in the wider establishment of facts about the world. Recent work on beneficial ownership data takes a step in this direction, leveraging the PROV-DM standard to declare data provenance facts.
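To illustrate what "a record of every action performed on data" can look like in practice, here is a minimal sketch of a pipeline that fingerprints its data before and after each step. It is not the tooling we use in our projects: the function names, the sample row and the choice of SHA-256 over a JSON dump are all assumptions made for the example.

```python
import hashlib
import json


def fingerprint(obj):
    """Stable SHA-256 digest of any JSON-serialisable value."""
    encoded = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def run_pipeline(source_rows, steps):
    """Apply each step in order, logging what went in and what came out."""
    log = []
    rows = source_rows
    for step in steps:
        before = fingerprint(rows)
        # Copy each row so the source data is never modified in place.
        rows = [step(dict(row)) for row in rows]
        log.append(
            {"step": step.__name__, "input": before, "output": fingerprint(rows)}
        )
    return rows, log


def strip_whitespace(row):
    return {key: value.strip() for key, value in row.items()}


def parse_amount(row):
    row["amount"] = int(row["amount"].replace(",", ""))
    return row


# Hypothetical source row, standing in for data extracted from a report.
source = [{"company": " Example Corp ", "amount": "1,200"}]
records, provenance = run_pipeline(source, [strip_whitespace, parse_amount])
```

Because each log entry's input digest matches the previous entry's output digest, anyone can rerun the same steps on the same source and check that they arrive at the same records.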

Centralise data, decentralise views

In OpenSpending, OpenTrials, and our initial exploratory work on Open Data for Tax Justice, there is an overarching theme to how we have approached data work, user stories and use cases, and co-design with domain experts: “centralise data, decentralise views”. Building a central database for open data in a given domain affords ways of interacting with such data that are extremely difficult, or impossible, by actively choosing to decentralise such data. Centralised databases make investigative work that uses the data easier, and allows for the discovery, for example, of patterns across entities and time that can be very hard to discover if data is decentralised. Additionally, by having in place a strong approach to data provenance and reproducibility, the complete replication of a centralised database is relatively easily done, and very much encouraged. This somewhat mitigates a major concern with centralised databases, being that they imply some type of “vendor lock-in”. Views on data are better when decentralised. By “views on data” we refer to visualisations, apps, websites – any user-facing presentation of data. While having data centralised potentially enables richer views, data almost always needs to be presented with additional context, localised, framed in a particular narrative, or otherwise presented in unique ways that will never be best served from a central point. Further, decentralised usage of data provides a feedback mechanism for iteration on the central database. For example, providing commonly used contextual data, establishing clear use cases for enrichment and reconciliation of measures and dimensions in the data, and so on.
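As a small sketch of "centralise data, decentralise views", the example below keeps one central table and derives two independent views from it. The table layout, the figures and the function names are hypothetical; in a real project the views would be separate apps or sites reading from the shared database rather than two functions in one script.

```python
import sqlite3

# The central database: one shared table of (hypothetical) payments.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (entity TEXT, year INTEGER, amount INTEGER)")
db.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [("A", 2015, 10), ("A", 2016, 30), ("B", 2015, 5), ("B", 2016, 5)],
)


def totals_by_entity(conn):
    """View 1: a per-entity summary, e.g. for a league table."""
    query = "SELECT entity, SUM(amount) FROM payments GROUP BY entity"
    return dict(conn.execute(query))


def totals_by_year(conn):
    """View 2: a per-year summary, e.g. for a trend chart."""
    query = "SELECT year, SUM(amount) FROM payments GROUP BY year"
    return dict(conn.execute(query))


entity_view = totals_by_entity(db)
year_view = totals_by_year(db)
```

Both views look across entities and time, which is exactly the kind of question that is hard to answer when each entity's data lives in a separate place.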

Data wrangling before data standards

As a team, we are interested in, engage with, and also author, open data standards. However, we are very wary of efforts to establish a data standard before working with large amounts of data that such a standard is supposed to represent. Data standards that are developed too early are bound to make untested assumptions about the world they seek to formalise (the data itself). There is a dilemma here of describing the world “as it is”, or, “as we would like it to be”. No doubt, a “standards first” approach is valid in some situations. Often, it seems, in the realm of policy. We do not consider such an approach flawed, but rather, one with its own pros and cons. We prefer to work with data, right from extraction and processing, through to user interaction, before working towards public standards, specifications, or any other type of formalisation of the data for a given domain. Our process generally follows this pattern:
  • Get to know available data and establish (with domain experts) initial use cases.
  • Attempt to map what we do not know (e.g.: data that is not yet publicly accessible), as this clearly impacts both usage of the data, and formalisation of a standard.
  • Start data work by prescribing the absolute minimum data specification to use the data (i.e.: meet some or all of the identified use cases).
  • Implement data infrastructure that makes it simple to ingest large amounts of data, and also to keep the data specification reactive to change.
  • Integrate data from a wide variety of sources, and, with partners and users, work on ways to improve participation / contribution of data.
  • Repeat the above steps towards a fairly stable specification for the data.
  • Consider extracting this specification into a data standard.
Throughout this entire process, there is a constant feedback loop with domain expert partners, as well as a range of users interested in the data.
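The "absolute minimum data specification" step can be as modest as a short list of required fields that grows as use cases firm up. The sketch below is one hypothetical way to express that; the field names and helper functions are invented for illustration and are not an existing standard or library.

```python
# Version 1 of a hypothetical specification: only what the first use
# cases need. Extra fields in the data are tolerated, not rejected.
MINIMAL_SPEC = {"entity": str, "year": int}


def conforms(row, spec):
    """True if the row has every field in the spec with the right type."""
    return all(
        name in row and isinstance(row[name], kind)
        for name, kind in spec.items()
    )


def extend_spec(spec, **new_fields):
    """Return a new, larger spec without touching the earlier version."""
    return {**spec, **new_fields}


rows = [
    {"entity": "A", "year": 2015, "note": "restated figures"},
    {"entity": "B"},  # missing year: not usable yet
]
accepted = [row for row in rows if conforms(row, MINIMAL_SPEC)]

# After working with more data, a later iteration adds a field.
spec_v2 = extend_spec(MINIMAL_SPEC, note=str)
```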


We want to be very clear that we do not think that the above approach is the only way to work towards a database in a data-driven project. Design (project design, technical design, interactive design, and so on) emerges from context. Design is also a sequence of choices, and each choice has an opportunity cost based on various constraints that are present in any activity. In projects we engage in around open databases, technology is a means to other, social ends. Collaboration around data is generally facilitated by technology, but we do not think the technological basis for this collaboration should be limited to existing consumer-facing tools, especially if such tools have hidden costs on the path to other important goals, like data provenance and reproducibility. Better tools and processes for collaboration will only emerge over time if we allow exploration and experimentation. We think it is important to understand general approaches to working with open data, and how they may manifest within a single project, or across a range of projects. Project work is not static, and definitely not reducible to snapshots of activity within a wider project life cycle. Certain approaches emphasise different ends. We’ve tried above to highlight some pros and cons of our approach, especially around data provenance and reproducibility, and data standards. In closing, we’d like to invite others interested in approaches to building open databases to engage in a broader discussion around these themes, as well as a discussion around short term and long term goals of such projects. From our perspective, we think there could be a great deal of value for the ecosystem around open data generally – CSOs, NGOs, governments, domain experts, funders – via a proactive discussion or series of posts with a multitude of voices. Join the discussion here if this is of interest to you.

Hello, Code for Germany

- February 27, 2014 in civic tech, deutschlandweit, Featured, Hackday, Labs, offene Daten, Open Data, Open Knowledge Foundation, städte

Just in time for International Open Data Day on Saturday, we have launched a new project: Code for Germany. Our goal is to bring together teams of developers and designers in various German cities to work hands-on on open data and civic tech projects. These “OK Labs” will meet regularly to advance projects with a local focus. We are addressing these groups in particular because we believe it is important to demonstrate the value of data, and the need for further releases, in practice rather than only in theory in discussion circles. We are also aware that OKF in Germany has so far mainly been active in Berlin; Code for Germany is meant to change that fundamentally.
Our strong partner is Code for America, which now runs its ‘Brigades’ in 50 cities in the USA and in many other countries. They advise us on building the network and make their materials, experience and public outreach available to us. The project is financially supported by Google.
How can you take part? The pilot cities for Code for Germany are Hamburg, Berlin, Münster, Ulm, Heilbronn, Cologne and Bremen; each already has a team that will meet regularly from now on. Once we have gathered some experience this way, we want to start a second group of Labs in the summer. Wherever you live, you should sign up on the project page now.

Introducing Open Knowledge Foundation Labs

- July 9, 2013 in Featured, Labs, News, OKF

Today we’re pleased to officially launch Open Knowledge Foundation Labs, a community home for civic hackers, data wranglers and anyone else intrigued and excited by the possibilities of combining technology and open information for good: making government more accountable, culture more accessible and science more efficient.
Labs is about “making” (whether that’s apps, insights or tools) using open data, open content and free / open source software. And you don’t need to be an uber-geek to participate: interest and a willingness to get your hands dirty (digitally), be that with making, testing or helping, is all that’s needed, although we do allow lurking on the mailing list ;-)
Join in now! Sign up on the mailing list, follow us on Twitter or read more about what we’re up to and how you can get involved.

Find out more

For the full picture of what Labs is up to, check out its projects page and the list of ideas for projects. Highlights include:
  • ReclineJS, a library for building data-driven web applications in pure JavaScript
  • Annotator, an open-source JavaScript library and tool that can be added to any webpage to make it annotatable
  • Nomenklatura, a simple service that makes it easy to maintain a canonical list of entities such as persons, companies or streets, and to match messy input against that list
  • PyBossa, a platform for crowd-sourcing online volunteer assistance on tasks that require human intelligence, which powers CrowdCrafting
Labs is part of the Open Knowledge Foundation Network and operates as a collaborative community, which anyone can join.
While some of you may have noticed we’ve been operating unannounced and somewhat under the radar for some time, we recently revamped the website and decided that it was high time to officially cut the ribbon and open our doors. If you’re interested in making things with open data or open content, we hope you’ll come and say hello.

What Do We Mean By Small Data

- April 26, 2013 in Featured, Ideas and musings, Labs, Open Data, Small Data

Earlier this week we published the first in a series of posts on small data: “Forget Big Data, Small Data is the Real Revolution”. In this second in the series, we discuss small data in more detail providing a rough definition and drawing parallels with the history of computers and software. What do we mean by “small data”? Let’s define it crudely as:
“Small data is the amount of data you can conveniently store and process on a single machine, and in particular, a high-end laptop or server”
Why a laptop? What’s interesting (and new) right now is the democratisation of data and the associated possibility of a large-scale distributed community of data wranglers working collaboratively. What matters here then is, crudely, the amount of data that an average data geek can handle on their own machine, their own laptop.
A key point is that the dramatic advances in computing, storage and bandwidth have far bigger implications for “small data” than for “big data”. The recent advances have increased the realm of small data, the kind of data that an individual can handle on their own hardware, far more relatively than they have increased the realm of “big data”. Suddenly, working with significant datasets (datasets containing tens of thousands, hundreds of thousands or millions of rows) can be a mass-participation activity.
(As should be clear from the above definition, and any recent history of computing, small and big are relative terms that change as technology advances. For example, in 1994 a terabyte of storage cost several hundred thousand dollars; today it’s under a hundred. This also means today’s big is tomorrow’s small.)
Our situation today is similar to microcomputers in the late 70s and early 80s or the Internet in the 90s. When microcomputers first arrived, they seemed puny in comparison to the “big” computing and “big” software then around, and there was nothing, strictly, they could do that existing computing could not. However, they were revolutionary in one fundamental way: they made computing a mass-participation activity. Similarly, the Internet was not new in the 1990s (it had been around in various forms for several decades), but it was at that point that it became available at mass scale to the average developer (and ultimately citizen). In both cases “big” kept on advancing too, be it supercomputers or high-end connectivity, but the revolution came from “small”.
This (small) data revolution is just beginning. The tools and infrastructure to enable effective collaboration and rapid scaling for small data are in their infancy, and the communities with the capacities and skills to use small data are in their early stages. Want to get involved in taking the small data revolution forward? Sign up now.
This is the second in a series of posts about the power of Small Data – follow the Open Knowledge Foundation blog, Twitter or Facebook to learn more and join the debate at #SmallData on Twitter.

Frictionless Data: making it radically easier to get stuff done with data

- April 24, 2013 in Featured, Ideas and musings, Labs, Open Data, Open Standards, Small Data, Technical

Frictionless Data is now in alpha, and we’d like you to get involved. Our mission is to make it radically easier to make data used and useful; our immediate goal is to make it as simple as possible to get the data you want into the tool of your choice.
This isn’t about building a big datastore or a data management system. It’s simply about saving people from repeating all the same tasks of discovering a dataset, getting it into a format they can use and cleaning it up, all before they can do anything useful with it! If you’ve ever spent the first half of a hackday just tidying up tabular data and getting it ready to use, Frictionless Data is for you.
Our work is based on a few key principles:
  • Narrow focus — improve one small part of the data chain, standards and tools are limited in scope and size
  • Build for the web – use formats that are web “native” (JSON) and work naturally with HTTP (plain-text, CSV is streamable etc)
  • Distributed not centralised — designed for a distributed ecosystem (no centralized, single point of failure or dependence)
  • Work with existing tools — don’t expect people to come to you, make this work with their tools and their workflows (almost everyone in the world can open a CSV file, every language can handle CSV and JSON)
  • Simplicity (but sufficiency) — use the simplest formats possible and do the minimum in terms of metadata but be sufficient in terms of schemas and structure for tools to be effective
We believe that making it easy to get and use data and especially open data is central to creating a more connected digital data ecosystem and accelerating the creation of social and commercial value. This project is about reducing friction in getting, using and connecting data, making it radically easier to get data you need into the tool of your choice. Frictionless Data distills much of our learning over the last 7 years into some specific standards and infrastructure.

What’s the Problem?

Today, when you decide to cook, the ingredients are readily available at local supermarkets or even already in your kitchen. You don’t need to travel to a farm, collect eggs, mill the corn, cure the bacon etc, as you once would have done! Instead, thanks to standard systems of measurement, packaging, shipping (e.g. containerization) and payment, ingredients can get from the farm direct to your local shop or even your door.
But with data we’re still largely stuck at this early stage: every time you want to do an analysis or build an app you have to set off around the internet to dig up data, extract it, clean it and prepare it before you can even get it into your tool and begin your work proper.
What do we need to do for working with data to be like cooking today, where you get to spend your time making the cake (creating insights) not preparing and collecting the ingredients (digging up and cleaning data)? The answer: radical improvements in the “logistics” of data associated with specialisation and standardisation. In analogy with food we need standard systems of “measurement”, packaging, and transport so that it’s easy to get data from its original source into the application where you can start working with it.

What’s Frictionless Data going to do?

We start with an advantage: unlike physical goods, digital information is very cheap to transport from one computer to another! This means the focus can be on standardizing and simplifying the process of getting data from one application to another (or one form to another). We propose work in 3 related areas:
  • Key simple standards. For example, a standardized “packaging” of data that makes it easy to transport and use (think of the “containerization” revolution in shipping)
  • Simple tooling and integration – you should be able to get data in these standard formats into or out of Excel, R, Hadoop or whatever tool you use
  • Bootstrapping the system with essential data – we need to get the ball rolling
frictionless data components diagram

What’s Frictionless Data today?

1. Data

We have some exemplar datasets which are useful for a lot of people – these are:
  • High Quality & Reliable
    • We have sourced, normalized and quality checked a set of key reference datasets such as country codes, currencies, GDP and population.
  • Standard Form & Bulk Access
    • All the datasets are provided in a standardized form and can be accessed in bulk as CSV together with a simple JSON schema.
  • Versioned & Packaged
    • All data is in data packages and is versioned using git so all changes are visible and data can be collaboratively maintained.

2. Standards

We have two simple data package formats, described as ultra-lightweight, RFC-style specifications. They build heavily on prior work. Simplicity and practicality were guiding design criteria.
  • Data package: minimal wrapping, agnostic about the data it’s “packaging”, designed for extension. This flexibility is good as it can be used as a transport for pretty much any kind of data but it also limits integration and tooling. Read the full Data Package specification.
  • Simple data format (SDF): focuses on tabular data only and extends data package (data in simple data format is a data package) by requiring data to be “good” CSVs and the provision of a simple JSON-based schema to describe them (“JSON Table Schema”). Read the full Simple Data Format specification.
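To make the idea of a "good" CSV wrapped with a small JSON descriptor concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not the specification itself: the descriptor layout (a `datapackage.json` file with `name`, `resources`, `path` and a `schema` listing `fields`) follows the general shape of a data package as we understand it, the helper names `write_package` and `read_package` are hypothetical, and only three field types are handled. The full specifications define what is actually required.

```python
import csv
import json
import tempfile
from pathlib import Path

# Assumed mapping from schema type names to Python casts (illustrative subset).
CASTS = {"string": str, "integer": int, "number": float}


def write_package(directory, name, fields, rows):
    """Write data.csv and a datapackage.json descriptor into `directory`."""
    directory = Path(directory)
    with (directory / "data.csv").open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([field["name"] for field in fields])  # header row
        writer.writerows(rows)
    descriptor = {
        "name": name,
        "resources": [{"path": "data.csv", "schema": {"fields": fields}}],
    }
    (directory / "datapackage.json").write_text(
        json.dumps(descriptor, indent=2), encoding="utf-8"
    )
    return descriptor


def read_package(directory):
    """Load the first resource and cast each cell using the schema."""
    directory = Path(directory)
    descriptor = json.loads(
        (directory / "datapackage.json").read_text(encoding="utf-8")
    )
    resource = descriptor["resources"][0]
    fields = resource["schema"]["fields"]
    with (directory / resource["path"]).open(newline="", encoding="utf-8") as f:
        return [
            {fld["name"]: CASTS[fld["type"]](row[fld["name"]]) for fld in fields}
            for row in csv.DictReader(f)
        ]


if __name__ == "__main__":
    # Made-up placeholder rows, not real reference data.
    demo_fields = [
        {"name": "code", "type": "string"},
        {"name": "population", "type": "integer"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        write_package(tmp, "demo", demo_fields, [["AA", 10], ["BB", 25]])
        print(read_package(tmp))
```

The point of the design is that the CSV stays readable by any tool on its own, while the descriptor adds just enough structure for a program to recover column types without guessing.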

3. Tools

It’s early days for Frictionless Data, so we’re still working on this bit! But there’s a need for validators, schema generators, and all kinds of integration. You can help out – see below for details or check out the issues on GitHub.
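As a rough idea of what a schema generator and a validator could look like, the sketch below infers a simple schema from CSV text and then checks rows against it. This is not an existing Frictionless Data tool: the function names `infer_type`, `generate_schema` and `validate` are hypothetical, the type names (`integer`, `number`, `string`) are an assumed subset, and the inference is deliberately naive (it assumes a rectangular CSV with a header row).

```python
import csv
import io

# Assumed type names and casts; a real schema format may define more.
CASTS = {"integer": int, "number": float, "string": str}


def infer_type(values):
    """Return the narrowest of integer, number, string that fits every value."""
    if not values:
        return "string"
    for type_name in ("integer", "number"):
        try:
            for value in values:
                CASTS[type_name](value)
            return type_name
        except ValueError:
            continue
    return "string"


def generate_schema(csv_text):
    """Build a {"fields": [...]} schema from a CSV with a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    # Assumes every body row has as many cells as the header.
    columns = [[row[i] for row in body] for i in range(len(header))]
    return {
        "fields": [
            {"name": name, "type": infer_type(column)}
            for name, column in zip(header, columns)
        ]
    }


def validate(csv_text, schema):
    """Return a list of human-readable problems; empty means the CSV fits."""
    errors = []
    fields = schema["fields"]
    rows = list(csv.reader(io.StringIO(csv_text)))
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(fields):
            errors.append(
                f"line {line_no}: expected {len(fields)} cells, got {len(row)}"
            )
            continue
        for field, cell in zip(fields, row):
            try:
                CASTS[field["type"]](cell)
            except ValueError:
                errors.append(
                    f"line {line_no}: {field['name']}={cell!r} is not {field['type']}"
                )
    return errors


if __name__ == "__main__":
    sample = "id,price,label\n1,2.5,a\n2,3,b\n"  # made-up sample data
    schema = generate_schema(sample)
    print(schema)
    print(validate("id,price,label\nx,3,c\n", schema))
```

Returning a list of problems instead of raising on the first one lets a publisher see every issue in a file in a single pass.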

Doesn’t this already exist?

People have been working on data for a while – doesn’t something like this already exist? The crude answer is yes and no. People, including folks here at the Open Knowledge Foundation, have been working on this for quite some time, and there are already some parts of the solution out there. Furthermore, many of these ideas are directly borrowed from similar work in software. For example, the Data Packages spec (first version in 2007!) builds heavily on packaging projects and specifications like Debian and CommonJS. Key distinguishing features of Frictionless Data:
  • Ultra-simplicity – we want to keep things as simple as they possibly can be. This includes formats (JSON and CSV) and a focus on end-user tool integration, so people can just get the data they want into the tool they want and move on to the real task
  • Web orientation – we want an approach that fits naturally with the web
  • Focus on integration with existing tools
  • Distributed and not tied to a given tool or project – this is not about creating a central data marketplace or similar setup. It’s about creating a basic framework that would enable anyone to publish and use datasets more easily and without going through a central broker.
Many of these are shared with (and derive from) other approaches but as a whole we believe this provides an especially powerful setup.

Get Involved

This is a community-run project coordinated by the Open Knowledge Foundation as part of Open Knowledge Foundation Labs. Please get involved:
  • Spread the word! Frictionless Data is a key part of the real data revolution – follow the debate on #SmallData and share our posts so more people can get involved

Forget Big Data, Small Data is the Real Revolution

- April 22, 2013 in Featured, Ideas and musings, Labs, Open Data, Small Data

There is a lot of talk about “big data” at the moment. For example, this is Big Data Week, which will see events about big data in dozens of cities around the world. But the discussions around big data miss a much bigger and more important picture: the real opportunity is not big data, but small data. Not centralized “big iron”, but decentralized data wrangling. Not “one ring to rule them all” but “small pieces loosely joined”. Big data smacks of the centralization fads we’ve seen in each computing era. The thought that ‘hey there’s more data than we can process!’ (something which is no doubt always true year-on-year since computing began) is dressed up as the latest trend with associated technology must-haves. Meanwhile we risk overlooking the much more important story here, the real revolution, which is the mass democratisation of the means of access, storage and processing of data. This story isn’t about large organisations running parallel software on tens of thousands of servers, but about more people than ever being able to collaborate effectively around a distributed ecosystem of information, an ecosystem of small data. Just as we now find it ludicrous to talk of “big software” – as if size in itself were a measure of value – we should, and will one day, find it equally odd to talk of “big data”. Size in itself doesn’t matter – what matters is having the data, of whatever size, that helps us solve a problem or address the question we have. For many problems and questions, small data in itself is enough. The data on my household energy use, the times of local buses, government spending – these are all small data. Everything processed in Excel is small data. When Hans Rosling shows us how to understand our world through population change or literacy he’s doing it with small data.
And when we want to scale up the way to do that is through componentized small data: by creating and integrating small data “packages” not building big data monoliths, by partitioning problems in a way that works across people and organizations, not through creating massive centralized silos. This next decade belongs to distributed models not centralized ones, to collaboration not control, and to small data not big data. Want to create the real data revolution? Come join our community creating the tools and materials to make it happen — sign up here:
This is the first in a series of posts about the power of Small Data – follow the Open Knowledge Foundation blog, Twitter or Facebook to learn more and join the debate at #SmallData on Twitter.

Open Interests Europe Hackathon in London, 24-25 November

- October 15, 2012 in Data Journalism, Events, Labs, Open Data, Sprint / Hackday

The European Journalism Centre and the Open Knowledge Foundation invite you to the Open Interests Europe Hackathon to track the lobbyists’ interests and money flows which shape European policy.
When: 24-25 November
Where: Google Campus Cafe, 4-5 Bonhill Street, EC2A 4BX London
How EU money is spent is an issue that concerns everyone who pays taxes to the EU. As the influence of Brussels lobbyists grows, it is increasingly important to draw the connections between lobbying, policy-making and funding. Journalists and activists need browsable databases, tools and platforms to investigate lobbyists’ influence and where the money goes in the EU. Join us and help build these tools! Open Interests Europe brings together developers, designers, activists, journalists and other geeks for two days of collaboration, learning, fun, intense hacking and app building.

The Lobby Transparency Challenge

Within any political process there are many interests wanting to be heard – companies, trade unions, NGOs – and Brussels is no exception. Corporate Europe Observatory, Friends of the Earth Europe and LobbyControl have begun to data-mine the lobby registers of the European Commission and of the European Parliament to find out who the lobbyists are, what they want and how much they are investing. You will have the exclusive opportunity to work with this data before it is made public in their upcoming portal. What can you do with this data? Group leader: Erik Wesselius is one of the co-founders of Corporate Europe Observatory. In the past few years, Erik has focused on issues related to lobbying transparency and regulation as well as EU economic governance. In 2005, Erik was active in the Dutch campaign for a No against the EU Constitution.

The Fish Subsidies Challenge

Subsidies paid to owners of fishing vessels and others working in the fishing industry under the European Union’s common fisheries policy amount to approximately €1 billion a year. EU Transparency gathered detailed data relating to payments and recipients of fisheries subsidies in every EU member state from multiple sources, from European Commission databases to member state government databases and inter-governmental fishery organizations such as ICCAT. What can you do with this data? Group leader: Jack Thurston is a policy analyst, activist, writer and broadcaster. He is co-founder of, winner of a Freedom of Information Award from Investigative Reporters and Editors.

Prizes and Jury

All participants will get the satisfaction of contributing to a cause that affects us all! Not only that, the winning team will be awarded a 100 EUR Amazon voucher, pre-ordered copies of the movie The Brus$els Business – Who Runs the European Union? (to be released this autumn) and copies of The Data Journalism Handbook. The Jury members are Rufus Pollock, co-Founder and Director of the Open Knowledge Foundation, and Alastair Dant, Lead Interactive Technologist for the Guardian. For more details, see the event’s webpage: Please register for the event at Eventbrite:

If you have any questions or would like to submit a challenge around this topic, please contact: sprints [at]

This event is organised by:

OKFN_EJC Supported by Mozilla

Ignite Cleanweb

- September 12, 2012 in Events, External, Labs, Meetups, WG Economics

Ignite Event in London

This Thursday in London, Cleanweb UK invites you to their first Ignite evening, hosted by Forward Technology. Come along and see a great lineup of lightning talks, all about what’s happening with sustainability and the web in the UK. From clean clouds, to home energy, to climate visualisation, there will be plenty to learn, and plenty of other attendees to get to know. It’ll be an evening to remember, so make sure you’re there! Sign up on the Cleanweb UK website. Confirmed lightning talks:
  • Loco2 vs The European Rail Booking Monster, Jon Leighton, Loco2
  • Love Thy Neighbour. Rent Their Car, Tom Wright, Whipcar
  • Solar Panels Cross The Chasm, Jason Neylon, uSwitch
  • Weaponising Environmentalism, Chris Adams, AMEE
  • Energy Saving Behaviour – The Motivation Challenge, Paul Tanner, Virtual Technologies
  • Good Food, For Everyone, Forever. Easy, Right?, Ed Dowding, Sustaination
  • The Open Energy Monitor Project, Glyn Hudson & Tristan Lea, OpenEnergyMonitor
  • The Carbon Map, Robin Houston, Carbon Map
  • Putting the Local in Global Warming with Open Data, Jack Townsend, Globe Town
  • Cleanweb in the UK, James Smith, Cleanweb UK
and more…

Cleanweb Community London

There is a movement growing. Bit by bit, developers are using the power of the web to make our world more sustainable. Whether by improving the way we travel, the way we eat, or the way we use energy, the web is making a difference. The Cleanweb movement is building a global conversation, with local chapters running hackdays and meetups to get people together. Here in the UK, we’ve been doing this longer than anyone else. Cleanweb-style projects were emerging in 2007, with 2008’s geeKyoto conference bringing together a lot of early efforts. It’s only really appropriate then that we have the most active Cleanweb community in the world, in the form of Cleanweb London. With over 150 members, it’s a great base, on which we’re building a wider Cleanweb UK movement. We’ve run a hackday, have regular meetups, and are building towards our first Ignite Cleanweb evening. This is an expanding community, made of many different projects and groups, and one that has a chance to do some real good. If you’d like to be part of it, or if you already are but didn’t know it, come along to a meetup and get involved!

OpenDataMx: Opening Up the Government, one Bit at a Time

- September 4, 2012 in Chapters, Events, External, Featured, Featured Project, Labs, Open Access, Open Content, Open Data, Open Economics, Open Spending, Policy, School of Data, Sprint / Hackday

On August 24-25, another edition of OpenDataMx took place: a 36-hour public data hackathon for the development of creative technological solutions to questions raised by civil society. This time the event was hosted by the University of Communication in Mexico City. The popularity of the event has grown: a total of 63 participants including coders and designers took part and another 58 representatives from civil society from more than ten different organisations attended the parallel conference. Government institutions participated actively as well: the Ministry of Finance and Public Credit, IFAI and the Government of the Oaxaca State. The workshops were about technology, open data and its potential in the search for technological solutions to the problems of civil society. The following proposals resulted from the discussions in the conference:
  • Construct a methodology to collectively generate open data from civil society for reuse in data events, and to demonstrate to government bodies the benefits of publishing their data openly.
  • The collective construction of a common database of information and knowledge on the topic of open data through the wiki of OpenDataMx.
After 36 hours of continuous work, each of the 23 teams presented their project, each based on the 30 datasets provided by both the government and civil society organisations. As little open government data is currently available, the joint work of civil society was essential in order to realise the hackathon. Read the Hackathon news in Spanish on the OpenDataMx blog here.
The judging panel responsible for assessing the projects was made up of recognised experts in technology, open data and its application to civil society needs. The panel consisted of Velichka Dimitrova (Open Knowledge Foundation), Matioc Anca (Fundación Ciudadano Inteligente), Eric Mill (Sunlight Foundation) and Jorge Soto from Citivox. The first three projects were awarded cash prizes ($30 000, $20 000 and $10 000 Mexican pesos respectively), allowing the teams to implement their project. An honorary mention was given to the project of the Government of the Oaxaca State and the Finance Ministry (SHCP) about the transparency of public works and citizen participation. The organisers of the hackathon also tried to link all teams to the institution or organisation relevant to their project in order to get support and advice for further steps.
The organisers (Fundar, the Centre for Analysis and Investigations; SocialTIC; Colectivo por la Transparencia; and the University of Communication) would like to thank all participants, judges and speakers for their enthusiasm and valuable support in building the citizen community. Here are some details about the winning projects:
FIRST PLACE
Name of the Project: Becalia
General Description: A platform, allowing firms and civil society to sponsor students with limited economic means to continue their higher education.
Background to the problem: There are very few students who receive a government scholarship for higher education. Additionally, few students decide to continue their education to a higher level, less than 20% in all states. The idea is to support the students who do not have the means and enable the participation of civil society.
Technology and tools used: Ruby on Rails, Javascript, CoffeeScript
Datasets: PRONABES (Programa Nacional de Becas para la Educación Superior) – National Scholarship Program for Higher Education
Team members: Adrián González, Abraham Kuri, Javier Ayala, Eduardo López
SECOND PLACE
Name of the Project: Más inversión para movernos mejor (More investment for better movement)
General Description: A small website for citizen participation, where users are asked to allocate spending to a type of urban mobility (e.g. cars, public transport or bicycles), signalling where they would like the government to invest. After assigning their preferences, users can compare them with the actual spending of the government and are offered multimedia material informing them about the topic.
Background to the problem: There is a lack of information on how the government spends the money and on the importance of sustainable urban mobility.
Technology and tools used: HTML, Javascript, PHP, Codeigniter, Bootstrap, Excel and SQL
Datasets: Base de datos del Instituto de Políticas para el Transporte y el Desarrollo -ITDP (Database of the Policy Institute for Transport and Development)
Team members: Antonio Sandoval, Jorge Cravioto, Said Villegas, Jorge Cáñez
THIRD PLACE
Name of the Project: DiputadoMx
General Description: An application that helps you find your representative by geographical area, political party, gender or commission he or she belongs to. The application is compatible with desktop and mobile technology.
Background to the problem: Lack of opportunity for citizens to communicate directly with their representatives.
Technology and tools used: HTML5, CSS3, jQuery, Python, Google App Engine, MongoDB
Datasets: Base de datos del IFE del diputados (IFE Database of MPs)
Team members: Pedro Aron Barrera Almaraz
HONORARY MENTION
Name of the Project: Obra Pública Abierta (Open Public Works)
General Description: Open Public Works is an open government tool, conceptualised and developed by the Government of the Oaxaca State and the Ministry of Finance (SHCP). The platform was created to make public works more transparent, presenting them in simpler language and encouraging citizen oversight by the user community. Open Public Works seeks to create a third-generation state transparency policy across the three levels of government. This open source platform is also meant as a public good that will be delivered to the various state governments to promote nationwide transparency, citizen participation, and accountability in the public works sector.
Background to the Problem: There is a lack of transparency in the spending of infrastructure funds by the state governments. Citizens are not familiar with basic information about public works carried out in their community, and no mechanisms for independent social audit exist. Moreover, state control bodies lack the ability to control and supervise all public works. Public participation in the control of public resources is essential to solve this situation, where society and government should work together. Additionally, there is no public policy across all three levels of government for the transparency of this sector. Finally, the public lacks tools and incentives to monitor, report and, if necessary, denounce the use of public resources in this very nontransparent government sector.
Technology and tools used: Google Maps API V.2, PHP, JavaScript and jQuery
Datasets: Data set de obra pública de la SHCP y SINFRA/SEFIN del Gobierno de Oaxaca (Datasets of public works of SHCP and SINFRA/SEFIN of the Government of Oaxaca).
Team Members: Berenice Hernández Sumano, Juan Carlos Ayuso Bautista, Tarick Gracida Sumano, José Antonio García Morales, Lorena Rivero, Roberto Moreno Herrera, Luis Fernando Ostria
For more information: Photos and content thanks to Federico Ramírez and Fundar.