International Data Week: From Big Data to Open Data

- October 11, 2016 in Frictionless Data, Open Data, Open Knowledge, Open Research, Open Science, Open Standards, Small Data

Report from International Data Week: Research needs to be reproducible, data needs to be reusable and Data Packages are here to help. International Data Week has come and gone. The theme this year was ‘From Big Data to Open Data: Mobilising the Data Revolution’. Weeks later, I am still digesting all the conversations and presentations (not to mention, bagels) I consumed over its course. For a non-researcher like me, it proved to be one of the most enjoyable conferences I’ve attended with an exciting diversity of ideas on display. In this post, I will reflect on our motivations for attending, what we did, what we saw, and what we took back home.

Three conferences on research data

International Data Week (11-17 September) took place in Denver, Colorado and consisted of three co-located events: SciDataCon, International Data Forum, and the Research Data Alliance (RDA) 8th Plenary. Our main motivation for attending these events was to talk directly with researchers about Frictionless Data, our project oriented around tooling for working with Data Packages, an open specification for bundling related data together using a standardized JSON-based description format. The concepts behind Frictionless Data were developed through efforts at improving workflows for publishing open government data via CKAN. Thanks to a generous grant from the Sloan Foundation, we now have the ability to take what we’ve learned in civic tech and pilot this approach within various research communities. International Data Week provided one of the best chances we’ve had so far to meet researchers attempting to answer today’s most significant challenges in managing research data. It was time well spent: over the week I absorbed interesting user stories, heard clearly defined needs, and made connections which will help drive the work we do in the months to come.

What are the barriers to sharing research data?

While our aim is to reshape how researchers share data through better tooling and specifications, we first needed to understand what non-technical factors might impede that sharing. On Monday, I had the honor to chair the second half of a session co-organized by Peter Fitch, Massimo Craglia, and Simon Cox entitled Getting the incentives right: removing social, institutional and economic barriers to data sharing. During this second part, Wouter Haak, Heidi Laine, Fiona Murphy, and Jens Klump brought their own experiences to bear on the subject of what gets in the way of data sharing in research. Mr. Klump considered various models that could explain why and under what circumstances researchers might be keen to share their data—including research being a “gift culture” where materials like data are “precious gifts” to be paid back in kind—while Ms. Laine presented a case study directly addressing a key disincentive for sharing data: fears of being “scooped” by rival researchers. One common theme that emerged across talks was the idea that making it easier to credit researchers for their data via an enabling environment for data citation might be a key factor in increasing data sharing. An emerging infrastructure for citing datasets via DOIs (Digital Object Identifiers) might be part of this. More on this later.

“…making it easier to credit researchers for their data via an enabling environment for data citation might be a key factor in increasing data sharing”

What are the existing standards for research data?

For the rest of the week, I dove into the data details as I presented at sessions on topics like “semantic enrichment, metadata and data packaging”, “Data Type Registries”, and the “Research data needs of the Photon and Neutron Science community”. These sessions proved invaluable as they put me in direct contact with actual researchers, where I learned about the existence (or in some cases, non-existence) of community standards for working with data as well as some of the persistent challenges. For example, the Photon and Neutron Science community has a well-established standard, NeXus, for storing data; however, some researchers highlighted an unmet need for a lightweight solution for packaging CSVs in a standard way. Other researchers pointed out the frustrating inability of common statistical software packages like SPSS to export data into a high-quality (e.g. with all relevant metadata), non-proprietary format as encouraged by most data management plans. And, of course, a common complaint throughout was the amount of valuable research data locked away in Excel spreadsheets with no easy way to package and publish them. These are key areas we are addressing now and in the coming months with Data Packages.

Themes and take-home messages

The motivating factor behind much of the infrastructure and standardization work presented was the growing awareness of the need to make scientific research more reproducible, with the implicit requirement that research data itself be more reusable. Fields as diverse as psychology and archaeology have been experiencing a so-called “crisis” of reproducibility. For a variety of reasons, researchers are failing to reproduce findings from their own or others’ experiments. In an effort to resolve this, concepts like persistent identifiers, controlled vocabularies, and automation played a large role in much of the current conversation I heard.

“…the growing awareness of the need to make scientific research more reproducible, with the implicit requirement that research data itself be more reusable”

Persistent Identifiers

Broadly speaking, persistent identifiers (PIDs) are an approach to creating a reference to a digital “object” that (a) stays valid over long periods of time and (b) is “actionable”, that is, machine-readable. DOIs, mentioned above and introduced in 2000, are a familiar approach to persistently identifying and citing research articles, but there is increasing interest in applying this approach at all levels of the research process from researchers themselves (through ORCID) to research artifacts and protocols, to (relevant to our interests) datasets. We are aware of the need to address this use case and, in coordination with our new Frictionless Data specs working group, we are working on an approach to identifiers on Data Packages.
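
What makes a DOI “actionable” is that it maps deterministically to a resolvable URL via the doi.org resolver. A minimal sketch (the DOI shown is a made-up placeholder, not a real dataset identifier):

```python
# A DOI becomes machine-actionable by prefixing it with the doi.org
# resolver, which redirects to the current landing page for the object.
# The DOI below is an invented placeholder for illustration only.

def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into the HTTPS URL that resolves it."""
    return "https://doi.org/" + doi

print(doi_to_url("10.1234/example.dataset.v1"))
# https://doi.org/10.1234/example.dataset.v1
```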

Controlled Vocabularies

Throughout the conference, there was an emphasis on ensuring that records in published data incorporate some idea of semantic meaning, that is, making sure that two datasets that use the same term or measurement actually refer to the same thing by enforcing the use of a shared vocabulary. Medical Subject Headings (MeSH) from the United States National Library of Medicine is a good example of a standard vocabulary that many datasets use to consistently describe biomedical information. While Data Packages currently do not support specifying this type of semantic information in a dataset, the specification is not incompatible with this approach. As an intentionally lightweight publishing format, our aim is to keep the core of the specification as simple as possible while allowing for specialized profiles that could support semantics.
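
At its simplest, a controlled vocabulary is just a mapping from variant labels onto one canonical identifier. A rough sketch of the idea (this is not Frictionless tooling, and the MeSH-style identifier is used here purely for illustration):

```python
# Sketch: reconciling two datasets that use different labels for the
# same concept by mapping both onto a shared controlled vocabulary.
# The MeSH-style ID below is illustrative, not a verified heading.

SHARED_VOCAB = {
    "heart attack": "D009203",          # e.g. MeSH: Myocardial Infarction
    "myocardial infarction": "D009203",
    "mi": "D009203",
}

def normalize(term: str) -> str:
    """Map a free-text label onto the shared vocabulary, if known."""
    return SHARED_VOCAB.get(term.strip().lower(), "UNMAPPED")

# Two datasets using different terms now refer to the same thing:
assert normalize("Heart attack") == normalize("Myocardial Infarction")
```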

Automation

There was a lot of talk about increasing automation around data publishing workflows. For instance, there are efforts to create “actionable” Data Management Plans that help researchers walk through describing, publishing and archiving their data. A core aim of the Frictionless Data tooling is to automate as many elements of the data management process as possible. We are looking to develop simple tools and documentation for preparing datasets and defining schemas for different types of data so that the data can, for instance, be automatically validated according to defined schemas.
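
A hedged sketch of what schema-driven validation looks like in practice (field names and casting rules are invented for illustration; real Frictionless tooling is richer than this):

```python
# Declare field types once, then check every CSV row automatically.
# The schema and data below are invented examples.
import csv
import io

SCHEMA = {"country": str, "year": int, "gdp_usd": float}

def validate(csv_text: str, schema: dict) -> list:
    """Return a list of (row_number, field, value) for every bad cell."""
    errors = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        for field, cast in schema.items():
            try:
                cast(row[field])
            except (ValueError, KeyError):
                errors.append((i, field, row.get(field)))
    return errors

data = "country,year,gdp_usd\nDE,2015,3363.4\nFR,20x5,2418.9\n"
print(validate(data, SCHEMA))   # row 2 has a bad year
```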

Making Connections

Of course, one of the major benefits of attending any conference is the chance to meet and interact with other research projects. For instance, we had great conversations with the Mackenzie DataStream project, an impressive initiative for sharing and exploring water data in the Mackenzie River Basin in Canada. The technology behind this project already uses the Data Packages specifications, so look for a case study on the work done here on the Frictionless Data site soon. There is never enough time in one conference to meet all the interesting people and explore all the potential opportunities for collaboration. If you are interested in learning more about our Frictionless Data project or would like to get involved, check out the links below. We’re always looking for new opportunities to pilot our approach. Together, hopefully, we can reduce the friction in managing research data.

Git (and Github) for Data

- July 2, 2013 in Featured, Ideas and musings, Open Data, Small Data, Technical

The ability to do “version control” for data is a big deal. There are various options but one of the most attractive is to reuse existing tools for doing this with code, like git and mercurial. This post describes a simple “data pattern” for storing and versioning data using those tools which we’ve been using for some time and found to be very effective.

Introduction

The ability to revision and version data – to store changes made and share them with others, especially in a distributed way – would be a huge benefit to the (open) data community. I’ve discussed why at some length before (see also this earlier post) but to summarize:
  • It allows effective distributed collaboration – you can take my dataset, make changes, and share those back with me (and different people can do this at once!)
  • It allows one to track provenance better (i.e. what changes came from where)
  • It allows for sharing updates and synchronizing datasets in a simple, effective way – e.g. an automated way to get last month’s GDP or employment data without pulling the whole file again
There are several ways to address the “revision control for data” problem. The approach here is to get data in a form that means we can take existing, powerful distributed version control systems designed for code, like git and mercurial, and apply them to the data. As such, the best github for data may, in fact, be github (of course, you may want to layer data-specific interfaces on top of git(hub) – this is what we do with http://data.okfn.org/). There are limitations to this approach and I discuss some of these and alternative models below. In particular, it’s best for “small (or even micro) data” – say, under 10MB or 100k rows. (One alternative model can be found in the very interesting Dat project recently started by Max Ogden – with whom I’ve talked many times on this topic). However, given the maturity and power of the tooling – and its likely evolution – and the fact that so much data is small, we think this approach is very attractive.

The Pattern

The essence of the pattern is:
  1. Store the data as line-oriented text, and specifically as CSV [1] (comma-separated values) files. “Line-oriented text” just means that individual units of the data, such as a row of a table (or an individual cell), correspond to one line [2].

  2. Use a best-of-breed (code) version control system like git or mercurial to store and manage the data.

Line-oriented text is important because it enables powerful distributed version control tools like git and mercurial to work effectively (this, in turn, is because those tools are built for code, which is (usually) line-oriented text). It’s not just version control though: there is a large and mature set of tools for managing and manipulating these types of files (from grep to Excel!). In addition to the basic pattern, there are a few optional extras you can add:

  • Store the data in GitHub (or Gitorious or Bitbucket or …) – all the examples below follow this approach
  • Turn the collection of data into a Simple Data Format data package by adding a datapackage.json file which provides a small set of essential information like the license, sources, and schema (this column is a number, this one is a string)
  • Add the scripts you used to process and manage data — that way everything is nicely together in one repository
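
For illustration, a minimal datapackage.json along the lines described above might look like this (the name, license, sources, and schema values are invented, and this is a sketch rather than a complete example of the spec):

```python
# Build a minimal datapackage.json: license, sources, and a simple
# field schema ("this column is a number, this one is a string").
# All names and values below are illustrative.
import json

datapackage = {
    "name": "example-gdp",
    "licenses": [{"id": "odc-pddl"}],
    "sources": [{"name": "World Bank"}],
    "resources": [{
        "path": "data/gdp.csv",
        "schema": {
            "fields": [
                {"name": "country", "type": "string"},
                {"name": "year", "type": "integer"},
                {"name": "gdp_usd", "type": "number"},
            ]
        },
    }],
}

print(json.dumps(datapackage, indent=2))
```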

What’s good about this approach?

The set of tools that exists for managing and manipulating line-oriented files is huge and mature. In particular, powerful distributed version control systems like git and mercurial are already extremely robust ways to do distributed, peer-to-peer collaboration around code, and this pattern takes that model and makes it applicable to data. Here are some concrete examples of why it’s good.

Provenance tracking

Git and mercurial provide a complete history of individual contributions with “simple” provenance via commit messages and diffs. [Screenshot: example of commit messages]

Peer-to-peer collaboration

Forking and pulling data allows independent contributors to work on it simultaneously. [Screenshot: timeline of pull requests]

Data review

By using git or mercurial, tools for code review can be repurposed for data review. [Screenshot: pull request screen]

Simple packaging

The repo model provides a simple way to store data, code, and metadata in a single place. [Screenshot: a repo for data]

Accessibility

This method of storing and versioning data is very low-tech. Both the format and the tools are mature and ubiquitous. For example, every spreadsheet and every relational database can handle CSV, and every unix platform has a suite of tools like grep, sed and cut that can be used on these kinds of files.
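
To illustrate, the same grep/cut-style operations can be sketched with Python’s standard csv module (the column names and data are invented for the example):

```python
# Select a column and filter rows from a CSV without any special
# tooling - the Python equivalent of a "grep | cut" pipeline.
import csv
import io

data = "country,year,gdp_usd\nDE,2015,3363.4\nFR,2015,2418.9\nDE,2014,3885.4\n"

rows = list(csv.DictReader(io.StringIO(data)))

# "grep DE" -> keep rows for one country; "cut -f3" -> keep one column
de_gdp = [r["gdp_usd"] for r in rows if r["country"] == "DE"]
print(de_gdp)   # ['3363.4', '3885.4']
```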

Examples

We’ve been using this approach for a long time: in 2005 we first stored CSVs in subversion, then in mercurial, and then, when we switched to git (and github) 3 years ago, we started storing them there. In 2011 we started the datasets organization on github, which contains a whole list of datasets managed according to the pattern above. Here are a couple of specific examples: Note: most of these examples not only show CSVs being managed in github but are also Simple Data Format data packages – see the datapackage.json they contain.

Appendix

Limitations and Alternatives

Line-oriented text and its tools are, of course, far from perfect solutions to data storage and versioning. They will not work for datasets of every shape and size, and in some respects they are awkward tools for tracking and merging changes to tabular data. For example:
  • Simple actions on data stored as line-oriented text can lead to a very large changeset. For example, swapping the order of two fields (= columns) leads to a change in every single line. Given that diffs, merges, etc. are line-oriented, this is unfortunate [3].
  • It works best for smallish data (e.g. < 100k rows, < 50MB files, optimally < 5MB files). git and mercurial don’t handle big files that well, and features like diffs get more cumbersome with larger files [4].
  • It works best for data made up of lots of similar records, ideally tabular data. In order for line-oriented storage and tools to be appropriate, you need the record structure of the data to fit with the CSV line-oriented structure. The pattern is less good if your CSV is not very line-oriented (e.g. you have a lot of fields with line breaks in them), causing problems for diff and merge.
  • CSV lacks a lot of information, e.g. information on the types of fields (everything is a string). There is no way to add metadata to a CSV without compromising its simplicity or making it no longer usable as pure data. You can, however, add this kind of information in a separate file, and this is exactly what the Data Package standard provides with its datapackage.json file.
The most fundamental limitations above all arise from applying line-oriented diffs and merges to structured data whose atomic unit is not a line (it’s a cell, or a transform of some kind like swapping two columns). The first issue discussed above, where a simple change to a table is treated as a change to every line of the file, is a clear example. In a perfect world, we’d have both a convenient structure and a whole set of robust tools to support it, e.g. tools that recognize swapping two columns of a CSV as a single, simple change or that work at the level of individual cells. Fundamentally, a revision system is built around a diff format and a merge protocol. Get these right and much of the rest follows. The three basic options you have are:
  • Serialize to line-oriented text and use the great tools like git (what we’ve described above)
  • Identify the atomic structure (e.g. document) and apply diffs at that level (think CouchDB, or standard copy-on-write for an RDBMS at row level)
  • Record transforms (e.g. Refine)
At the Open Knowledge Foundation we built a system along the lines of (2) and have been involved in exploring and researching both (2) and (3) – see changes and syncing for data on dataprotocols.org. These options are definitely worth exploring – and, for example, Max Ogden, with whom I’ve had many great discussions on this topic, is currently working on an exciting project called Dat, a collaborative data tool which will use the “sleep” protocol. However, our experience so far is that the line-oriented approach beats any currently available options along those other lines (at least for smaller files!).
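
The cell-as-atomic-unit idea can be sketched in a few lines: a toy cell-level diff (not a real merge protocol) reports exactly which cells changed, where a line-oriented diff would only report that the whole line changed. All names and data here are illustrative:

```python
# A toy cell-level diff over two versions of one CSV row.

def cell_diff(old_row, new_row, header):
    """Return {field: (old, new)} for cells that differ."""
    return {h: (o, n)
            for h, o, n in zip(header, old_row, new_row)
            if o != n}

header = ["country", "year", "gdp_usd"]
print(cell_diff(["DE", "2015", "3363.4"],
                ["DE", "2015", "3365.0"], header))
# {'gdp_usd': ('3363.4', '3365.0')}
```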

data.okfn.org

Having already been storing data in github like this for several years, we recently launched http://data.okfn.org/ which is explicitly based on this approach:
  • Data is CSV stored in git repos on GitHub at https://github.com/datasets
  • All datasets are data packages with datapackage.json metadata
  • The frontend site is ultra-simple – it just provides a catalog and API and pulls data directly from github

Why line-oriented

Line-oriented text is the natural form of code and so is supported by a huge number of excellent tools. But line-oriented text is also the simplest and most parsimonious form for storing general record-oriented data—and most data can be turned into records. At its most basic, structured data requires a delimiter for fields and a delimiter for records. Comma- or tab-separated values (CSV, TSV) files are a very simple and natural implementation of this encoding. They delimit records with the most natural separation character besides the space, the line break. For a field delimiter, since spaces are too common in values to be appropriate, they naturally resort to commas or tabs. Version control systems require an atomic unit to operate on. A versioning system for data can quite usefully treat records as the atomic units. Using line-oriented text as the encoding for record-oriented data automatically gives us a record-oriented versioning system in the form of existing tools built for versioning code.

  1. Note that, by CSV, we really mean “DSV”, as the delimiter in the file does not have to be a comma. However, the row terminator should be a line break (or a carriage return plus line break). 
  2. CSVs do not always have one row to one line (it is possible to have line-breaks in a field with quoting). However, most CSVs are one-row-to-one-line. CSVs are pretty much the simplest possible structured data format you can have. 
  3. As a concrete example, the merge function will probably work quite well in reconciling two sets of changes that affect different sets of records, hence lines. Two sets of changes which each move a column will not merge well, however. 
  4. For larger data, we suggest swapping out git (and e.g. GitHub) for simple file storage like S3. Note that S3 can support basic copy-on-write versioning. However, being copy-on-write, it is comparatively very inefficient. 

What Do We Mean By Small Data

- April 26, 2013 in Featured, Ideas and musings, Labs, Open Data, Small Data

Earlier this week we published the first in a series of posts on small data: “Forget Big Data, Small Data is the Real Revolution”. In this second post in the series, we discuss small data in more detail, providing a rough definition and drawing parallels with the history of computers and software. What do we mean by “small data”? Let’s define it crudely as:
“Small data is the amount of data you can conveniently store and process on a single machine, and in particular, a high-end laptop or server”
Why a laptop? What’s interesting (and new) right now is the democratisation of data and the associated possibility of a large-scale distributed community of data wranglers working collaboratively. What matters here, then, is, crudely, the amount of data that an average data geek can handle on their own machine, their own laptop.

A key point is that the dramatic advances in computing, storage and bandwidth have far bigger implications for “small data” than for “big data”. The recent advances have increased the realm of small data – the kind of data that an individual can handle on their own hardware – far more, relatively, than they have increased the realm of “big data”. Suddenly working with significant datasets – datasets containing tens of thousands, hundreds of thousands or millions of rows – can be a mass-participation activity. (As should be clear from the above definition – and any recent history of computing – small (and big) are relative terms that change as technology advances: in 1994 a terabyte of storage cost several hundred thousand dollars; today it’s under a hundred. This also means today’s big is tomorrow’s small.)

Our situation today is similar to microcomputers in the late 70s and early 80s, or the Internet in the 90s. When microcomputers first arrived, they seemed puny in comparison to the “big” computing and “big” software then around, and there was nothing, strictly, they could do that existing computing could not. However, they were revolutionary in one fundamental way: they made computing a mass-participation activity. Similarly, the Internet was not new in the 1990s – it had been around in various forms for several decades – but it was at that point that it became available at mass scale to the average developer (and ultimately citizen). In both cases “big” kept on advancing too – be it supercomputers or high-end connectivity – but the revolution came from “small”. This (small) data revolution is just beginning.
The tools and infrastructure to enable effective collaboration and rapid scaling for small data are in their infancy, and the communities with the capacities and skills to use small data are in their early stages. Want to get involved in the small data revolution? Sign up now.
This is the second in a series of posts about the power of Small Data – follow the Open Knowledge Foundation blog, Twitter or Facebook to learn more and join the debate at #SmallData on Twitter.

Frictionless Data: making it radically easier to get stuff done with data

- April 24, 2013 in Featured, Ideas and musings, Labs, Open Data, Open Standards, Small Data, Technical

Frictionless Data is now in alpha at http://data.okfn.org/ – and we’d like you to get involved. Our mission is to make it radically easier to make data used and useful – our immediate goal is to make it as simple as possible to get the data you want into the tool of your choice. This isn’t about building a big datastore or a data management system – it’s simply saving people from repeating all the same tasks of discovering a dataset, getting it into a format they can use, cleaning it up – all before they can do anything useful with it! If you’ve ever spent the first half of a hackday just tidying up tabular data and getting it ready to use, Frictionless Data is for you. Our work is based on a few key principles:
  • Narrow focus — improve one small part of the data chain, standards and tools are limited in scope and size
  • Build for the web – use formats that are web “native” (JSON) and work naturally with HTTP (plain-text, CSV is streamable etc)
  • Distributed not centralised — designed for a distributed ecosystem (no centralized, single point of failure or dependence)
  • Work with existing tools — don’t expect people to come to you, make this work with their tools and their workflows (almost everyone in the world can open a CSV file, every language can handle CSV and JSON)
  • Simplicity (but sufficiency) — use the simplest formats possible and do the minimum in terms of metadata but be sufficient in terms of schemas and structure for tools to be effective
We believe that making it easy to get and use data and especially open data is central to creating a more connected digital data ecosystem and accelerating the creation of social and commercial value. This project is about reducing friction in getting, using and connecting data, making it radically easier to get data you need into the tool of your choice. Frictionless Data distills much of our learning over the last 7 years into some specific standards and infrastructure.

What’s the Problem?

Today, when you decide to cook, the ingredients are readily available at local supermarkets or even already in your kitchen. You don’t need to travel to a farm, collect eggs, mill the corn, cure the bacon etc – as you once would have done! Instead, thanks to standard systems of measurement, packaging, shipping (e.g. containerization) and payment, ingredients can get from the farm direct to your local shop or even your door. But with data we’re still largely stuck at this early stage: every time you want to do an analysis or build an app you have to set off around the internet to dig up data, extract it, clean it and prepare it before you can even get it into your tool and begin your work proper. What do we need to do for working with data to be like cooking today – where you get to spend your time making the cake (creating insights), not preparing and collecting the ingredients (digging up and cleaning data)? The answer: radical improvements in the “logistics” of data, associated with specialisation and standardisation. By analogy with food, we need standard systems of “measurement”, packaging, and transport so that it’s easy to get data from its original source into the application where you can start working with it.

What’s Frictionless Data going to do?

We start with an advantage: unlike physical goods, transporting digital information from one computer to another is very cheap! This means the focus can be on standardizing and simplifying the process of getting data from one application to another (or one form to another). We propose work in 3 related areas:
  • Key simple standards. For example, a standardized “packaging” of data that makes it easy to transport and use (think of the “containerization” revolution in shipping)
  • Simple tooling and integration – you should be able to get data in these standard formats into or out of Excel, R, Hadoop or whatever tool you use
  • Bootstrapping the system with essential data – we need to get the ball rolling
[Diagram: Frictionless Data components]

What’s Frictionless Data today?

1. Data

We have some exemplar datasets which are useful for a lot of people – these are:
  • High Quality & Reliable
    • We have sourced, normalized and quality checked a set of key reference datasets such as country codes, currencies, GDP and population.
  • Standard Form & Bulk Access
    • All the datasets are provided in a standardized form and can be accessed in bulk as CSV together with a simple JSON schema.
  • Versioned & Packaged
    • All data is in data packages and is versioned using git, so all changes are visible and data can be collaboratively maintained.

2. Standards

We have two simple data package formats, described as ultra-lightweight, RFC-style specifications. They build heavily on prior work. Simplicity and practicality were guiding design criteria. [Diagram: Frictionless Data package standard] Data Package: minimal wrapping, agnostic about the data it’s “packaging”, designed for extension. This flexibility is good as it can be used as a transport for pretty much any kind of data, but it also limits integration and tooling. Read the full Data Package specification. Simple Data Format (SDF): focuses on tabular data only and extends Data Package (data in Simple Data Format is a data package) by requiring data to be “good” CSVs and the provision of a simple JSON-based schema to describe them (“JSON Table Schema”). Read the full Simple Data Format specification.
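
A rough sketch of the SDF pairing (field names and data are illustrative, and this is not the authoritative spec): the schema’s field list lines up with the CSV header, so tools can check one against the other.

```python
# Pair a "good" CSV with a JSON-Table-Schema-style field list and
# verify that the header matches the declared fields.
import csv
import io

schema = {"fields": [{"name": "code", "type": "string"},
                     {"name": "name", "type": "string"}]}
csv_text = "code,name\nDE,Germany\nFR,France\n"

header = next(csv.reader(io.StringIO(csv_text)))
assert header == [f["name"] for f in schema["fields"]]
print("header matches schema")
```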

3. Tools

It’s early days for Frictionless Data, so we’re still working on this bit! But there’s a need for validators, schema generators, and all kinds of integration. You can help out – see below for details or check out the issues on github.

Doesn’t this already exist?

People have been working on data for a while – doesn’t something like this already exist? The crude answer is yes and no. People, including folks here at the Open Knowledge Foundation, have been working on this for quite some time, and there are already some parts of the solution out there. Furthermore, many of these ideas are directly borrowed from similar work in software. For example, the Data Packages spec (first version in 2007!) builds heavily on packaging projects and specifications like Debian and CommonJS. Key distinguishing features of Frictionless Data:
  • Ultra-simplicity – we want to keep things as simple as they possibly can be. This includes formats (JSON and CSV) and a focus on end-user tool integration, so people can just get the data they want into the tool they want and move on to the real task
  • Web orientation – we want an approach that fits naturally with the web
  • Focus on integration with existing tools
  • Distributed and not tied to a given tool or project – this is not about creating a central data marketplace or similar setup. It’s about creating a basic framework that would enable anyone to publish and use datasets more easily and without going through a central broker.
Many of these are shared with (and derive from) other approaches but as a whole we believe this provides an especially powerful setup.

Get Involved

This is a community-run project coordinated by the Open Knowledge Foundation as part of Open Knowledge Foundation Labs. Please get involved:
  • Spread the word! Frictionless Data is a key part of the real data revolution – follow the debate on #SmallData and share our posts so more people can get involved

Forget Big Data, Small Data is the Real Revolution

- April 22, 2013 in Featured, Ideas and musings, Labs, Open Data, Small Data

There is a lot of talk about “big data” at the moment. For example, this is Big Data Week, which will see events about big data in dozens of cities around the world. But the discussions around big data miss a much bigger and more important picture: the real opportunity is not big data, but small data. Not centralized “big iron”, but decentralized data wrangling. Not “one ring to rule them all” but “small pieces loosely joined”. Big data smacks of the centralization fads we’ve seen in each computing era. The thought that ‘hey there’s more data than we can process!’ (something which is no doubt always true year-on-year since computing began) is dressed up as the latest trend with associated technology must-haves. Meanwhile we risk overlooking the much more important story here, the real revolution, which is the mass democratisation of the means of access, storage and processing of data. This story isn’t about large organisations running parallel software on tens of thousands of servers, but about more people than ever being able to collaborate effectively around a distributed ecosystem of information, an ecosystem of small data. Just as we now find it ludicrous to talk of “big software” – as if size in itself were a measure of value – we should, and will one day, find it equally odd to talk of “big data”. Size in itself doesn’t matter – what matters is having the data, of whatever size, that helps us solve a problem or address the question we have. For many problems and questions, small data in itself is enough. The data on my household energy use, the times of local buses, government spending – these are all small data. Everything processed in Excel is small data. When Hans Rosling shows us how to understand our world through population change or literacy he’s doing it with small data. 
And when we want to scale up the way to do that is through componentized small data: by creating and integrating small data “packages” not building big data monoliths, by partitioning problems in a way that works across people and organizations, not through creating massive centralized silos. This next decade belongs to distributed models not centralized ones, to collaboration not control, and to small data not big data. Want to create the real data revolution? Come join our community creating the tools and materials to make it happen — sign up here:
This is the first in a series of posts about the power of Small Data – follow the Open Knowledge Foundation blog, Twitter or Facebook to learn more and join the debate at #SmallData on Twitter.