Frictionless Data Case Study: OpenML

Joaquin Vanschoren - December 6, 2017 in case study, Data Package, Frictionless Data, Open Source

The Frictionless Data project is about making it effortless to transport high quality data among different tools and platforms for further analysis. We are doing this by developing a set of software, specifications, and best practices for publishing data. The heart of Frictionless Data is the Data Package specification, a containerization format for any kind of data based on existing practices for publishing open-source software. The Frictionless Data case study series highlights projects and organisations who are working with Frictionless Data specifications and software in interesting and innovative ways. OpenML is one such organization. This case study has been made possible by OpenML’s Heidi Seibold and Joaquin Vanschoren, the authors of this blog.

OpenML is an online platform and service for machine learning, whose goal is to make machine learning and data analysis simple, accessible, collaborative and open, with an optimal division of labour between computers and humans. People can upload and share data sets and questions (prediction tasks) on OpenML that they then collaboratively solve using machine learning algorithms. We offer open source tools to download data into your favourite machine learning environments and work with it. You can then upload your results back onto the platform so that others can learn from you. If you have data, you can use OpenML to get insights on what machine learning method works well to answer your question. Machine learners can use OpenML to find interesting data sets and questions that are relevant for others and also for machine learning research (e.g. learning how algorithms behave on different types of data sets).

We first heard about the Frictionless Data project through School of Data. One of the OpenML core members is also involved in School of Data and used Frictionless Data’s data packages in one of the open data workshops run by School of Data Switzerland.

Image of data set list on OpenML

OpenML currently works with tabular data in Attribute Relation File Format (ARFF), accompanied by metadata in an XML or JSON file. This is actually very similar to Frictionless Data’s Tabular Data Package specification, but with ARFF instead of CSV.
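For comparison, here is a hypothetical sketch (the dataset name and fields are invented, not taken from a real OpenML dataset) of how a dataset and its column metadata might be expressed as a Tabular Data Package descriptor, with the data stored as CSV rather than ARFF. The schema plays roughly the same role as the attribute declarations in an ARFF header or OpenML’s XML/JSON metadata.

```python
# Hypothetical Tabular Data Package descriptor for an OpenML-style dataset,
# written here as a Python dict; in practice it would be saved as
# datapackage.json alongside the CSV file it describes.
descriptor = {
    "name": "iris",
    "profile": "tabular-data-package",
    "resources": [
        {
            "name": "iris",
            "path": "iris.csv",  # CSV data file instead of ARFF
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "sepal_length", "type": "number"},
                    {"name": "sepal_width", "type": "number"},
                    {"name": "species", "type": "string"},
                ]
            },
        }
    ],
}
```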

Image of a data set overview on openML

In the coming months, we are looking to adopt the Frictionless Data specifications to improve user friendliness on OpenML. We hope to make it possible for users to upload and connect datasets in the Data Package format. This will be a great shift, as it will enable people to easily build and share machine learning models trained on any dataset in the Frictionless Data ecosystem. We firmly believe that if data packages become the go-to specification for sharing data in scientific communities, accessibility to data that’s currently ‘hidden’ in data platforms and university libraries will improve vastly, and we are keen to adopt and use the specification on OpenML in the coming months. Interested in contributing to OpenML’s quest to adopt the Data Package specification as an import and export option for data on the OpenML platform? Start here.

OKI wins funds from ODI to create Open Data publication toolkit

Jo Barratt - October 31, 2017 in Data Quality, Frictionless Data, News, ODI

Open Knowledge International (OKI) has been awarded funds by the Open Data Institute (ODI) as part of a project to enhance and increase adoption of tools and services for open data publishers in the private and public sectors, reducing barriers to publication. OKI’s focus in this programme will be to create better open data publication workflows by building on our earlier work on the Frictionless Data initiative. We will be implementing significant incremental improvements to a range of code libraries and tools that are loosely aligned around our Frictionless Data project, in which we are working on removing the friction in working with data by developing a set of tools, standards, and best practices for publishing data.

The work will be presented as part of a new toolkit which will be specifically targeted at both technical and non-technical users of data within the public sector, businesses, and the data community. We will perform additional user research in government and non-governmental contexts, design and enhance user interfaces for non-technical users, and implement integrations of tooling with existing workflows as well as work towards new ones. The reports, research and tools produced will become practical assets that can be used and added to by others, to continue to explore how data can and should work in our societies and economies.

Innovate UK, the UK’s innovation agency, is providing £6 million over three years to the ODI to advance knowledge and expertise in how data can shape the next generation of public and private services, and create economic growth. The work on improving the conditions for data publishing is one of six projects chosen by the ODI in this first year of the funding. Olivier Thereaux, Head of Technology at the ODI, said:
‘Our goals in this project are to truly understand what barriers exist to publishing high quality data quickly and at reasonable cost. We’re happy to be working with OKI, and to be building on its Frictionless Data initiative to further the development of simpler, faster, higher quality open data publishing workflows.’

On announcing the funding on 17th October, Dr Jeni Tennison, CEO at the ODI said:
‘The work we are announcing today will find the best examples of things working well, so we can share and learn from them. We will take these learnings and help businesses and governments to use them and lead by example.’
A major focus for the Product Team at Open Knowledge International over the last two years has been data quality and the automation of data processing. Data quality is arguably the greatest barrier to useful and usable open data, and we have been addressing it directly via the specifications and tooling of Frictionless Data. Our focus in this project will be to develop ways for non-technical users to employ tools for automation, reducing the potential for manual error and increasing productivity. We see speed of publication and lowering costs of publication as two areas that are directly enhanced by better tooling and workflows for quality and automation, and this is something the development of this toolkit will directly address. People are fundamental to quality, curated, open data publication workflows. However, by automating more aspects of the “publication pipeline”, we not only reduce the need for manual intervention, we also increase the speed at which open data can be published.

To keep up to date on our progress, join the Frictionless Data Discuss forum, or ask the team a direct question on the gitter channel.

eLife: Facilitating data validation & reuse with goodtables

Naomi Penfold - October 25, 2017 in Frictionless Data, goodtables, pilot

Open Knowledge International is working on the Frictionless Data project to remove the friction in working with data. We are doing this by developing a set of tools, standards, and best practices for publishing data. Through a series of pilots, we are working directly with organizations to solve real problems managing data.

eLife is a non-profit organisation with a mission to help scientists accelerate discovery by operating a platform for research communication that encourages and recognises the most responsible behaviours in science. eLife publishes important research in all areas of the life and biomedical sciences. The research is selected and evaluated by working scientists and is made freely available to all readers. In this blog, Jo Barratt, Adrià Mercader and Naomi Penfold share learnings from a pilot of Frictionless Data’s goodtables on data shared with eLife.

“Ensuring data availability alone is insufficient for the true potential of open data to be realised. The push from journals and funders at the moment is to encourage sharing, which is the first step towards reuse. The next step is to consider how we ensure actually reusable data. Any efforts to make it easier for researchers to prepare high quality reusable datasets, and to do so with minimal effort, are welcome. Further, tools that reduce the burden of reviewing datasets are of interest to data publishers.”
– Naomi Penfold, eLife

Use Case

Data sharing is an important cornerstone in the movement towards more reproducible science: it provides a means to validate assertions made, which is why many journals and funders require that research data is shared publicly and appropriately within a reasonable timeframe following a research project. At eLife, authors are encouraged to deposit their data in an appropriate external repository and to cite the datasets in their article or, where this is not possible or suitable, to publish the source data as supplements to the article itself. The data files are then stored in the eLife data store and made available through download links within the article.

Source data shared with eLife is listed under the Figures and data tab. Source: Screenshot from eLife 2017;6:e29820.

Open research data is an important asset in the record of the original research, and its reuse in different contexts helps make the research enterprise more efficient. Sharing and reuse of research data is fairly common, and researchers may reuse others’ data more readily than they might share their own. The exact nature of data reuse, however, is less clear: forty percent of Wellcome Trust-funded researchers make their data available as open access, and three-quarters report reusing existing data for validation, contextualisation, methodological development, and novel analyses, for example (Van den Eynden et al, 2016). Interestingly, a third of researchers who never publish their own data report reusing other researchers’ open data (Treadway et al, 2016), and dataset citation by researchers other than the original authors appears to be growing at least in line with greater availability of data (for gene expression microarray analysis; Piwowar & Vision, 2013). However, only a minority of citations (6% of 138) pertained to actual data reuse when citation context was manually classified in that study. Indeed, the quality of the data and its documentation were listed as important factors when Wellcome Trust-funded researchers were deciding whether to reuse a dataset or not (Van den Eynden et al, 2016).

Very few formal studies of the problems faced by researchers when attempting to reuse open data have been published. Anecdotal evidence from conversations with life scientists indicates that:
  1. The process of preparing open data for reuse — including cleaning, restructuring, and comparing multiple datasets prior to combining — is onerous and time-consuming.
    The time and effort it takes for researchers to prepare their own data for repository deposition is considered a barrier to sharing. Further, the quality of the data and its documentation are important factors when deciding whether to reuse a dataset or not. (Van den Eynden et al, 2016)
    This is why projects that improve the reusability of research data in a way that requires minimal effort on the researcher’s part are of interest within the eLife Innovation Initiative.
  2. There is also a sparsity of formal structures for secondary data users to openly collaborate with original data providers, to share the work of improving quality of open research data. Such infrastructure could provide the social and academic feedback cycle in a rapid enough timescale to fuel a rich and useful Open Data ecosystem. While the utility of goodtables does not extend to this use case, it is the first step along this pathway.
These problems are relevant not only to open data in academic research but also to government data. Just as we are moving beyond incentivising the sharing of data towards encouraging the sharing of reusable data for research, we shouldn’t only incentivise governments for raw publication and access: we need to incentivise data quality, towards actual insight and change. Without a simple, solid foundation of structural integrity, schematic consistency, and timely release, we will not meet quality standards higher up in the chain. We need essential quality assurances in the plain-text publication of data first, whether that data is published via manual or automated means.

For our Frictionless Data pilot work, we analyzed 3,910 articles, 1,085 of which had data files. The most common format was Microsoft Excel Open XML Format Spreadsheet (xlsx), with 89% of all 4,318 files being published in this format. Older versions of Excel and CSV files made up the rest.

A summary of the eLife research articles analysed as part of the Frictionless Data pilot work

In terms of validation, more than three quarters of the articles analyzed contained at least one invalid file. Following analysis of a sample of the results, the vast majority of the errors appear to be due to the data being presented in aesthetically pleasing tables, using formatting to make particular elements more visually clear, as opposed to a machine-readable format.
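As a rough sketch of the kind of check run in this pilot (assuming the goodtables Python library’s validate() entry point and its report layout; the file names are invented), validating a batch of source data files might look like this:

```python
from goodtables import validate

# Hypothetical source data files downloaded from articles.
files = ["fig1-source-data1.csv", "fig2-source-data1.xlsx"]

invalid = 0
for path in files:
    report = validate(path)  # structural checks: blank rows, extra cells, etc.
    if not report["valid"]:
        invalid += 1
        for error in report["tables"][0]["errors"]:
            print(path, error["message"])

print(f"{invalid} of {len(files)} files had at least one validation error")
```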

Data from Maddox et al. was shared in a machine-readable format (top), and adapted here to demonstrate how such data are often shared in a format that looks nice to the human reader (bottom). Source: source data from Maddox et al. eLife 2015;4:e04995, presented as-is and adapted under the Creative Commons Attribution License (CC BY 4.0).

This is not limited to the academic field of course, and the tendency to present data in spreadsheets so it is visually appealing is perhaps more prevalent in other areas – perhaps because consumers of the data are even less likely to have the data processed by machines, or because the data is collated by people with no experience of having to use it in their work. Work to improve the reusability of research data pushes towards an ideal situation where most data is both machine-readable and human-comprehensible.

In general, the eLife datasets were of better quality than, for instance, those created by government organisations, where structural issues such as missing headers and extra cells are much more common. So although the results here have been good, the community may derive greater benefit from researchers going that extra mile to make files more machine-friendly and embracing more robust data description techniques like Data Packages.

Overall, the findings from this pilot demonstrate that there are different ways of producing data for sharing: datasets are predominantly presented in an Excel file with human aesthetics in mind, rather than structured for use by a statistical program. We found few issues with the data itself beyond presentation preferences. This is encouraging and is a great starting point for venturing forward with helping researchers to make greater use of open data. You can read more about this work in the Frictionless Data Pilot writeup. Parts of this piece are cross-posted on eLife Labs.

Frictionless Data v1.0

Paul Walsh - September 5, 2017 in Frictionless Data, Open Data

Data containerisation hits v1.0! Today we’re announcing a major milestone in the Frictionless Data initiative: the official v1.0 release of the Frictionless Data specifications, including Table Schema and Data Package, along with a robust set of pre-built tooling in Python, R, JavaScript, Java, PHP and Go. Frictionless Data is a collection of lightweight specifications and tooling for the effortless collection, sharing, and validation of data. After close to 10 years of iterative work on the specifications themselves, and the last six months of fine-tuning v1.0 release candidates, we are delighted to make this release available. We want to thank our funder, the Sloan Foundation, for making it possible.

What’s inside

A brief overview of the main specifications follows. Further information is available on the specifications website.
  • Table Schema: Provides a schema for tabular data. Table Schema is well suited for use cases around handling and validating tabular data in plain text formats, and use cases that benefit from a portable, language agnostic schema format.
  • CSV Dialect: Provides a way to declare a dialect for CSV files.
  • Data Resource: Provides metadata for a data source in a consistent and machine-readable manner.
  • Data Package: Provides metadata for a collection of data sources in a consistent and machine-readable manner.
The specifications, and the code libraries that implement them, compose to form building blocks for working with data. This component-based approach lends itself well to the type of data processing work we often encounter in working with open data. It has also enabled us to build higher-level applications that specifically target common open data workflows, such as our goodtables library for data validation and our pipelines library for declarative ETL.
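As a small sketch of how two of these building blocks compose in the Python libraries (assuming the tableschema and datapackage packages and a local data.csv file):

```python
from tableschema import Table
from datapackage import Package

# Table Schema: infer field names and types from a plain CSV file.
table = Table("data.csv")
table.infer()
print(table.schema.descriptor)  # {'fields': [{'name': ..., 'type': ...}, ...]}

# Data Package: wrap one or more data files plus metadata in a descriptor.
package = Package()
package.infer("data.csv")
package.save("datapackage.json")
```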

v1.0 work

In iterating towards a v1 of the specifications, we tried to sharpen our focus on the design philosophy of this work, and not be afraid to make significant, breaking changes in the name of increased simplicity and utility. What is the design philosophy behind this work, exactly?
  • Requirements that are driven by simplicity
  • Extensibility and customisation by design
  • Metadata that is human-editable and machine-usable
  • Reuse of existing standard formats for data
  • Language-, technology- and infrastructure-agnostic
In striving for these goals, we removed much ambiguity from the specifications, cut features that were under-defined, removed and reduced various types of optionality in the way things could be specified, and even made some implicit patterns explicit by way of creating two new specifications: Data Resource and Tabular Data Resource. See the specifications website for full information.

Next steps

We are preparing to submit Table Schema, Data Resource, Data Package, Tabular Data Resource and Tabular Data Package as IETF RFCs as soon as possible. Lastly, we’ve recently produced a video to explain our work on Frictionless Data. Here, you can get a high-level overview of the concepts and philosophy behind this work, presented by our President and Co-Founder Rufus Pollock.  

Data-cards – a design pattern

Sam Smith - August 15, 2017 in Frictionless Data, Open Knowledge

Cross-posted on smth.uk
It can be useful to recognise patterns in the challenges we face, and in our responses to those challenges. In doing this, we can build a library of solutions, a useful resource when similar challenges arise in the future. When working on innovative projects, as is often the case at Open Knowledge International, creating brand new challenges is inevitable. With little or no historical reference material on how best to tackle these challenges, paying attention to your own repeatable solutions becomes even more valuable. From a user interface design point of view, these solutions come in the form of design patterns – reusable solutions to commonly occurring problems. Identifying and using design patterns can help create familiar processes for users; and by not reinventing the wheel, you can save time in production too.

In our work on Data Packages, we are introducing a new task into the world – creating those data packages. This task can be quite simple, and it will ultimately save time for people working with data. That said, there is no escaping the fact that this is a task that has never before been asked of people, one that will need to be done repeatedly, and potentially from within any number of interfaces. It has been my task of late to design some of these interfaces; I’d like to highlight one pattern that is starting to emerge – the process of describing, or adding metadata to, the columns of a data table.

I was first faced with this challenge when working on OS Packager. The objective was to present a recognisable representation of the columns, and to facilitate the addition of metadata for each of those columns. Adding the data would be relatively straightforward – a few form fields. The challenge lay in helping the user to recognise those columns from the tables they originated from. As anyone who works with spreadsheets on a regular basis will know, they aren’t often predictably or uniformly structured, meaning it is not always obvious what you’re looking at. Take them out of the familiar context of the application they were created in, and this problem could get worse. For this reason, just pulling a table header is probably not sufficient to identify a column. We wanted to provide a preview of the data, to give the best chance of it being recognisable. In addition to this, I felt it important to keep the layout as close as possible to that of, say, Excel. The simplest solution would be to take the first few rows of the table, and put a form under each column, for the user to add their metadata.

This is a good start, about as recognisable and familiar as you’re going to get. There is one obvious problem though: this could extend well beyond the edge of the user’s screen, leading to an awkward navigating experience. For an app aimed at desktop users, horizontal scrolling, in any of its forms, would be problematic. So, in the spirit of the good ol’ webpage, let’s make this thing wrap. That is to say that when an element cannot fit on the screen, it moves to a new “line”. When doing this we’ll need some vertical spacing where this new line occurs, to make it clear that one column is separate from the one above it. We then need horizontal spacing to prevent the false impression of grouping created by the rows.

The data-card was born. At the time of writing it is utilised in OS Packager, pretty closely resembling the above sketch.

Data Packagist is another application that creates data packages, and it faces the same challenges as described above.
When I got involved in this project there was already a working prototype, and I could see data cards beginning to emerge in it. It struck me that if these elements followed the same data-card pattern created for OS Packager, they could benefit in two significant ways: the layout and data preview would again allow the user to more easily recognise the columns from their spreadsheet, and the grid layout would lend itself well to drag and drop, which would mean avoiding multiple clicks (of the arrows in the screenshot above) when reordering. I incorporated this pattern into the design.

Before building this new front-end, I extracted what I believe to be the essence of the data-card from the OS Packager code, to reuse in Data Packagist and potentially future projects. While doing so I thought about the current and potential future uses, and the other functions it would be useful to perform at the same time as adding metadata. Many of these will be unique to each app, but there are a couple that I believe are likely to be recurring:
  • Reorder the columns
  • Remove / ignore a column
These features combine with those of the previous iteration to create this stand-alone data-card project. Time will tell how useful this code will be for future work, but as I was able to use it wholesale (changing little more than a colour variable) in the implementation of the Data Packagist front-end, it came at virtually no additional cost. More important than the code, however, is having this design pattern as a template, to solve this problem when it arises again in the future.

Frictionless Data: Introducing our Tool Fund Grantees

Serah Rono - July 3, 2017 in Frictionless Data

Frictionless Data is an Open Knowledge International project which started over 10 years ago as a community-driven effort of Open Knowledge Labs. Over the last 3 years, with funding from partners like the Sloan Foundation and Google, the Frictionless Data team has worked tirelessly to remove ‘friction’ from working with data. A well-defined set of specifications and standards has been published and used by organizations in the data space, our list of data packages has grown, and our tools are evolving to cater to some of the issues that people encounter while working with data.

Owing to our growing community of users and the need to extend implementation of the version 1.0 specifications to additional programming languages, we launched the Frictionless Data Tool Fund in March 2017. The one-time $5,000 grant per implementation language has so far attracted more than 70 applications from our community of exceptional developers all over the world. At this time, we have awarded four grants for the implementation of Frictionless Data libraries in R, Go, Java and PHP.

We recently asked the grantees to tell us about themselves, their work with data and what they will be doing with their grants. These posts are part of the grantee profile series, written to shine a light on Frictionless Data’s Tool Fund grantees and their work, and to let our technical community know how they can get involved.

Ori Hoch: Frictionless Data Tool Fund PHP grantee

“I would also love to have PHP developers use the core libraries to write some more high-level tools. With the availability of the PHP libraries for Frictionless Data the task of developing such plugins will be greatly simplified”

From juggling fire clubs to working with data at The Museum of The Jewish People, to how he envisions use of the Frictionless Data PHP libraries to develop high level tools, read Ori Hoch’s profile to find out what he will be working on and how you can be a part of it.

Daniel Fireman: Frictionless Data Tool Fund Go grantee

I hope to use the Tool Fund grant I received to bring Go’s performance and concurrency capabilities to data processing and to have a set of tools distributed as standalone and multi-platform binaries
Read Daniel Fireman’s profile and find out more about his work on social network platforms, how he’s used Go to improve data transparency in Brazil, the challenges he has encountered while working with data, and how he intends to address them using his Go grant.

Open Knowledge Greece: Frictionless Data Tool Fund R grantee

We are going to implement two Frictionless Data libraries in R – Table Schema for publishing and sharing tabular-style data and Data Package for describing a coherent collection of data in a single package, following the Frictionless Data specifications.
Read Open Knowledge Greece’s profile and find out more about the Open Knowledge festival they will be hosting in Thessaloniki, Greece in 2018, why they are excited to be implementing Frictionless Data libraries in R, and how you can contribute to their efforts.

Georges Labrèche: Frictionless Data Tool Fund Java Grantee

Data is messy and, for developers, cleaning and processing data from one project to another can quickly turn an awesome end-product idea into a burdensome chore. Data packages and Frictionless Data tools and libraries are important because they allow developers to focus more on the end-product itself without having to worry about heavy lifting in the data processing pipeline.
From travel, to physics and astronautics, read Georges Labrèche’s profile and find out more about his interests and work at Open Data Kosovo, as well as how you can follow and contribute to his work on Frictionless Data’s Java libraries.

Frictionless Data is a project of Open Knowledge International. Interested in knowing more? Read about the standards, data and tools, and find out how to get involved. All code developed in support of this project is hosted on the frictionlessdata organization on GitHub. Contributions are welcome. Have a question or comment? Let us know on our Discuss forum or on our Gitter chat.

Always Already Computational Reflections

Daniel Fowler - June 21, 2017 in Frictionless Data

Always Already Computational is a project bringing together a variety of different perspectives to develop “a strategic approach to developing, describing, providing access to, and encouraging reuse of collections that support computationally-driven research and teaching” in subject areas relating to library and museum collections. This post is adapted from my position statement for the initial workshop. You can find out more about the project at https://collectionsasdata.github.io.

Earlier this year, I spent two and a half days at the beautiful University of California, Santa Barbara at a workshop speaking with librarians, developers, and museum and library collection managers about data. Attendees at this workshop represented a variety of respected cultural institutions including the New York Public Library, the British Library, the Internet Archive, and others. Our task was to build a collective sense of what it means to treat library and museum “collections”—the (increasingly) digitized catalogs of their holdings—as data for analysis, art, research, and other forms of re-use. We gathered use cases and user stories in order to start the conversation on how best to publish collections for these purposes. Look for further outputs on the project website: https://collectionsasdata.github.io. For the moment, here are my thoughts on the experience and how it relates to work at Open Knowledge International, specifically Frictionless Data.

Always Already Computational

Open Access to (meta)Data

The event organizers—Thomas Padilla (University of California Santa Barbara), Laurie Allen (University of Pennsylvania), Stewart Varner (University of Pennsylvania), Sarah Potvin (Texas A&M University), Elizabeth Russey Roke (Emory University), Hannah Frost (Stanford University)—took an expansive view of who should attend. I was honored and excited to join, but decidedly new to Digital Humanities (DH) and related fields. The event served as an excellent introduction, and I now understand DH to be a set of approaches toward interrogating recorded history and culture with the power of our current tools for data analysis, visualization, and machine learning. As part of the Frictionless Data project at Open Knowledge International, we are building apps, libraries, and specifications that support the basic transport and description of datasets to aid in this kind of data-driven discovery. We are trialling this approach across a variety of fields, and are interested to determine the extent to which it can improve research using library and museum collection data.

What is library and museum collection data? Libraries and museums hold physical objects which are often (although not always) shared for public view on the stacks or during exhibits. Access to information (metadata) about these objects—and the sort of cultural and historical research dependent on such access—has naturally been somewhat technologically, geographically, and temporally restricted. Digitizing the detailed catalogues of the objects libraries and museums hold surely lowered the overhead of day-to-day administration of these objects, but also provided a secondary public benefit: sharing this same metadata on the web with a permissive license allows a greater variety of users in the public—researchers, students of history, and others—to freely interrogate our cultural heritage in a manner they choose. There are many different ways to share data on the web, of course, but they are not all equal. A low-impact, open, standards-based set of approaches to sharing collections data that incorporates a diversity of potential use cases is necessary. To answer this need, many museums are currently publishing their collection data online, with permissive licensing, through GitHub: the Tate Galleries in the UK, Cooper Hewitt, Smithsonian Design Museum and The Metropolitan Museum of Art in New York have all released their collection data in CSV (and JSON) format on this popular platform normally used for sharing code. See A Nerd’s Guide To The 2,229 Paintings At MoMA and An Excavation Of One Of The World’s Greatest Art Collections, both published by FiveThirtyEight, for examples of the kind of exploratory research enabled by sharing museum collection data in bulk, in a straightforward, user-friendly way. What exactly did they do, and what else may be needed?

Packages of Museum Data

Our current funding from the Sloan Foundation enables us to focus on this researcher use case for consuming data. Across fields, the research process is often messy, and researchers, even if they are asking the right questions, possess varying levels of skill in working with datasets to answer them. As I wrote in my position statement:
Such data, released on the Internet under open licenses, can provide an opportunity for researchers to create a new lens onto our cultural and artistic history by sparking imaginative re-use and analysis.  For organizations like museums and libraries that serve the public interest, it is important that data are provided in ways that enable the maximum number of users to easily process it.  Unfortunately, there are not always clear standards for publishing such data, and the diversity of publishing options can cause unnecessary overhead when researchers are not trained in data access/cleaning techniques.

My experience at this event, and some research beforehand, suggested that there is a spectrum of data release approaches, ranging from a basic data “dump” as conducted by the museums referenced above to more advanced, though higher-investment, approaches such as publishing data as an online service with a public “API” (Application Programming Interface). A public API can provide a consistent interface to collection metadata, as well as an ability to request only the needed records, but comes at the cost of having the nature of the analysis somewhat preordained by its design. In contrast, in the data dump approach, an entire dataset, or a coherent chunk of it, can be easier for some users to access and load directly into a tool like R (see this UK Government Digital Service post on the challenges of each approach) without needing advanced programming. As a format for this bulk download, CSV is the best choice, as MoMA reflected when releasing their collection data online:
CSV is not just the easiest way to start but probably the most accessible format for a broad audience of researchers, artists, and designers.  
This, of course, comes at the cost of a less consistent interface for the data, especially in the case of the notoriously underspecified CSV format. The README file will typically go into some narrative detail about how to best use the dataset and some expected “gotchas” (e.g. “this UTF-8 file may not work well with Excel on a Mac”). It might also list the columns in a tabular data file stored in the dataset, along with the expected types and formats for values in each column (e.g. the date_acquired column should, hopefully, contain dates in one or another international format). This information is critical for actually using the data, and the automated export process that generates the public collection dataset from the museum’s internal database may try to ensure that the data matches expectations, but bugs exist, and small errors may go unnoticed in the process.

The Data Package descriptor (described in detail on our specifications site), used in conjunction with Data Package-aware tooling, is meant to somewhat restore the consistent interface provided by an API by embedding this “schema” information with the data. This allows the user or the publisher to check that the data conforms to expectations without requiring modification of the data itself: a “packaged” CSV can still be loaded into Excel as-is (though without the benefit of the type checking enabled by the Data Package descriptor). The Carnegie Museum of Art, in releasing its collection data, follows the examples set by the Tate, the Met, MoMA, and Cooper Hewitt as described above, but opted to also include a Data Package descriptor file to help facilitate online validation of the dataset through tools such as goodtables. As tools come online for editing, validating, and transforming Data Packages, users of this dataset should be able to benefit from those, too: http://frictionlessdata.io/tools/.

We are a partner in the Always Already Computational: Collections as Data project, and as part of this work we are working with the Carnegie Museum of Art to provide a more detailed look at the process that went into the creation of the CMOA dataset, as well as sketching potential ways in which the Data Package might help enable re-use of this data. In the meantime, check out our other case studies on the use of Data Packages in fields as diverse as ecology, cell migration, and energy data: http://frictionlessdata.io/case-studies/. Also, pay your local museum or library a visit.

csv,conf,v3

Daniel Fowler - May 30, 2017 in Events, Frictionless Data, OD4D, Open Spending

The third manifestation of everyone’s favorite community conference about data—csv,conf,v3—happened earlier this May in Portland, Oregon. The conference brought together data makers/doers/hackers from various backgrounds to share knowledge and stories about data in a relaxed, convivial, alpaca-friendly (see below) environment. Several Open Knowledge International staff working across our Frictionless Data, OpenSpending, and Open Data for Development projects made the journey to Portland to help organize, give talks, and exchange stories about our lives with data. Thanks to Portland and the Eliot Center for hosting us. And, of course, thanks to the excellent keynote speakers Laurie Allen, Heather Joseph, Mike Bostock, and Angela Bassa who provided a great framing for the conference through their insightful talks. Here’s what we saw.

Talks We Gave

The first priority for the team was to present on the current state of our work and Open Knowledge International’s mission more generally.

In his talk, Continuous Data Validation for Everybody, developer Adrià Mercader updated the crowd on the launch and motivation of goodtables.io: It was a privilege to be able to present our work at one of my favourite conferences. One of the main things attendees highlight about csv,conf is how diverse it is: many different backgrounds were represented, from librarians to developers, from government workers to activists. Across many talks and discussions, the need to make published data more useful to people came up repeatedly. Specifically, how could we as a community help people publish better quality data? Our talk introducing goodtables.io presented what we think will be a dominant approach to this question: automated validation. Building on successful practices in software development like automated testing, goodtables.io integrates with the data publication process to allow publishers to identify issues early and ensure data quality is maintained over time. The talk was very well received, and many people reached out to learn more about the platform. Hopefully, we can continue the conversation to ensure that automated (frictionless) data validation becomes the standard in all data publication workflows.

David Selassie Opoku presented When Data Collection Meets Non-technical CSOs in Low-Income Areas: csv,conf was a great opportunity to share highlights of the OD4D (and School of Data) team’s data collection work. The diverse audience seemed to really appreciate insights on working with non-technical CSOs in low-income areas to carry out data collection. In addition to highlighting the lessons from the work and its potential benefit to other regions of the world, I got to connect with data literacy organisations such as Data Carpentry, who are currently expanding their work in Africa and could help foster potential data literacy training partnerships. As a team working with CSOs in low-income areas, School of Data stands to benefit from continuing conversations with data “makers” in order to present potential use cases. A clear example I cited in my talk was Kobo Toolbox, which continues to mitigate several daunting challenges of data collection through abstraction and simple user interface design. Staying in touch with the csv,conf community may highlight more such scenarios, which could lead to the development of new tools for data collection.

Paul Walsh, in his talk titled Open Data and the Question of Quality (slides), talked about lessons learned from working on a range of government data publishing projects and what we can do as citizens to demand better quality data from our governments:

Talks We Saw

Of course, we weren’t there only to present; we were there to learn from others as well. Through our Frictionless Data project, we have been lucky to be in contact with various developers and thinkers around the world, several of whom also presented talks at the conference. Eric Busboom presented Metatab, an approach to packaging metadata in spreadsheets. Jasper Heefer of Gapminder talked about DDF, a data description format and associated data pipeline tool to help us live a more fact-based existence. Bob Gradeck of the Western Pennsylvania Regional Data Center talked about data intermediaries in civic tech, a topic near and dear to our hearts here at Open Knowledge International.

Favorite Talks

Paul’s:
  • “Data in the Humanities Classroom” by Miriam Posner
  • “Our Cities, Our Data” by Kate Rabinowitz
  • “When Data Collection Meets Non-technical CSOs in Low Income Areas” by David Selassie Opoku
David’s:
  • “Empowering People By Democratizing Data Skills” by Erin Becker
  • “Teaching Quantitative and Computational Skills to Undergraduates using Jupyter Notebooks” by Brian Avery
  • “Applying Software Engineering Practices to Data Analysis” by Emil Bay
  • “Open Data Networks with Fieldkit” by Eric Buth
Jo’s:
  • “Smelly London: visualising historical smells through text-mining, geo-referencing and mapping” by Deborah Leem
  • “Open Data Networks with Fieldkit” by Eric Buth
  • “The Art and Science of Generative Nonsense” by Mouse Reeve
  • “Data Lovers in a Dangerous Time” by Brendan O’Brien

Data Tables

This was the first csv,conf to have a dedicated space for working with data hands-on. In past events, attendees left with their heads buzzing full of new ideas, tools, and domains to explore, but had to wait until returning home to try them out. This time we thought: why wait? During the talks, we ran a series of hands-on workshops where facilitators could walk through a given product and chat about the motivations, challenges, and other interesting details you might not normally get to in a talk. We also prepared several data “themes” before the conference, meant to bring people together on a specific topic around data. In the end, these themes proved a useful starting point for several of the facilitators and provided a basis for a discussion on cultural heritage data, following on from a previous workshop on the topic.

The facilitated sessions went well. Our own Adam Kariv walked through Data Package Pipelines, his ETL tool for data based on the Data Package framework. Jason Crawford demonstrated Fieldbook, a tool for easily managing a database in the browser as you would a spreadsheet. Bruno Vieira presented Bionode, going into fascinating detail on the mechanics of Node.js streams. Nokome Bentley walked through a hands-on introduction to accessible, reproducible data analysis using Stencila, a way to create interactive, data-driven documents using the language of your choice to enable reproducible research. Representatives from data.world, an Austin startup we worked with on an integration for Frictionless Data, also demonstrated uploading datasets to data.world. The final workshop was conducted by several members of the Dat team, including co-organizer Max Ogden, with a super enthusiastic crowd. Competition from the day’s talks was always going to be fierce, but it seems that many attendees found some value in the more intimate setting provided by Data Tables.

Thanks

If you were there at csv,conf in Portland, we hope you had a great time. Of course, our thanks go to the Gordon and Betty Moore Foundation and to the Sloan Foundation for enabling me and my fellow organizers John Chodacki, Max Ogden, Martin Fenner, Karthik, Elaine Wong, Danielle Robinson, Simon Vansintjan, Nate Goldman and Jo Barratt, who all put so much personal time and effort into bringing this together. Oh, and did I mention the Comma Llama Alpaca? You, um, had to be there.

Frictionless Data Case Study: data.world

Jo Barratt - April 11, 2017 in Frictionless Data

Open Knowledge International is working on the Frictionless Data project to remove the friction in working with data. We are doing this by developing a set of tools, standards, and best practices for publishing data. The heart of Frictionless Data is the Data Package standard, a containerization format for any kind of data based on existing practices for publishing open-source software. We’re curious to learn about some of the common issues users face when working with data. In our Case Study series, we are highlighting projects and organisations who are working with the Frictionless Data specifications and tooling in interesting and innovative ways. For this case study, we interviewed Bryon Jacob of data.world. More case studies can be found at http://frictionlessdata.io/case-studies.

How do you use the Frictionless Data specs and what advantages did you find in using the Data Package approach?

We deal with a great diversity of data, both in terms of content and in terms of source format – most people working with data are emailing each other spreadsheets or CSVs, and not formally defining schema or semantics for what’s contained in these data files. When data.world ingests tabular data, we “virtualize” the tables away from their source format, and build layers of type and semantic information on top of the raw data. This allows us to produce a clean Tabular Data Package for any dataset: whether the input is CSV files, Excel spreadsheets, JSON data, SQLite database files – any format that we know how to extract tabular information from – we can present it as cleaned-up CSV data with a datapackage.json that describes the schema and metadata of the contents.
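To illustrate the idea (a hypothetical sketch only, not an actual data.world export; the dataset name and fields are invented), a workbook with two sheets might come out the other side as two cleaned-up CSV resources described by a single datapackage.json along these lines:

```python
# Hypothetical datapackage.json content, shown as a Python dict, for an export
# where two spreadsheet tabs have been "virtualized" into two CSV resources.
descriptor = {
    "name": "city-budget",
    "resources": [
        {
            "name": "expenses",
            "path": "data/expenses.csv",
            "schema": {"fields": [
                {"name": "department", "type": "string"},
                {"name": "amount", "type": "number"},
                {"name": "date", "type": "date"},
            ]},
        },
        {
            "name": "revenues",
            "path": "data/revenues.csv",
            "schema": {"fields": [
                {"name": "source", "type": "string"},
                {"name": "amount", "type": "number"},
            ]},
        },
    ],
}
```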

What else would you like to see developed?

Graph data packages, or “Universal Data Packages” that can encapsulate both tabular and graph data. It would be great to be able to present tabular and graph data in the same package and develop tools that know how to use these things together. To elaborate on this, it makes a lot of sense to normalize tabular data down to clean, well-formed CSVs. For data that is more graph-like, it would also make sense to normalize it to a standard format. RDF is a well-established and standardized format, with many serialized forms that could be used interchangeably (RDF/XML, Turtle, N-Triples, or JSON-LD, for example). The metadata in the datapackage.json would be extremely minimal, since the schema for RDF data is encoded into the data file itself. It might be helpful to use the datapackage.json descriptor to catalog the standard taxonomies and ontologies that were in use, for example it would be useful to know if a file contained SKOS vocabularies or OWL classes.
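Purely as an illustration of the idea Bryon describes (this is not part of any current Frictionless Data specification; the “ontologies” property and file names below are invented), such a mixed package might be described like this:

```python
# Hypothetical descriptor combining a tabular resource with an RDF resource.
# "format": "turtle" and "ontologies" are illustrative only and not defined
# by the current specifications.
descriptor = {
    "name": "museum-collection",
    "resources": [
        {
            "name": "objects",
            "path": "objects.csv",
            "schema": {"fields": [
                {"name": "object_id", "type": "string"},
                {"name": "title", "type": "string"},
            ]},
        },
        {
            "name": "object-relations",
            "path": "relations.ttl",  # RDF serialized as Turtle
            "format": "turtle",
            "ontologies": ["SKOS", "OWL"],  # invented: catalogues vocabularies in use
        },
    ],
}
```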

What are the next things you are going to be working on yourself?

We want to continue to enrich the metadata we include in Tabular Data Packages exported from data.world, and we’re looking into using datapackage.json as an import format as well as export.

How do the Frictionless Data specifications compare to existing proprietary and nonproprietary specifications for the kind of data you work with?

data.world works with lots of data across many domains – what’s great about the Frictionless Data specs is that it’s a lightweight content standard that can be a starting point for building domain-specific content standards – it really helps with the “first mile” of standardising data and making it interoperable.

What do you think are some other potential use cases?

In a certain sense, a Tabular Data Package is sort of like an open-source, cross-platform, accessible replacement for spreadsheets that can act as a “binder” for several related tables of data. I could easily imagine web or desktop-based tools that look and function much like a traditional spreadsheet, but use Data Packages as their serialization format.

Who else do you think we should speak to?

Data science IDE (Interactive Development Environment) producers – RStudio, Rodeo (Python), Anaconda, Jupyter – anything that operates on data frames as a fundamental object type should provide first-class tool and API support for Tabular Data Packages.
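As a small sketch of what such support can look like (assuming the datapackage Python library and a descriptor at datapackage.json), loading every table of a Tabular Data Package into pandas data frames takes only a few lines:

```python
import pandas as pd
from datapackage import Package

# Load the descriptor (a local path or URL) and read each resource, with
# values cast to the types declared in the package's Table Schemas.
package = Package("datapackage.json")
frames = {
    resource.name: pd.DataFrame(resource.read(keyed=True))
    for resource in package.resources
}

for name, frame in frames.items():
    print(name, frame.shape)
```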

What should the reader do after reading this Case Study?

To read more about Data Package integration at data.world, read our post: Try This: Frictionless data.world. Sign up and start playing with data. Have a question or comment? Let us know in the forum topic for this case study.