
Validation for Open Data Portals: a Frictionless Data Case Study

- December 18, 2017 in case study, ckan, Data Quality, Frictionless Data, goodtables

The Frictionless Data project is about making it effortless to transport high-quality data among different tools and platforms for further analysis. We are doing this by developing a set of software, specifications, and best practices for publishing data. The heart of Frictionless Data is the Data Package specification, a containerization format for any kind of data based on existing practices for publishing open-source software. Through its pilots, Frictionless Data is working directly with organisations to solve real problems managing data. The University of Pittsburgh’s Center for Urban and Social Research is one such organisation.

One of the main goals of the Frictionless Data project is to help improve data quality by providing easy-to-integrate libraries and services for data validation. We have integrated data validation seamlessly with different backends like GitHub and Amazon S3 via the online service, but we also wanted to explore closer integrations with other platforms. Open Data portals are an obvious choice: they are still one of the main channels for disseminating Open Data, especially for governments and other organizations. They provide a single entry point to data relating to a particular region or thematic area, and give users tools to discover and access different datasets. On the backend, publishers also have tools available for the validation and publication of datasets.

Data quality varies widely across different portals, reflecting the publication processes and requirements of the hosting organizations. In general, it is difficult for users to assess the quality of the data, and there is a lack of descriptors for the actual data fields. At the publisher level, while strong emphasis has been placed on metadata standards and interoperability, publishers don’t generally have the same help or guidance when dealing with data quality or description.
We believe that data quality in Open Data portals can have a central place on both these fronts, user-centric and publisher-centric, and we started this pilot to showcase a possible implementation.

To field-test our implementation we chose the Western Pennsylvania Regional Data Center (WPRDC), managed by the University of Pittsburgh Center for Urban and Social Research. WPRDC is a great example of a well-managed Open Data portal, where datasets are actively maintained and the portal itself is just one component of a wider Open Data strategy. It also hosts a good variety of publishers, including public sector agencies, academic institutions, and nonprofit organizations. The portal software that we are using for this pilot is CKAN, the world-leading open source software for Open Data portals (source). Open Knowledge International initially fostered the CKAN project and is now a member of the CKAN Association.

We created ckanext-validation, a CKAN extension that provides a low-level API and readily available features for data validation and reporting, and that can be added to any CKAN instance. It is powered by goodtables, a library developed by Open Knowledge International to support the validation of tabular datasets. The ckanext-validation extension allows users to perform data validation against any tabular resource, such as CSV or Excel files. This generates a report that is stored against a particular resource, describing issues found with the data both at the structural level, such as missing headers and blank rows, and at the data schema level, such as wrong data types and out-of-range values.

In the coming days, read the technical details about this pilot study, our learnings, and areas we have identified for further work here on the Frictionless Data website.
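To give a flavour of the structural checks involved, here is a minimal stdlib sketch of two of the error types mentioned above (blank rows and extra cells). It is an illustration only, not the goodtables implementation, and the `sample` CSV snippet is made up:

```python
import csv
import io

def structural_errors(csv_text):
    """Detect two simple structural problems that validators like
    goodtables also report: blank rows, and rows with more cells
    than the header (extra cells). A minimal sketch only."""
    errors = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return [("no-data", 0)]
    header_len = len(rows[0])
    for i, row in enumerate(rows[1:], start=2):  # 1-based, after the header
        if not any(cell.strip() for cell in row):
            errors.append(("blank-row", i))
        elif len(row) > header_len:
            errors.append(("extra-cell", i))
    return errors

sample = "name,age\nalice,34\n,\nbob,25,extra\n"
print(structural_errors(sample))  # [('blank-row', 3), ('extra-cell', 4)]
```

The real library performs many more checks (missing headers, duplicate rows, schema-level type and range validation) and returns a structured report rather than a list of tuples.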

eLife: Facilitating data validation & reuse with goodtables

- October 25, 2017 in Frictionless Data, goodtables, pilot

Open Knowledge International is working on the Frictionless Data project to remove the friction in working with data. We are doing this by developing a set of tools, standards, and best practices for publishing data. Through a series of pilots, we are working directly with organizations to solve real problems managing data.

eLife is a non-profit organisation with a mission to help scientists accelerate discovery by operating a platform for research communication that encourages and recognises the most responsible behaviours in science. eLife publishes important research in all areas of the life and biomedical sciences. The research is selected and evaluated by working scientists and is made freely available to all readers.

In this blog, Jo Barrat, Adria Mercader and Naomi Penfold share learnings from a pilot of Frictionless Data’s goodtables on data shared with eLife.

“Ensuring data availability alone is insufficient for the true potential of open data to be realised. The push from journals and funders at the moment is to encourage sharing, which is the first step towards reuse. The next step is to consider how we ensure actually reusable data. Any efforts to make it easier for researchers to prepare high quality reusable datasets, and to do so with minimal effort, are welcome. Further, tools that reduce the burden of reviewing datasets are of interest to data publishers.”
Naomi Penfold, eLife

Use Case

Data sharing is an important cornerstone in the movement towards more reproducible science: it provides a means to validate assertions made, which is why many journals and funders require that research data is shared publicly and appropriately within a reasonable timeframe following a research project. At eLife, authors are encouraged to deposit their data in an appropriate external repository and to cite the datasets in their article or, where this is not possible or suitable, publish the source data as supplements to the article itself. The data files are then stored in the eLife data store and made available through download links available within the article.

Source data shared with eLife is listed under the Figures and data tab. Source: Screenshot from eLife 2017;6:e29820.

Open research data is an important asset in the record of the original research, and its reuse in different contexts helps make the research enterprise more efficient. Sharing and reuse of research data is fairly common, and researchers may reuse others’ data more readily than they might share their own. The exact nature of data reuse, however, is less clear: forty percent of Wellcome Trust-funded researchers make their data available as open access, and three-quarters report reusing existing data for validation, contextualisation, methodological development, and novel analyses, for example (Van den Eynden et al, 2016). Interestingly, a third of researchers who never publish their own data report reusing other researchers’ open data (Treadway et al, 2016), and dataset citation by researchers other than the original authors appears to be growing at least in line with greater availability of data (for gene expression microarray analysis; Piwowar & Vision, 2013). However, only a minority of citations (6% of 138) pertained to actual data reuse when citation context was manually classified in this study. Indeed, the quality of the data and its documentation were listed as important factors when Wellcome Trust-funded researchers were deciding whether to reuse a dataset or not (Van den Eynden et al, 2016). Few formal studies of the problems researchers face when attempting to reuse open data have been published. Anecdotal evidence from conversations with life scientists indicates that:
  1. The process of preparing open data for reuse — including cleaning, restructuring, and comparing multiple datasets prior to combining — is onerous and time-consuming.
    The time and effort it takes for researchers to prepare their own data for repository deposition is considered a barrier to sharing. Further, the quality of the data and its documentation are important factors when deciding whether to reuse a dataset or not. (Van den Eynden et al, 2016)
    This is why projects that improve the reusability of research data in a way that requires minimal effort on the researcher’s part are of interest within the eLife Innovation Initiative.
  2. There is also a scarcity of formal structures through which secondary data users can openly collaborate with original data providers to share the work of improving the quality of open research data. Such infrastructure could provide the social and academic feedback cycle on a rapid enough timescale to fuel a rich and useful Open Data ecosystem. While the utility of goodtables does not extend to this use case, it is the first step along this pathway.
These problems are relevant not only to open data in academic research but also to government data. Just as research is moving beyond incentivising data sharing towards encouraging the sharing of reusable data, we shouldn’t only incentivise governments for raw publication and access: we need to incentivise data quality, towards actual insight and change. Without a simple, solid foundation of structural integrity, schematic consistency, and timely release, we will not meet quality standards higher up in the chain. We need essential quality assurances in the plain-text publication of data first, for data published via both manual and automated means.

For our Frictionless Data pilot work, we analyzed 3910 articles, 1085 of which had data files. The most common format was Microsoft Excel Open XML Format Spreadsheet (xlsx), with 89% of all 4318 files being published in this format. Older versions of Excel and CSV files made up the rest.

A summary of the eLife research articles analysed as part of the Frictionless Data pilot work
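A format tally like the one summarised above can be sketched in a few lines. The file list below is hypothetical, standing in for the 4318 files analysed in the pilot:

```python
from collections import Counter

# Hypothetical file names standing in for the pilot's 4318 data files
files = ["fig1.xlsx", "fig2.xlsx", "table_s1.csv", "data.xls", "fig3.xlsx"]

# Tally files by lowercased extension
counts = Counter(name.rsplit(".", 1)[-1].lower() for name in files)
total = sum(counts.values())

for ext, n in counts.most_common():
    print(f"{ext}: {n} ({100 * n / total:.0f}%)")
```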

In terms of validation, more than three-quarters of the articles analyzed contained at least one invalid file. Following analysis of a sample of the results, the vast majority of the errors appear to be due to the data being presented in aesthetically pleasing tables, using formatting to make particular elements more visually clear rather than machine-readable.

Data from Maddox et al. was shared in a machine-readable format (top), and is adapted here to demonstrate how such data are often shared in a format that looks nice to the human reader (bottom). Source: source data from Maddox et al. eLife 2015;4:e04995, presented as-is and adapted under the Creative Commons Attribution License (CC BY 4.0).
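As an illustration of the kind of clean-up such "pretty" files need before machine processing, here is a minimal sketch. The `pretty` string is a made-up example of a title line and blank spacer row above the real header, and the heuristic is deliberately crude:

```python
import csv
import io

# A hypothetical "pretty" export: a title line and a blank spacer row
# sit above the real header, as often seen in formatted source data files.
pretty = (
    "Figure 2 source data\n"
    "\n"
    "genotype,count\n"
    "wt,12\n"
    "mut,7\n"
)

rows = list(csv.reader(io.StringIO(pretty)))

# Drop leading non-tabular rows: keep everything from the first row
# that has more than one non-empty cell (a crude heuristic).
start = next(i for i, r in enumerate(rows)
             if sum(1 for c in r if c.strip()) > 1)
clean = [r for r in rows[start:] if any(c.strip() for c in r)]

print(clean)  # [['genotype', 'count'], ['wt', '12'], ['mut', '7']]
```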

This is not limited to the academic field, of course; the tendency to present data in visually appealing spreadsheets is perhaps even more prevalent in other areas, possibly because consumers of the data are less likely to process it by machine, or because the data is collated by people with no experience of using it in their own work. Work to improve the reusability of research data pushes towards an ideal situation where most data is both machine-readable and human-comprehensible. In general, the eLife datasets were of better quality than, for instance, those created by government organisations, where structural issues such as missing headers and extra cells are much more common. So although the results here have been good, the community may derive greater benefit from researchers going that extra mile to make files more machine-friendly and embracing more robust data description techniques like Data Packages.

Overall, the findings from this pilot demonstrate that there are different ways of producing data for sharing: datasets are predominantly presented in an Excel file with human aesthetics in mind, rather than structured for use by a statistical program. We found few issues with the data itself beyond presentation preferences. This is encouraging and is a great starting point for venturing forward with helping researchers to make greater use of open data. You can read more about this work in the Frictionless Data Pilot writeup. Parts of this piece are cross-posted on eLife Labs.
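For readers curious what describing data with a Data Package involves, here is a minimal, hypothetical `datapackage.json` descriptor following the Frictionless Data specifications; the resource path and field names are invented for illustration:

```json
{
  "name": "example-source-data",
  "resources": [
    {
      "name": "figure-2-data",
      "path": "figure2.csv",
      "schema": {
        "fields": [
          {"name": "genotype", "type": "string"},
          {"name": "count", "type": "integer", "constraints": {"minimum": 0}}
        ]
      }
    }
  ]
}
```

A descriptor like this travels alongside the data file, giving tools such as goodtables a schema to validate types and ranges against, and giving human readers the field documentation that is so often missing.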