
Frictionless Data and FAIR Research Principles

- August 14, 2018 in Data Package, Frictionless Data

In August 2018, Serah Rono will be running a Frictionless Data workshop in Copenhagen, convened by the Danish National Research Data Management Forum as part of the FAIR Across project. In October 2018, she will also run a Frictionless Data workshop at FORCE11 in Montreal, Canada. Ahead of the two workshops, and other events before the close of 2018, this blog post discusses how the Frictionless Data initiative aligns with FAIR research principles.

An integral part of evidence-based research is gathering and analysing data, which takes time and often requires skill and specialized tools. Once the work is done, reproducibility requires that research reports be shared together with the data and software from which insights are derived and conclusions are drawn. Widely lauded as a key measure of research credibility, reproducibility also makes a bold demand for openness by default in research, which in turn fosters collaboration.

FAIR (findability, accessibility, interoperability and reusability) research principles are central to the open access and open research movements.
“FAIR Guiding Principles precede implementation choices, and do not suggest any specific technology, standard, or implementation-solution; moreover, the Principles are not, themselves, a standard or a specification. They act as a guide to data publishers and stewards to assist them in evaluating whether their particular implementation choices are rendering their digital research artefacts Findable, Accessible, Interoperable, and Reusable.”

Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018, doi:10.1038/sdata.2016.18 (2016)

Data Packages in Frictionless Data as an example of FAIRness

Our Frictionless Data project aims to make it effortless to transport high quality data among different tools & platforms for further analysis. The Data Package format is at the core of Frictionless Data, and it makes it possible to package data and attach contextual information to it before sharing it.

An example data package

Data packages are nothing without the descriptor file. This descriptor file is made available in a machine readable format, JSON, and holds metadata for your collection of resources, and a schema for your tabular data.
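To make the descriptor concrete, here is a minimal sketch of what a datapackage.json might contain, built and printed from Python. The dataset name, file path, contributor and field names are invented for illustration. The property names (name, title, licenses, sources, contributors, resources, schema) follow the Data Package specification as we understand it, so please check the current spec before relying on them.

```python
import json


def build_descriptor():
    """Return a small, hypothetical Tabular Data Package descriptor."""
    return {
        # Package-level metadata (all values are made up for this example).
        "name": "example-measurements",
        "title": "Example measurements",
        "licenses": [{"name": "CC0-1.0", "title": "CC0 1.0"}],
        "sources": [{"title": "Hypothetical field survey"}],
        "contributors": [{"title": "Jane Researcher", "role": "author"}],
        # Each resource points at a data file and carries a schema for it.
        "resources": [
            {
                "name": "measurements",
                "path": "data/measurements.csv",
                "schema": {
                    "fields": [
                        {"name": "site", "type": "string"},
                        {"name": "date", "type": "date"},
                        {"name": "value", "type": "number"},
                    ]
                },
            }
        ],
    }


# Serialise the descriptor to the JSON text that would be saved as datapackage.json.
descriptor_json = json.dumps(build_descriptor(), indent=2)
print(descriptor_json)
```

Saving the printed text next to the data/ folder would give a complete, if tiny, data package.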

Findability

In Data Packages, pieces of information are called resources. Each resource is referred to by name and has a globally unique identifier, with the provision to reference remote resources by URLs. Resource names and identifiers are held alongside other metadata in the descriptor file.

Accessibility

Since metadata is held in the descriptor file, it can be accessed separately from the associated data. Where resources are available online, in an archive or on a data platform, sharing only the descriptor file is sufficient, and data provenance is guaranteed for all associated resources.
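As a rough illustration of reading metadata without touching the data itself, the sketch below parses a descriptor and lists each resource by name, noting whether its path is a local file or a remote URL. The descriptor text, resource names and example.org URL are hypothetical. A real workflow would read the text from a datapackage.json file.

```python
import json

# Hypothetical descriptor text; in practice this would be read from datapackage.json.
DESCRIPTOR_TEXT = """
{
  "name": "example-package",
  "resources": [
    {"name": "local-table", "path": "data/table.csv"},
    {"name": "remote-table", "path": "https://example.org/data/table.csv"}
  ]
}
"""


def summarize_resources(descriptor_text):
    """List each resource's name, location, and whether it is remote."""
    descriptor = json.loads(descriptor_text)
    summary = []
    for resource in descriptor.get("resources", []):
        path = resource.get("path", "")
        # Treat http(s) paths as remote; everything else as a local file path.
        is_remote = path.startswith(("http://", "https://"))
        summary.append((resource["name"], path, is_remote))
    return summary


for name, path, is_remote in summarize_resources(DESCRIPTOR_TEXT):
    print(name, path, "remote" if is_remote else "local")
```

No data file is opened at any point; everything printed comes from the descriptor alone.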

Interoperability

The descriptor file is saved as a JSON file, a machine-readable format that can be processed with great ease by many different tools during data analysis. The descriptor file uses accessible and shared language, and has provision to add descriptions, and information on sources and contributors for each resource, which makes it possible to link to other existing metadata and guarantee data provenance. It is also very extensible, and can be expanded to accommodate additional information as needed.

Reusability

Part of the metadata held in a data package includes licensing and author information, and there is a requirement to link back to original sources, thus ensuring data provenance. This serves as a great guide for users interested in your resources. Where licensing allows for resources to be archived on different platforms, this means that regardless of where users access this data from, they will be able to trace back to original sources of the data as needed. For example, all countries of the world have unique codes attached to them. See how the Country Codes data package is represented on two different platforms: GitHub and DataHub.

With thanks to the Sloan Foundation for the new Frictionless Data for Reproducible Research grant, we will be running deep dive workshops to expound on these concepts and identify areas for improvement and collaboration in open access and open research. We have exciting opportunities in store, which we will announce in our community channels over time.

Bonus readings

Here are some of the ways researchers have adopted Frictionless Data software in different domains over the last two years:
  • The Cell Migration and Standardisation Organisation (CMSO) uses Frictionless Data specs to package cell migration data and load it into Pandas for data analysis and creation of visualizations. Read more.
  • We collaborated with Data Management for TEDDINET project (DM4T) on a proof-of-concept pilot in which we used Frictionless Data software to address some of the data management challenges faced by DM4T. Read more.
  • Open Power System Data uses Frictionless Data specifications to make energy data available for analysis and modeling. Read more.
  • We collaborated with Pacific Northwest National Laboratory – Active Data Biology and explored use of Frictionless Data software to generate schema for tabular data and check validity of metadata stored as part of a biological application on GitHub. Read more.
  • We collaborated with the UK Data Service and used Frictionless Data software to assess and report on data quality, and made a case for generating visualisations with ensuing data and metadata. Read more.
Our team is also scheduled to run Frictionless Data workshops in the coming months:
  • In Copenhagen, convened by the Danish National Research Data Management Forum as part of the FAIR Across project, in August 2018.
  • In Montreal, Canada, at FORCE11 between October 10 and 12, 2018. See the full program here and sign up here to attend the Frictionless Data workshop.

Frictionless Data Case Study: OpenML

- December 6, 2017 in case study, Data Package, Frictionless Data, Open Source

The Frictionless Data project is about making it effortless to transport high quality data among different tools and platforms for further analysis. We are doing this by developing a set of software, specifications, and best practices for publishing data. The heart of Frictionless Data is the Data Package specification, a containerization format for any kind of data based on existing practices for publishing open-source software.

The Frictionless Data case study series highlights projects and organisations that are working with Frictionless Data specifications and software in interesting and innovative ways. OpenML is one such organization. This case study has been made possible by OpenML’s Heidi Seibold and Joaquin Vanschoren, the authors of this blog post.

OpenML is an online platform and service for machine learning, whose goal is to make machine learning and data analysis simple, accessible, collaborative and open with an optimal division of labour between computers and humans. People can upload and share data sets and questions (prediction tasks) on OpenML that they then collaboratively solve using machine learning algorithms.

We first heard about the Frictionless Data project through School of Data. One of the OpenML core members is also involved in School of Data and used Frictionless Data’s data packages in one of the open data workshops from School of Data Switzerland.

We offer open source tools to download data into your favourite machine learning environments and work with it. You can then upload your results back onto the platform so that others can learn from you. If you have data, you can use OpenML to get insights on what machine learning method works well to answer your question. Machine learners can use OpenML to find interesting data sets and questions that are relevant for others and also for machine learning research (e.g. learning how algorithms behave on different types of data sets).

Image of data set list on OpenML

OpenML currently works with tabular data in Attribute-Relation File Format (ARFF) accompanied by metadata in an XML or JSON file. It is actually very similar to Frictionless Data’s Tabular Data Package specification, but with ARFF instead of CSV.
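To show how close the two formats are, here is a minimal sketch that reads the attribute declarations from an ARFF header and turns them into a Table Schema style field list. This is not OpenML’s converter: the type mapping is our own assumption, the weather sample is made up, and the parser ignores ARFF features such as quoted attribute names and comments.

```python
# Hypothetical ARFF snippet used only for this example.
ARFF_TEXT = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute windy {TRUE, FALSE}
@data
sunny,85,FALSE
"""

# Assumed mapping from ARFF attribute types to Table Schema field types.
ARFF_TO_TABLE_SCHEMA = {
    "numeric": "number",
    "real": "number",
    "integer": "integer",
    "string": "string",
    "date": "datetime",
}


def arff_header_to_schema(arff_text):
    """Build a Table Schema style dict from simple @attribute lines."""
    fields = []
    for line in arff_text.splitlines():
        line = line.strip()
        if line.lower().startswith("@data"):
            break  # the header ends where the data section begins
        if not line.lower().startswith("@attribute"):
            continue
        # Split into keyword, attribute name, and the remaining type declaration.
        _, name, arff_type = line.split(None, 2)
        if arff_type.startswith("{"):
            # Nominal attribute: keep the allowed values as an enum constraint.
            values = [v.strip() for v in arff_type.strip("{}").split(",")]
            fields.append({"name": name, "type": "string", "constraints": {"enum": values}})
        else:
            type_name = arff_type.split()[0].lower()
            fields.append({"name": name, "type": ARFF_TO_TABLE_SCHEMA.get(type_name, "string")})
    return {"fields": fields}


schema = arff_header_to_schema(ARFF_TEXT)
print(schema)
```

Writing the rows after @data to a CSV file and placing this schema in a descriptor would give a rough Tabular Data Package equivalent of the ARFF file.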

Image of a data set overview on OpenML

In the coming months, we are looking to adopt Frictionless Data specifications to improve user friendliness on OpenML. We hope to make it possible for users to upload and connect datasets in the Data Package format. This will be a great shift because it would enable people to easily build and share machine learning models trained on any dataset in the Frictionless Data ecosystem.

We firmly believe that if data packages become the go-to specification for sharing data in scientific communities, access to data that’s currently ‘hidden’ in data platforms and university libraries will improve vastly, and we are keen to adopt and use the specification on OpenML.

Interested in contributing to OpenML’s quest to adopt the Data Package specification as an import and export option for data on the OpenML platform? Start here.

Google Funds Frictionless Data Initiative at Open Knowledge

- February 1, 2016 in BigQuery, ckan, Data Package, Google, News, Open Knowledge, Open Knowledge Foundation

We are delighted to announce that Open Knowledge has received funding from Google to work on tool integration for Data Packages as part of our broader work on Frictionless Data to support the open data community.

What are Data Packages?

The funding will support a growing set of tooling around Data Packages. Data Packages provide functionality for data similar to “packaging” in software and “containerization” in shipping: a simple wrapper and basic structure for the transportation of data that significantly reduces the “friction” and challenges associated with data sharing and integration. Data Packages also support better automation in data processing and do so without imposing major changes on the underlying data being packaged.

As an example, comprehensive country codes is a Data Package which joins together standardized country information from various sources into a single CSV file. The Data Package format, at its simplest level, allows its creator to provide information describing the fields, license, and maintainer of the dataset, all in a machine-readable format.

In addition to the basic Data Package format, which supports any data structure, there are other, more specialised Data Package formats: Tabular Data Package for tabular data, based on CSV, and Geo Data Package for geodata, based on GeoJSON. You can also extend Data Package with your own schemas and create topic-specific Data Packages like Fiscal Data Package for public financial data.

What will be funded?

The funding supports adding Data Package integration and support to CKAN, BigQuery, and popular open-source SQL relational databases like PostgreSQL and MySQL / MariaDB.

CKAN Integration

CKAN is an open source data management system that is used by many governments and civic organizations to streamline publishing, sharing, finding and using data. This project implements a CKAN extension so that all CKAN datasets are automatically available as Data Packages through the CKAN API. In addition, the extension ensures that the CKAN API natively accepts Tabular Data Package metadata and preserves this information on round-tripping.
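The kind of mapping such an extension needs can be sketched as follows. This is not the extension’s code: the function names are ours, the example record is hypothetical, and we assume a CKAN dataset dictionary with the commonly seen name, title, notes, license_id and resources keys, mapped onto Data Package properties as we understand them.

```python
import re


def slugify(text):
    """Lowercase and hyphenate a label so it can serve as a resource name."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def ckan_to_datapackage(ckan_dataset):
    """Map a CKAN dataset dict to a Data Package style descriptor (illustrative only)."""
    descriptor = {
        "name": ckan_dataset["name"],
        "title": ckan_dataset.get("title", ""),
        "description": ckan_dataset.get("notes", ""),
        "resources": [
            {
                # Fall back to a positional name when the CKAN resource is unnamed.
                "name": slugify(res.get("name") or "resource-%d" % index),
                "path": res["url"],
                "format": (res.get("format") or "").lower(),
            }
            for index, res in enumerate(ckan_dataset.get("resources", []))
        ],
    }
    if ckan_dataset.get("license_id"):
        descriptor["licenses"] = [{"name": ckan_dataset["license_id"]}]
    return descriptor


# Hypothetical CKAN record used for illustration.
example = {
    "name": "country-codes",
    "title": "Country Codes",
    "notes": "Hypothetical CKAN record used for illustration.",
    "license_id": "odc-pddl",
    "resources": [
        {"name": "Country Codes CSV", "url": "https://example.org/country-codes.csv", "format": "CSV"}
    ],
}
package = ckan_to_datapackage(example)
print(package["resources"][0]["name"])
```

Round-tripping, as described above, would also need the reverse mapping and a place in CKAN to keep schema details that have no native CKAN field.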

BigQuery Integration

This project also creates support for import and export of Tabular Data Packages to BigQuery, Google’s web service for querying massive datasets. This involves scripting and a small online service to map Tabular Data Package to BigQuery data definitions. Because Tabular Data Packages already use CSV as the data format, this work focuses on the transformation of data definitions.
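A transformation of data definitions along these lines might look like the sketch below, which converts Table Schema fields into a BigQuery-style list of name, type and mode entries. The type mapping and the function name are our own assumptions for illustration, not the mapping the project settled on, and the example schema is made up.

```python
# Assumed mapping from Table Schema types to BigQuery column types.
TABLE_SCHEMA_TO_BIGQUERY = {
    "string": "STRING",
    "integer": "INTEGER",
    "number": "FLOAT",
    "boolean": "BOOLEAN",
    "date": "DATE",
    "datetime": "TIMESTAMP",
}


def to_bigquery_schema(table_schema):
    """Convert Table Schema fields to a BigQuery-style schema list."""
    bq_fields = []
    for field in table_schema["fields"]:
        # A 'required' constraint becomes a REQUIRED column; otherwise NULLABLE.
        required = field.get("constraints", {}).get("required", False)
        bq_fields.append({
            "name": field["name"],
            # Unknown or missing types fall back to STRING.
            "type": TABLE_SCHEMA_TO_BIGQUERY.get(field.get("type", "string"), "STRING"),
            "mode": "REQUIRED" if required else "NULLABLE",
        })
    return bq_fields


# Hypothetical schema for a country codes table.
example_schema = {
    "fields": [
        {"name": "code", "type": "string", "constraints": {"required": True}},
        {"name": "population", "type": "integer"},
        {"name": "area_km2", "type": "number"},
    ]
}
bigquery_schema = to_bigquery_schema(example_schema)
print(bigquery_schema)
```

Export from BigQuery would apply the same table in reverse, with some loss where several Table Schema types share one BigQuery type.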

General SQL Integration

Finally, general SQL integration is being funded, which would cover key open source databases like PostgreSQL and MySQL / MariaDB. This will allow data packages to be natively used in an even wider variety of software that depends on these databases, beyond the tools listed above.

These integrations move us closer to a world of “frictionless data”. For more information about our vision, visit: http://data.okfn.org/.

If you have any questions, comments or would like more information, please visit this topic in our OKFN Discuss forum.