
Frictionless Data Case Study: OpenML

Joaquin Vanschoren - December 6, 2017 in case study, Data Package, Frictionless Data, Open Source

The Frictionless Data project is about making it effortless to transport high-quality data among different tools and platforms for further analysis. We are doing this by developing a set of software, specifications, and best practices for publishing data. The heart of Frictionless Data is the Data Package specification, a containerization format for any kind of data based on existing practices for publishing open-source software. The Frictionless Data case study series highlights projects and organisations that are working with Frictionless Data specifications and software in interesting and innovative ways. OpenML is one such organisation. This case study has been made possible by OpenML's Heidi Seibold and Joaquin Vanschoren, the authors of this blog post.

OpenML is an online platform and service for machine learning whose goal is to make machine learning and data analysis simple, accessible, collaborative, and open, with an optimal division of labour between computers and humans. People can upload and share datasets and questions (prediction tasks) on OpenML, which they then collaboratively solve using machine learning algorithms. We offer open source tools to download data into your favourite machine learning environments and work with it. You can then upload your results back onto the platform so that others can learn from you. If you have data, you can use OpenML to get insights into which machine learning methods work well to answer your question. Machine learners can use OpenML to find interesting datasets and questions that are relevant to others and to machine learning research (e.g. learning how algorithms behave on different types of datasets).

We first heard about the Frictionless Data project through School of Data. One of the OpenML core members is also involved in School of Data and used Frictionless Data's Data Packages in one of the open data workshops run by School of Data Switzerland.

Image of data set list on OpenML

OpenML currently works with tabular data in the Attribute-Relation File Format (ARFF), accompanied by metadata in an XML or JSON file. This is actually very similar to Frictionless Data's Tabular Data Package specification, but with ARFF instead of CSV.
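To show how close the two formats are, here is a minimal sketch that parses a tiny ARFF document using only the Python standard library. ARFF is essentially CSV rows preceded by an inline schema (`@attribute` declarations), much like a CSV file plus a Table Schema. The dataset and parser below are illustrative, not taken from OpenML's actual tooling.

```python
# Illustrative sketch: a minimal ARFF parser, showing that ARFF is
# roughly "CSV plus an inline schema". Hypothetical example data.
import csv
import io

ARFF = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,83,yes
rainy,70,yes
"""

def parse_arff(text):
    """Split a simple ARFF string into (attribute names, data rows)."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):   # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith('@attribute'):
            attributes.append(line.split()[1])  # attribute name
        elif lower.startswith('@data'):
            in_data = True                      # data section begins
        elif in_data:
            rows.append(next(csv.reader(io.StringIO(line))))
    return attributes, rows

names, rows = parse_arff(ARFF)
print(names)    # ['outlook', 'temperature', 'play']
print(rows[0])  # ['sunny', '85', 'no']
```

In a Tabular Data Package, the same schema information would live in a `datapackage.json` descriptor alongside a plain CSV file, which is what makes conversion between the two formats straightforward.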

Image of a data set overview on openML

In the coming months, we are looking to adopt the Frictionless Data specifications to improve user friendliness on OpenML. We hope to make it possible for users to upload and connect datasets in the Data Package format. This would be a great shift, because it would enable people to easily build and share machine learning models trained on any dataset in the Frictionless Data ecosystem. We firmly believe that if Data Packages become the go-to specification for sharing data in scientific communities, access to data that is currently 'hidden' in data platforms and university libraries will improve vastly, and we are keen to adopt and use the specification on OpenML. Interested in contributing to OpenML's quest to adopt the Data Package specification as an import and export option for data on the OpenML platform? Start here.

Google Funds Frictionless Data Initiative at Open Knowledge

Mor Rubinstein - February 1, 2016 in BigQuery, ckan, Data Package, Google, News, Open Knowledge, Open Knowledge Foundation

We are delighted to announce that Open Knowledge has received funding from Google to work on tool integration for Data Packages as part of our broader work on Frictionless Data to support the open data community.

What are Data Packages?

The funding will support a growing set of tooling around Data Packages. Data Packages provide functionality for data similar to "packaging" in software and "containerization" in shipping: a simple wrapper and basic structure for the transportation of data that significantly reduces the "friction" and challenges associated with data sharing and integration. Data Packages also support better automation in data processing, and do so without imposing major changes on the underlying data being packaged. As an example, comprehensive country codes is a Data Package which joins together standardized country information from various sources into a single CSV file. The Data Package format, at its simplest level, allows its creator to provide information describing the fields, license, and maintainer of the dataset, all in a machine-readable format. In addition to the basic Data Package format, which supports any data structure, there are other, more specialised formats: Tabular Data Package for tabular data, based on CSV, and Geo Data Package for geodata, based on GeoJSON. You can also extend Data Package with your own schemas and create topic-specific Data Packages, like Fiscal Data Package for public financial data.
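As a concrete illustration, a minimal `datapackage.json` descriptor for a Tabular Data Package might look like the following. The field names and license value here are hypothetical, and some key names vary between spec versions; the real comprehensive country codes package has many more fields.

```json
{
  "name": "country-codes",
  "title": "Comprehensive country codes",
  "licenses": [{"name": "odc-pddl"}],
  "resources": [
    {
      "name": "country-codes",
      "path": "data/country-codes.csv",
      "format": "csv",
      "schema": {
        "fields": [
          {"name": "name", "type": "string"},
          {"name": "iso_alpha2", "type": "string"}
        ]
      }
    }
  ]
}
```

Everything a consuming tool needs, including where the data lives, what the columns mean, and under what terms it may be reused, travels in this one machine-readable file next to the plain CSV.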

What will be funded?

The funding supports adding Data Package integration and support to CKAN, BigQuery, and popular open-source SQL relational databases like PostgreSQL and MySQL / MariaDB.

CKAN Integration

CKAN is an open source data management system that is used by many governments and civic organizations to streamline publishing, sharing, finding and using data. This project implements a CKAN extension so that all CKAN datasets are automatically available as Data Packages through the CKAN API. In addition, the extension ensures that the CKAN API natively accepts Tabular Data Package metadata and preserves this information on round-tripping.

BigQuery Integration

This project also creates support for import and export of Tabular Data Packages to BigQuery, Google's web service for querying massive datasets. This involves scripting and a small online service to map Tabular Data Package to BigQuery data definitions. Because Tabular Data Packages already use CSV as the data format, this work focuses on the transformation of data definitions.
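The core of that transformation is a type mapping. The sketch below is a hypothetical illustration of what such a mapping could look like, not the funded implementation: the left-hand names follow the Frictionless Table Schema spec, the right-hand names are BigQuery column types.

```python
# Hypothetical sketch: mapping Table Schema field types to BigQuery
# column types. Not the actual funded implementation.
TYPE_MAP = {
    "string":   "STRING",
    "integer":  "INTEGER",
    "number":   "FLOAT",
    "boolean":  "BOOLEAN",
    "date":     "DATE",
    "datetime": "TIMESTAMP",
}

def to_bigquery_schema(fields):
    """Translate a list of Table Schema fields into BigQuery field defs."""
    return [
        {"name": f["name"],
         "type": TYPE_MAP.get(f.get("type", "string"), "STRING")}
        for f in fields
    ]

fields = [{"name": "country", "type": "string"},
          {"name": "population", "type": "integer"}]
print(to_bigquery_schema(fields))
# [{'name': 'country', 'type': 'STRING'}, {'name': 'population', 'type': 'INTEGER'}]
```

Because the row data is already CSV, once the definitions are mapped the CSV file itself can be loaded into BigQuery unchanged.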

General SQL Integration

Finally, general SQL integration is being funded, covering key open source databases like PostgreSQL and MySQL / MariaDB. This will allow Data Packages to be used natively in an even wider variety of software than the tools listed above, namely anything that depends on these databases. These integrations move us closer to a world of "frictionless data". For more information about our vision, visit http://data.okfn.org/. If you have any questions or comments, or would like more information, please visit this topic in our OKFN Discuss forum.
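The same schema-mapping idea applies here: a Table Schema descriptor can be translated into a SQL `CREATE TABLE` statement. The helper and type names below are an illustrative sketch, not part of any released Frictionless library.

```python
# Hypothetical sketch: generating a CREATE TABLE statement from a
# Table Schema descriptor. Illustrative only.
SQL_TYPES = {"string": "TEXT", "integer": "INTEGER",
             "number": "REAL", "boolean": "BOOLEAN", "date": "DATE"}

def create_table_sql(table_name, schema):
    """Build a CREATE TABLE statement from a Table Schema 'fields' list."""
    cols = ", ".join(
        '"{}" {}'.format(f["name"], SQL_TYPES.get(f.get("type"), "TEXT"))
        for f in schema["fields"]
    )
    return 'CREATE TABLE "{}" ({});'.format(table_name, cols)

schema = {"fields": [{"name": "country", "type": "string"},
                     {"name": "population", "type": "integer"}]}
print(create_table_sql("country_codes", schema))
# CREATE TABLE "country_codes" ("country" TEXT, "population" INTEGER);
```

A real integration would also have to handle constraints, primary keys, and the dialect differences between PostgreSQL and MySQL / MariaDB, but the descriptor-to-DDL direction shown here is the heart of it.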