
Frictionless Data and FAIR Research Principles

- August 14, 2018 in Data Package, Frictionless Data

In August 2018, Serah Rono will be running a Frictionless Data workshop in Copenhagen, convened by the Danish National Research Data Management Forum as part of the FAIR Across project. In October 2018, she will also run a Frictionless Data workshop at FORCE11 in Montreal, Canada. Ahead of the two workshops, and other events before the close of 2018, this blog post discusses how the Frictionless Data initiative aligns with FAIR research principles.

An integral part of evidence-based research is gathering and analysing data, which takes time and often requires skill and specialized tools. Once the work is done, reproducibility requires that research reports be shared alongside the data and software from which insights were derived and conclusions drawn, if they are shared at all. Widely lauded as a key measure of research credibility, reproducibility also makes a bold demand for openness by default in research, which in turn fosters collaboration. FAIR (findability, accessibility, interoperability and reusability) research principles are central to the open access and open research movements.
“FAIR Guiding Principles precede implementation choices, and do not suggest any specific technology, standard, or implementation-solution; moreover, the Principles are not, themselves, a standard or a specification. They act as a guide to data publishers and stewards to assist them in evaluating whether their particular implementation choices are rendering their digital research artefacts Findable, Accessible, Interoperable, and Reusable.”

Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018, doi: 10.1038/sdata.2016.18 (2016)

Data Packages in Frictionless Data as an example of FAIRness

Our Frictionless Data project aims to make it effortless to transport high-quality data among different tools and platforms for further analysis. The Data Package format is at the core of Frictionless Data, and it makes it possible to package data and attach contextual information to it before sharing it.

An example data package

Data packages are nothing without the descriptor file. The descriptor is a machine-readable JSON file that holds metadata for your collection of resources and a schema for your tabular data.
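To make the descriptor concrete, here is a minimal sketch that builds one with nothing but the Python standard library. The package name, file path, field names and licence value are all hypothetical, and the real specification allows many more properties than are shown here.

```python
import json

# A minimal, illustrative datapackage.json: package-level metadata plus a
# Table Schema for a single tabular resource.
descriptor = {
    "name": "example-countries",              # hypothetical package name
    "title": "Example country list",
    "licenses": [{"name": "CC-BY-4.0"}],      # illustrative licence entry
    "resources": [
        {
            "name": "countries",
            "path": "data/countries.csv",     # hypothetical data file
            "schema": {
                "fields": [
                    {"name": "code", "type": "string"},
                    {"name": "name", "type": "string"},
                    {"name": "population", "type": "integer"},
                ]
            },
        }
    ],
}

# Write the descriptor alongside the data it describes.
with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```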

Findability

In Data Packages, pieces of information are called resources. Each resource is referred to by name and has a globally unique identifier, with the provision to reference remote resources by URLs. Resource names and identifiers are held alongside other metadata in the descriptor file.

Accessibility

Since metadata is held in the descriptor file, it can be accessed separately from the associated data. Where resources are available online – in an archive or data platform – sharing only the descriptor file is sufficient, and data provenance is preserved for all associated resources.
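As a small sketch of what that separation means in practice, the snippet below fetches only a descriptor and inspects its metadata without downloading any of the data files it points to. It uses just the Python standard library, and the URL is a placeholder.

```python
import json
from urllib.request import urlopen

# Placeholder location of a published descriptor; no data files are fetched.
DESCRIPTOR_URL = "https://example.org/dataset/datapackage.json"

with urlopen(DESCRIPTOR_URL) as response:
    descriptor = json.load(response)

# The metadata travels in the descriptor, separately from the data itself.
print(descriptor.get("title"))
print(descriptor.get("licenses"))
for resource in descriptor.get("resources", []):
    # Each resource is referred to by name; remote data stays where it is.
    print(resource["name"], "->", resource.get("path"))
```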

Interoperability

The descriptor file is saved as JSON, a machine-readable format that can be processed easily by many different tools during data analysis. The descriptor uses accessible, shared language and has provision to add descriptions, sources, and contributor information for each resource, which makes it possible to link to other existing metadata and guarantee data provenance. It is also very extensible, and can be expanded to accommodate additional information as needed.

Reusability

The metadata held in a data package includes licensing and author information, and the format requires links back to original sources, thus ensuring data provenance. This serves as a great guide for users interested in your resources. Where licensing allows resources to be archived on different platforms, users can trace the data back to its original sources regardless of where they access it. For example, all countries of the world have unique codes attached to them: see how the Country Codes data package is represented on two different platforms, GitHub and DataHub.

With thanks to the Sloan Foundation for the new Frictionless Data for Reproducible Research grant, we will be running deep-dive workshops to expound on these concepts and identify areas for improvement and collaboration in open access and open research. We have exciting opportunities in store, which we will announce in our community channels over time.
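As a concrete illustration of the Country Codes example above, the sketch below reads the package with the datapackage-py library. It assumes the v1 Python API (Package, get_resource, read); the descriptor URL and the resource name point at the GitHub copy of the package and are shown for illustration only, so check the repository or DataHub for the current location.

```python
from datapackage import Package  # pip install datapackage

# Illustrative URL of the Country Codes descriptor on GitHub.
URL = "https://raw.githubusercontent.com/datasets/country-codes/master/datapackage.json"

package = Package(URL)

# Licensing, source and contributor metadata travel with the package.
print(package.descriptor.get("licenses"))
print(package.descriptor.get("sources"))

# Read a tabular resource by name, wherever the package happens to be hosted.
resource = package.get_resource("country-codes")  # resource name assumed
if resource is not None:
    rows = resource.read(keyed=True)
    print(len(rows), "rows read")
```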

Bonus readings

Here are some of the ways researchers have adopted Frictionless Data software in different domains over the last two years:
  • The Cell Migration and Standardisation Organisation (CMSO) uses Frictionless Data specs to package cell migration data and load it into Pandas for data analysis and creation of visualizations. Read more.
  • We collaborated with Data Management for TEDDINET project (DM4T) on a proof-of-concept pilot in which we used Frictionless Data software to address some of the data management challenges faced by DM4T. Read more.
  • Open Power System Data uses Frictionless Data specifications to make energy data available for analysis and modeling. Read more.
  • We collaborated with Pacific Northwest National Laboratory – Active Data Biology to explore the use of Frictionless Data software to generate schemas for tabular data and check the validity of metadata stored as part of a biological application on GitHub. Read more.
  • We collaborated with the UK Data Service and used Frictionless Data software to assess and report on data quality, and made a case for generating visualisations with the resulting data and metadata. Read more.
Our team is also scheduled to run Frictionless Data workshops in the coming months:
  • In Copenhagen, convened by the Danish National Research Data Management Forum as part of the FAIR Across project, in August 2018.
  • In Montreal, Canada, at FORCE11 between October 10 and 12, 2018. See the full program here and sign up here to attend the Frictionless Data workshop.

Sloan Foundation Funds Frictionless Data for Reproducible Research

- July 12, 2018 in data infrastructures, Featured, Frictionless Data

We are excited to announce that Open Knowledge International has received a grant of $750,000 from the Alfred P. Sloan Foundation for our project “Frictionless Data for Reproducible Research”. The new funding from Sloan enables us to continue work over the next three years through enhanced dissemination and training activities, as well as further iteration on the software and specifications via a range of deep pilot projects with research partners.
With Frictionless Data, we focus specifically on reducing friction around discoverability, structure, standardization and tooling – more generally, the technicalities around the preparation, validation and sharing of data – in ways that both enhance existing workflows and enable new ones, towards the express goal of minimizing the gap between data and insight. We do this by creating specifications and software that are primarily informed by reuse (of existing formats and standards), conceptual minimalism, and platform-agnostic interoperability.

Over the last two years, with support from Sloan and others, we have validated the utility of the Frictionless Data approach for the research community and found strong commonalities between our experiences of data work in the civic tech arena and the friction encountered in data-driven research. The pilots and case studies we conducted over this period have enabled us to improve our specifications and software, and to engage with a wider network of actors interested in data-driven research from fields as diverse as earth science, computational biology, archeology, and the digital humanities. Building on work going on for nearly a decade, last September we launched v1 of the Frictionless Data specifications, and we have produced core software that implements those specifications across 7 programming languages. With the new grant we will iterate on this work, as well as run additional Tool Fund activities to facilitate deeper integration of the Frictionless Data approach in a range of tools and workflows that enable reproducible research.

A core point of friction in working with data is discoverability. Having a curated collection of well-maintained datasets that are of high value to a given domain of inquiry is an important move towards increasing the quality of data-driven research. With this in mind, we will also be organising efforts to curate datasets that are of high value in the domains we work in. This high-value data will serve as a reference for how to package data with Frictionless Data specifications, and provide suitable material for producing domain-specific training materials and guides.

Finally, we will be focussing on researchers themselves and are planning a programme to recruit and train early career researchers to become trainers and evangelists of the tools in their field(s). This programme will draw lessons from years of experience running data literacy fellowships with School of Data and Panton Fellowships for Open Science. We hope to meet researchers where they are and work with them to demonstrate the effectiveness of our approach and how our tools can bring real value to their work.

Are you a researcher looking for better tooling to manage your data? Do you work at or represent an organization working on issues related to research and would like to work with us on complementary issues for which data packages are suited? Are you a developer with an idea for something we can build together? Are you a student looking to learn more about data wrangling, managing research data, or open data in general? We’d love to hear from you. If you have any other questions or comments about this initiative, please visit this topic in our forum, use the hashtag #frictionlessdata, or speak to the project team on the public Gitter channel.

The Alfred P. Sloan Foundation is a philanthropic, not-for-profit grant-making institution based in New York City.
Established in 1934 by Alfred Pritchard Sloan Jr., then-President and Chief Executive Officer of the General Motors Corporation, the Foundation makes grants in support of original research and education in science, technology, engineering, mathematics and economic performance.  

Improving your data publishing workflow with the Frictionless Data Field Guide

- March 27, 2018 in data infrastructures, Data Quality, Frictionless Data

The Frictionless Data Field Guide provides step-by-step instructions for improving data publishing workflows. The field guide introduces new ways of working informed by the Frictionless Data suite of software, which data publishers can use independently or adapt into existing personal and organisational workflows.

Data quality and automation of data processing are essential in creating useful and effective data publication workflows. Speed of publication and lowering the costs of publication are two areas that are directly enhanced by having better tooling and workflows to address quality and automation. At Open Knowledge International, we think that it is important for everybody involved in the publication of data to have access to tools that help automate and improve the quality of data, so this field guide details open data publication approaches with a focus on user-facing tools for anyone interested in publishing data.

All of the Frictionless Data tools included in this field guide are built with open data publication workflows in mind, with a focus on tabular data, and there is a high degree of flexibility for extended use cases and handling different types of open data. The software featured in this field guide is all open source, maintained by Open Knowledge International under the Frictionless Data umbrella, and designed to be modular.

The preparation and delivery of the Frictionless Data Field Guide has been made possible by the Open Data Institute, who received funding from Innovate UK to build “data infrastructure, improve data literacy, stimulate data innovation and build trust in the use of data” under the pubtools programme. Feel free to engage the Frictionless Data team and community on Gitter.

The Frictionless Data project is a set of simple specifications to address common data description and data transport issues. The overall aim is to reduce friction in working with data, and to do this by making it as easy as possible to transport data between different tools and platforms for further analysis. At the heart of Frictionless Data is the Data Package, a simple format for packaging data collections together with a schema and descriptive metadata. For over ten years, the Frictionless Data community has iterated extensively on tools and libraries that address various causes of friction in working with data, and this work culminated in the release of the v1 specifications in September 2017.
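As a rough sketch of what automated quality checking can look like inside such a publication workflow, the snippet below validates every tabular resource declared in a data package before it is published. It assumes the goodtables-py validate function and its datapackage preset, and the descriptor path is a placeholder.

```python
from goodtables import validate  # pip install goodtables

# Validate every tabular resource declared in a (placeholder) data package
# before it goes out of the door.
report = validate("datapackage.json", preset="datapackage")

if report["valid"]:
    print("All resources passed validation - safe to publish.")
else:
    print("Validation failed:", report.get("error-count"), "error(s) found.")
```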

Open Belgium 2018: “Open Communities – Smart Society”

- February 14, 2018 in Frictionless Data, OK Belgium, Open Belgium

The next edition of Open Belgium, a community-driven conference organised by Open Knowledge Belgium, is almost here! In less than four weeks, 300 industry, research, government and citizen stakeholders will gather to discuss current trends around Open Knowledge and Open Data in Belgium. Open Belgium is the ideal place to get an update on local, national and global open initiatives, as well as to share skills, expertise and ideas with like-minded data enthusiasts. It is an event where IT experts, local authorities, Open Data hackers, researchers and private companies can catch up on what is new in the field of Open Knowledge in Belgium and beyond. It’s a day where data publishers sit next to users, citizen developers and communities to network and to openly discuss the next steps in Open Knowledge and Open Data.

To make sure that you get the best out of a full day of talks, workshops, panels, discussions and, not to forget, networking opportunities, we are posting daily blog posts about all that is going to happen on the 12th of March. Check out the full programme here.

From Open Knowledge International, Serah Rono (Developer Advocate) and Vitor Baptista (Engineering Lead) will host the hackathon session “Using Frictionless Data software to turn data into insight”. OKI’s Frictionless Data (frictionlessdata.io) initiative is about making it effortless to transport quality data among different tools and platforms for further analysis. In this session, they will introduce the Open Belgium community to software that streamlines their data workflows and make a case for data quality. Participants will learn how to add metadata and create schemas for their data, validate datasets, and be part of a vibrant open source, open data community.

Do you want to be part of the open community? Attend talks from excellent speakers? Meet other open experts and interested peers? Find inspiration for your projects? Or just keep the discussion going on #OpenBelgium? Be sure to join us on the 12th of March in Louvain-la-Neuve: there are still tickets left here.

Validation for Open Data Portals: a Frictionless Data Case Study

- December 18, 2017 in case study, ckan, Data Quality, Frictionless Data, goodtables

The Frictionless Data project is about making it effortless to transport high quality data among different tools and platforms for further analysis. We are doing this by developing a set of software, specifications, and best practices for publishing data. The heart of Frictionless Data is the Data Package specification, a containerization format for any kind of data based on existing practices for publishing open-source software. Through its pilots, Frictionless Data is working directly with organisations to solve real problems managing data. The University of Pittsburgh’s Center for Urban and Social Research is one such organisation.

One of the main goals of the Frictionless Data project is to help improve data quality by providing easy-to-integrate libraries and services for data validation. We have integrated data validation seamlessly with different backends like GitHub and Amazon S3 via the online service goodtables.io, but we also wanted to explore closer integrations with other platforms. An obvious choice for that is Open Data portals. They are still one of the main forms of dissemination of Open Data, especially for governments and other organizations. They provide a single entry point to data relating to a particular region or thematic area, and give users tools to discover and access different datasets. On the backend, publishers also have tools available for the validation and publication of datasets.

Data quality varies widely across different portals, reflecting the publication processes and requirements of the hosting organizations. In general, it is difficult for users to assess the quality of the data, and there is a lack of descriptors for the actual data fields. At the publisher level, while strong emphasis has been put on metadata standards and interoperability, publishers don’t generally have the same help or guidance when dealing with data quality or description. We believe that data quality in Open Data portals can have a central place on both these fronts, user-centric and publisher-centric, and we started this pilot to showcase a possible implementation.

To field-test our implementation we chose the Western Pennsylvania Regional Data Center (WPRDC), managed by the University of Pittsburgh Center for Urban and Social Research. WPRDC is a great example of a well-managed Open Data portal, where datasets are actively maintained and the portal itself is just one component of a wider Open Data strategy. It also provides a good variety of publishers, including public sector agencies, academic institutions, and nonprofit organizations. The portal software that we are using for this pilot is CKAN, the world-leading open source software for Open Data portals (source). Open Knowledge International initially fostered the CKAN project and is now a member of the CKAN Association.

We created ckanext-validation, a CKAN extension that provides a low-level API and readily available features for data validation and reporting that can be added to any CKAN instance. This is powered by goodtables, a library developed by Open Knowledge International to support the validation of tabular datasets. The ckanext-validation extension allows users to perform data validation against any tabular resource, such as CSV or Excel files. This generates a report that is stored against a particular resource, describing issues found with the data, both at the structural level, such as missing headers and blank rows, and at the data schema level, such as wrong data types and out-of-range values.
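Underneath the extension, those reports come from goodtables itself. The sketch below shows the underlying call on a placeholder CSV file and Table Schema; it assumes the goodtables-py validate API, and the exact report layout is hedged by using dictionary lookups with defaults.

```python
from goodtables import validate  # the library that powers ckanext-validation

# Check a (placeholder) CSV against a (placeholder) Table Schema: structural
# checks such as blank rows or missing headers, and schema checks such as
# wrong data types or out-of-range values.
report = validate("data.csv", schema="schema.json")

for table in report.get("tables", []):
    for error in table.get("errors", []):
        # Each error is expected to carry a code, a message and, where
        # relevant, the row it was found on.
        print(error.get("code"), error.get("row-number"), error.get("message"))
```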
Read the technical details about this pilot study, our learnings, and the areas we have identified for further work in the coming days, here on the Frictionless Data website.

Frictionless Data Case Study: OpenML

- December 6, 2017 in case study, Data Package, Frictionless Data, Open Source

The Frictionless Data project is about making it effortless to transport high quality data among different tools and platforms for further analysis. We are doing this by developing a set of software, specifications, and best practices for publishing data. The heart of Frictionless Data is the Data Package specification, a containerization format for any kind of data based on existing practices for publishing open-source software.

The Frictionless Data case study series highlights projects and organisations who are working with Frictionless Data specifications and software in interesting and innovative ways. OpenML is one such organization. This case study has been made possible by OpenML’s Heidi Seibold and Joaquin Vanschoren, the authors of this post.

OpenML is an online platform and service for machine learning, whose goal is to make machine learning and data analysis simple, accessible, collaborative and open, with an optimal division of labour between computers and humans. People can upload and share data sets and questions (prediction tasks) on OpenML, which they then collaboratively solve using machine learning algorithms. We first heard about the Frictionless Data project through School of Data. One of the OpenML core members is also involved in School of Data and used Frictionless Data’s data packages in one of the open data workshops from School of Data Switzerland.

We offer open source tools to download data into your favourite machine learning environments and work with it. You can then upload your results back onto the platform so that others can learn from you. If you have data, you can use OpenML to get insights on what machine learning method works well to answer your question. Machine learners can use OpenML to find interesting data sets and questions that are relevant for others and also for machine learning research (e.g. learning how algorithms behave on different types of data sets).

Image of data set list on OpenML

OpenML currently works with tabular data in the Attribute Relation File Format (ARFF), accompanied by metadata in an XML or JSON file. It is actually very similar to Frictionless Data’s tabular data package specification, but with ARFF instead of CSV.

Image of a data set overview on OpenML

In the coming months, we are looking to adopt Frictionless Data specifications to improve user friendliness on OpenML. We hope to make it possible for users to upload and connect datasets in the Data Package format. This will be a great shift, because it will enable people to easily build and share machine learning models trained on any dataset in the Frictionless Data ecosystem. We firmly believe that if data packages become the go-to specification for sharing data in scientific communities, accessibility to data that’s currently ‘hidden’ in data platforms and university libraries will improve vastly, and we are keen to adopt and use the specification on OpenML in the coming months. Interested in contributing to OpenML’s quest to adopt the Data Package specification as an import and export option for data on the OpenML platform? Start here.
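As a rough sketch of the kind of packaging step described above (wrapping an existing dataset as a data package), assuming the datapackage-py v1 API (infer, commit and save) and a hypothetical directory of CSV exports:

```python
from datapackage import Package  # pip install datapackage

# Wrap a (hypothetical) directory of CSV exports as a data package so the
# data can travel together with its schema and metadata.
package = Package()
package.infer("exports/*.csv")     # guess resource names, formats and schemas
package.descriptor["title"] = "Example OpenML-style dataset"  # illustrative
package.commit()                   # apply the descriptor change
package.save("datapackage.json")
```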

OKI wins funds from ODI to create Open Data publication toolkit

- October 31, 2017 in Data Quality, Frictionless Data, News, ODI

Open Knowledge International (OKI) has been awarded funds by the Open Data Institute (ODI) as part of a project to enhance and increase adoption of tools and services for open data publishers in the private and public sectors, reducing barriers to publication. OKI’s focus in this programme will be to create better open data publication workflows by building on our earlier work on the Frictionless Data initiative.

We will be implementing significant incremental improvements to a range of code libraries and tools that are loosely aligned around our Frictionless Data project, in which we are working on removing the friction in working with data by developing a set of tools, standards, and best practices for publishing data. The work will be presented as part of a new toolkit specifically targeted at both technical and non-technical users of data within the public sector, businesses, and the data community. We will perform additional user research in government and non-governmental contexts, design and enhance user interfaces for non-technical users, and implement integrations of tooling with existing workflows as well as working towards new ones. The reports, research and tools produced will become practical assets that can be used and added to by others, to continue to explore how data can and should work in our societies and economies.

Innovate UK, the UK’s innovation agency, is providing £6 million over three years to the ODI to advance knowledge and expertise in how data can shape the next generation of public and private services, and create economic growth. The work on improving the conditions for data publishing is one of six projects chosen by the ODI in this first year of the funding. Olivier Thereaux, Head of Technology at the ODI said:
‘Our goals in this project are to truly understand what barriers exist to publishing high quality data quickly and at reasonable cost. We’re happy to be working with OKI, and to be building on its Frictionless Data initiative to further the development of simpler, faster, higher quality open data publishing workflows.’

On announcing the funding on 17th October, Dr Jeni Tennison, CEO at the ODI said:
‘The work we are announcing today will find the best examples of things working well, so we can share and learn from them. We will take these learnings and help businesses and governments to use them and lead by example.’
A major focus for the Product Team at Open Knowledge International over the last two years has been data quality and automation of data processing. Data quality is arguably the greatest barrier to useful and usable open data, and we’ve been directly addressing it via specifications and tooling in Frictionless Data. Our focus in this project will be to develop ways for non-technical users to employ tools for automation, reducing the potential for manual error and increasing productivity.

We see speed of publication and lowering the costs of publication as two areas that are directly enhanced by having better tooling and workflows to address quality and automation, and this is something the development of this toolkit will directly address. People are fundamental to quality, curated, open data publication workflows. However, by automating more aspects of the “publication pipeline”, we not only reduce the need for manual intervention, we can also increase the speed at which open data can be published.
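A minimal sketch of that kind of automation, assuming goodtables-py and a placeholder file path: a script like this could run as a scheduled job or a continuous integration step, so that invalid data stops the pipeline before anything is published.

```python
import sys

from goodtables import validate

# Gate a (placeholder) publication pipeline on data validity: exit non-zero so
# the surrounding job stops before anything reaches the portal.
report = validate("upload/data.csv")

if not report["valid"]:
    print("Data failed validation; aborting publication.")
    sys.exit(1)

print("Data is valid; continuing with publication.")
```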

To keep up to date on our progress, join the Frictionless Data Discuss forum, or ask the team a direct question on the Gitter channel.

eLife: Facilitating data validation & reuse with goodtables

- October 25, 2017 in Frictionless Data, goodtables, pilot

Open Knowledge International is working on the Frictionless Data project to remove the friction in working with data. We are doing this by developing a set of tools, standards, and best practices for publishing data. Through a series of pilots, we are working directly with organizations to solve real problems managing data.

eLife is a non-profit organisation with a mission to help scientists accelerate discovery by operating a platform for research communication that encourages and recognises the most responsible behaviours in science. eLife publishes important research in all areas of the life and biomedical sciences. The research is selected and evaluated by working scientists and is made freely available to all readers. In this blog, Jo Barrat, Adria Mercader and Naomi Penfold share learnings from a pilot of Frictionless Data’s goodtables on data shared with eLife.

“Ensuring data availability alone is insufficient for the true potential of open data to be realised. The push from journals and funders at the moment is to encourage sharing, which is the first step towards reuse. The next step is to consider how we ensure actually reusable data. Any efforts to make it easier for researchers to prepare high quality reusable datasets, and to do so with minimal effort, are welcome. Further, tools that reduce the burden of reviewing datasets are of interest to data publishers.”
– Naomi Penfold, eLife

Use Case

Data sharing is an important cornerstone in the movement towards more reproducible science: it provides a means to validate assertions made, which is why many journals and funders require that research data is shared publicly and appropriately within a reasonable timeframe following a research project. At eLife, authors are encouraged to deposit their data in an appropriate external repository and to cite the datasets in their article or, where this is not possible or suitable, to publish the source data as supplements to the article itself. The data files are then stored in the eLife data store and made available through download links within the article.

Source data shared with eLife is listed under the Figures and data tab. Source: Screenshot from eLife 2017;6:e29820.

Open research data is an important asset in the record of the original research, and its reuse in different contexts helps make the research enterprise more efficient. Sharing and reuse of research data is fairly common, and researchers may reuse others’ data more readily than they might share their own. The exact nature of data reuse, however, is less clear: forty percent of Wellcome Trust-funded researchers make their data available as open access, and three-quarters report reusing existing data for validation, contextualisation, methodological development, and novel analyses, for example (Van den Eynden et al, 2016). Interestingly, a third of researchers who never publish their own data report reusing other researchers’ open data (Treadway et al, 2016), and dataset citation by researchers other than the original authors appears to be growing at least in line with greater availability of data (for gene expression microarray analysis; Piwowar & Vision, 2013). However, only a minority of citations (6% of 138) pertained to actual data reuse when citation context was manually classified in this study. Indeed, the quality of the data and its documentation were listed as important factors when Wellcome Trust-funded researchers were deciding whether to reuse a dataset or not (Van den Eynden et al, 2016).

Very few formal studies have been published that look into the problems faced by researchers when attempting to reuse open data. Anecdotal evidence from conversations with life scientists indicates that:
  1. The process of preparing open data for reuse — including cleaning, restructuring, and comparing multiple datasets prior to combining — is onerous and time-consuming.
    The time and effort it takes for researchers to prepare their own data for repository deposition is considered a barrier to sharing. Further, the quality of the data and its documentation are important factors when deciding whether to reuse a dataset or not. (Van den Eynden et al, 2016)
    This is why projects that improve the reusability of research data in a way that requires minimal effort on the researcher’s part are of interest within the eLife Innovation Initiative.
  2. There is also a scarcity of formal structures for secondary data users to openly collaborate with original data providers and share the work of improving the quality of open research data. Such infrastructure could provide the social and academic feedback cycle on a rapid enough timescale to fuel a rich and useful Open Data ecosystem. While the utility of goodtables does not extend to this use case, it is the first step along this pathway.
These problems are relevant not only to open data in academic research but also to government data. Just as we should move beyond incentivising the sharing of data towards encouraging the sharing of reusable data for research, we shouldn’t only incentivise governments for raw publication and access: we need to incentivise data quality, towards actual insight and change. Without a simple, solid foundation of structural integrity, schematic consistency, and timely release, we will not meet quality standards higher up in the chain. We need essential quality assurances in plain text publication of data first, for data that is published via both manual and automated means.

For our Frictionless Data pilot work, we analyzed 3910 articles, 1085 of which had data files. The most common format was Microsoft Excel Open XML Format Spreadsheet (xlsx), with 89% of all 4318 files being published in this format. Older versions of Excel and CSV files made up the rest.

A summary of the eLife research articles analysed as part of the Frictionless Data pilot work

In terms of validation, more than three quarters of the articles analyzed contained at least one invalid file. Following analysis of a sample of the results, the vast majority of the errors appear to be due to the data being presented in aesthetically pleasing tables, using formatting to make particular elements more visually clear, rather than in a machine-readable format.

Data from Maddox et al. was shared in a machine-readable format (top), and adapted here to demonstrate how such data are often shared in a format that looks nice to the human reader (bottom). Source: source data from Maddox et al. eLife 2015;4:e04995, presented as is and adapted under the Creative Commons Attribution License (CC BY 4.0).

This is not limited to the academic field, of course, and the tendency to present data in spreadsheets so that it is visually appealing is perhaps more prevalent in other areas – perhaps because consumers of the data are even less likely to have the data processed by machines, or because the data is collated by people with no experience of having to use it in their work. Work to improve the reusability of research data pushes towards an ideal situation where most data is both machine-readable and human-comprehensible.

In general, the eLife datasets were of better quality than, for instance, those created by government organisations, where structural issues such as missing headers and extra cells are much more common. So although the results here have been good, the community may derive greater benefit from researchers going that extra mile to make files more machine-friendly and embracing more robust data description techniques like Data Packages.

Overall, the findings from this pilot demonstrate that there are different ways of producing data for sharing: datasets are predominantly presented in an Excel file with human aesthetics in mind, rather than structured for use by a statistical program. We found few issues with the data itself beyond presentation preferences. This is encouraging and is a great starting point for venturing forward with helping researchers to make greater use of open data. You can read more about this work in the Frictionless Data Pilot writeup. Parts of this piece are cross-posted on eLife Labs.

Frictionless Data v1.0

- September 5, 2017 in Frictionless Data, Open Data

Data Containerisation hits v1.0! Today we’re announcing a major milestone in the Frictionless Data initiative with the official v1.0 release of the Frictionless Data specifications, including Table Schema and Data Package, along with a robust set of pre-built tooling in Python, R, JavaScript, Java, PHP and Go. Frictionless Data is a collection of lightweight specifications and tooling for the effortless collection, sharing, and validation of data. After close to 10 years of iterative work on the specifications themselves, and the last 6 months of fine-tuning v1.0 release candidates, we are delighted to announce their availability today. We want to thank our funder, the Sloan Foundation, for making this release possible.

What’s inside

A brief overview of the main specifications follows. Further information is available on the specifications website.
  • Table Schema: Provides a schema for tabular data. Table Schema is well suited for use cases around handling and validating tabular data in plain text formats, and use cases that benefit from a portable, language agnostic schema format.
  • CSV Dialect: Provides a way to declare a dialect for CSV files.
  • Data Resource: Provides metadata for a data source in a consistent and machine-readable manner.
  • Data Package: Provides metadata for a collection of data sources in a consistent and machine-readable manner.
The specifications, and the code libraries that implement them, compose to form building blocks for working with data, as the sketch below illustrates. This component-based approach lends itself well to the type of data processing work we often encounter in working with open data. It has also enabled us to build higher-level applications that specifically target common open data workflows, such as our goodtables library for data validation and our pipelines library for declarative ETL.
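To make that composition concrete, here is a sketch of how the pieces nest inside a single descriptor: a Data Package wraps Data Resources, a tabular resource carries a Table Schema, and a CSV Dialect describes how its file is delimited. All names, paths and values are illustrative.

```python
# Illustrative nesting of the v1 specifications inside one descriptor.
data_package = {                    # Data Package: the outer container
    "name": "example-package",
    "resources": [
        {                           # Data Resource: one described data source
            "name": "measurements",
            "path": "data/measurements.csv",   # hypothetical file
            "dialect": {            # CSV Dialect: how the file is delimited
                "delimiter": ";",
                "header": True,
            },
            "schema": {             # Table Schema: the shape of the table
                "fields": [
                    {"name": "timestamp", "type": "datetime"},
                    {"name": "value", "type": "number"},
                ],
                "missingValues": [""],
            },
        }
    ],
}
```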

v1.0 work

In iterating towards a v1 of the specifications, we tried to sharpen our focus on the design philosophy of this work, and not be afraid to make significant, breaking changes in the name of increased simplicity and utility. What is the design philosophy behind this work, exactly?
  • Requirements that are driven by simplicity
  • Extensibility and customisation by design
  • Metadata that is human-editable and machine-usable
  • Reuse of existing standard formats for data
  • Language-, technology- and infrastructure-agnostic
In striving for these goals, we removed much ambiguity from the specifications, cut features that were under-defined, removed and reduced various types of optionality in the way things could be specified, and even made some implicit patterns explicit by way of creating two new specifications: Data Resource and Tabular Data Resource. See the specifications website for full information.

Next steps

We are preparing to submit Table Schema, Data Resource, Data Package, Tabular Data Resource and Tabular Data Package as IETF RFCs as soon as possible. Lastly, we’ve recently produced a video to explain our work on Frictionless Data. Here, you can get a high-level overview of the concepts and philosophy behind this work, presented by our President and Co-Founder Rufus Pollock.  

Data-cards – a design pattern

- August 15, 2017 in Frictionless Data, Open Knowledge

Cross-posted on smth.uk
It can be useful to recognise patterns in the challenges we face, and in our responses to those challenges. In doing this, we can build a library of solutions, a useful resource when similar challenges arise in the future. When working on innovative projects, as is often the case at Open Knowledge International, creating brand new challenges is inevitable. With little or no historical reference material on how best to tackle these challenges, paying attention to your own repeatable solutions becomes even more valuable.

From a user interface design point of view, these solutions come in the form of design patterns – reusable solutions to commonly occurring problems. Identifying and using design patterns can help create familiar processes for users; and by not reinventing the wheel, you can save time in production too.

In our work on Data Packages, we are introducing a new task into the world – creating those data packages. This task can be quite simple, and it will ultimately be time-saving for people working with data. That said, there is no escaping the fact that this is a task that has never before been asked of people, one that will need to be done repeatedly, and potentially from within any number of interfaces. It has been my task of late to design some of these interfaces; I’d like to highlight one pattern that is starting to emerge – the process of describing, or adding metadata to, the columns of a data table.

I was first faced with this challenge when working on OS Packager. The objective was to present a recognisable representation of the columns and facilitate the addition of metadata for each of those columns. Adding the data would be relatively straightforward – a few form fields. The challenge lay in helping the user to recognise those columns from the tables they originated in. As anyone who works with spreadsheets on a regular basis will know, they aren’t often predictably or uniformly structured, meaning it is not always obvious what you’re looking at. Take them out of the familiar context of the application they were created in, and this problem could get worse. For this reason, just pulling a table header is probably not sufficient to identify a column. We wanted to provide a preview of the data, to give the best chance of it being recognisable. In addition to this, I felt it important to keep the layout as close as possible to that of, say, Excel.

The simplest solution would be to take the first few rows of the table, and put a form under each column for the user to add their metadata. This is a good start, about as recognisable and familiar as you’re going to get. There is one obvious problem though: this could extend well beyond the edge of the user’s screen, leading to an awkward navigating experience. For an app aimed at desktop users, horizontal scrolling, in any of its forms, would be problematic. So, in the spirit of the good ol’ webpage, let’s make this thing wrap. That is to say that when an element cannot fit on the screen, it moves to a new “line”. When doing this we’ll need some vertical spacing where this new line occurs, to make it clear that one column is separate from the one above it. We then need horizontal spacing to prevent the false impression of grouping created by the rows.

The data-card was born. At the time of writing it is utilised in OS Packager, pretty closely resembling the above sketch.

Data Packagist is another application that creates data packages, and it faces the same challenges as described above.
When I got involved in this project there was already a working prototype, and in that prototype I saw data-cards beginning to emerge. It struck me that if these elements followed the same data-card pattern created for OS Packager, they could benefit in two significant ways: the layout and data preview would again allow the user to more easily recognise the columns from their spreadsheet, and the grid layout would lend itself well to drag and drop, which would mean avoiding multiple clicks (of the arrows in the screenshot above) when reordering. I incorporated this pattern into the design.

Before building this new front-end, I extracted what I believe to be the essence of the data-card from the OS Packager code, to reuse in Data Packagist and potentially in future projects. While doing so I thought about the current and potential future uses, and the other functions it would be useful to perform at the same time as adding metadata. Many of these will be unique to each app, but there are a couple that I believe are likely to be recurring:
  • Reorder the columns
  • Remove / ignore a column
These features combine with those of the previous iteration to create a stand-alone data-card project. Time will tell how useful this code will be for future work, but as I was able to use it wholesale (changing little more than a colour variable) in the implementation of the Data Packagist front-end, it came at virtually no additional cost. More important than the code, however, is having this design pattern as a template to solve this problem when it arises again in the future.