
Improving your data publishing workflow with the Frictionless Data Field Guide

- March 27, 2018 in data infrastructures, Data Quality, Frictionless Data

The Frictionless Data Field Guide provides step-by-step instructions for improving data publishing workflows. It introduces new ways of working, informed by the Frictionless Data suite of software, that data publishers can use independently or adapt into existing personal and organisational workflows. Data quality and automation of data processing are essential to useful and effective data publication workflows, and speed of publication and lowering costs of publication are two areas directly enhanced by better tooling and workflows for quality and automation.

At Open Knowledge International, we think it is important for everybody involved in the publication of data to have access to tools that help automate and improve data quality, so this field guide details open data publication approaches with a focus on user-facing tools for anyone interested in publishing data. All of the Frictionless Data tools included in this field guide are built with open data publication workflows in mind, with a focus on tabular data, and offer a high degree of flexibility for extended use cases handling different types of open data. The software featured in this field guide is all open source, maintained by Open Knowledge International under the Frictionless Data umbrella, and designed to be modular.

The preparation and delivery of the Frictionless Data Field Guide has been made possible by the Open Data Institute, who received funding from Innovate UK to build “data infrastructure, improve data literacy, stimulate data innovation and build trust in the use of data” under the pubtools programme. Feel free to engage the Frictionless Data team and community on Gitter.

The Frictionless Data project is a set of simple specifications to address common data description and data transport issues. The overall aim is to reduce friction in working with data by making it as easy as possible to transport data between different tools and platforms for further analysis. At the heart of Frictionless Data is the Data Package, a simple format for packaging data collections together with a schema and descriptive metadata. For over ten years, the Frictionless Data community has iterated extensively on tools and libraries that address various causes of friction in working with data, and this work culminated in the release of the v1 specifications in September 2017.
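As a concrete illustration, a Data Package is just a directory of data files plus a datapackage.json descriptor. The sketch below builds a minimal descriptor using nothing but the Python standard library; the dataset name, file name and field names are invented for illustration:

```python
import json

# A minimal Data Package descriptor: a datapackage.json file that bundles
# descriptive metadata with a schema for each tabular resource.
# All names here (country-populations, countries.csv, the columns) are
# illustrative, not taken from a real dataset.
descriptor = {
    "name": "country-populations",
    "title": "Country populations (example)",
    "licenses": [{"name": "CC0-1.0"}],
    "resources": [
        {
            "name": "countries",
            "path": "countries.csv",
            "format": "csv",
            "schema": {
                "fields": [
                    {"name": "country", "type": "string"},
                    {"name": "population", "type": "integer"},
                ]
            },
        }
    ],
}

print(json.dumps(descriptor, indent=2))
```

The schema block is what lets downstream tools validate the CSV's columns without any out-of-band documentation.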

Validation for Open Data Portals: a Frictionless Data Case Study

- December 18, 2017 in case study, ckan, Data Quality, Frictionless Data, goodtables

The Frictionless Data project is about making it effortless to transport high quality data among different tools and platforms for further analysis. We are doing this by developing a set of software, specifications, and best practices for publishing data. The heart of Frictionless Data is the Data Package specification, a containerization format for any kind of data based on existing practices for publishing open-source software. Through its pilots, Frictionless Data is working directly with organisations to solve real problems managing data. The University of Pittsburgh’s Center for Urban and Social Research is one such organisation.

One of the main goals of the Frictionless Data project is to help improve data quality by providing easy-to-integrate libraries and services for data validation. We have integrated data validation seamlessly with different backends like GitHub and Amazon S3 via the online service goodtables.io, but we also wanted to explore closer integrations with other platforms. An obvious choice is Open Data portals. They are still one of the main forms of dissemination of Open Data, especially for governments and other organizations. They provide a single entry point to data relating to a particular region or thematic area, and give users tools to discover and access different datasets. On the backend, publishers also have tools available for the validation and publication of datasets.

Data quality varies widely across different portals, reflecting the publication processes and requirements of the hosting organizations. In general, it is difficult for users to assess the quality of the data, and there is a lack of descriptors for the actual data fields. At the publisher level, while strong emphasis has been put on metadata standards and interoperability, publishers don’t generally have the same help or guidance when dealing with data quality or description.

We believe that data quality in Open Data portals can have a central place on both these fronts, user-centric and publisher-centric, and we started this pilot to showcase a possible implementation. To field test our implementation we chose the Western Pennsylvania Regional Data Center (WPRDC), managed by the University of Pittsburgh Center for Urban and Social Research. WPRDC is a great example of a well managed Open Data portal, where datasets are actively maintained and the portal itself is just one component of a wider Open Data strategy. It also provides a good variety of publishers, including public sector agencies, academic institutions, and nonprofit organizations.

The portal software that we are using for this pilot is CKAN, the world-leading open source software for Open Data portals (source). Open Knowledge International initially fostered the CKAN project and is now a member of the CKAN Association. We created ckanext-validation, a CKAN extension that provides a low level API and readily available features for data validation and reporting that can be added to any CKAN instance. This is powered by goodtables, a library developed by Open Knowledge International to support the validation of tabular datasets.

The ckanext-validation extension allows users to perform data validation against any tabular resource, such as CSV or Excel files. This generates a report that is stored against a particular resource, describing issues found with the data, both at the structural level, such as missing headers and blank rows, and at the data schema level, such as wrong data types and out-of-range values. Read the technical details about this pilot study, our learnings and areas we have identified for further work in the coming days here on the Frictionless Data website.
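To give a feel for what a structural check looks like, here is a minimal sketch of two of the checks mentioned above, blank headers and blank rows, written with only the standard library. It is illustrative of the idea, not goodtables' actual implementation:

```python
import csv
import io

def structural_report(csv_text):
    """Flag two structural issues in tabular data: blank (missing) headers
    and fully blank rows. A simplified sketch of goodtables-style checks,
    not the actual goodtables code."""
    errors = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    headers = rows[0] if rows else []
    for idx, header in enumerate(headers, start=1):
        if not header.strip():
            errors.append({"code": "blank-header", "column": idx})
    for rownum, row in enumerate(rows[1:], start=2):
        if not any(cell.strip() for cell in row):
            errors.append({"code": "blank-row", "row": rownum})
    return {"valid": not errors, "errors": errors}

# Column 2 has no header, and row 3 is entirely blank:
sample = "name,,age\nalice,x,30\n,,\n"
report = structural_report(sample)
# report["errors"] flags a blank-header at column 2 and a blank-row at row 3
```

A real validation report also records the offending cell values and groups errors per resource, but the shape is the same: machine-readable error objects that a portal can render for publishers.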

OKI wins funds from ODI to create Open Data publication toolkit

- October 31, 2017 in Data Quality, Frictionless Data, News, ODI

Open Knowledge International (OKI) has been awarded funds by the Open Data Institute (ODI) as part of a project to enhance and increase adoption of tools and services for open data publishers in the private and public sectors, reducing barriers to publication. OKI’s focus in this programme will be to create better open data publication workflows by building on our earlier work on the Frictionless Data initiative.

We will be implementing significant incremental improvements to a range of code libraries and tools that are loosely aligned around our Frictionless Data project, in which we are working on removing the friction in working with data by developing a set of tools, standards, and best practices for publishing data. The work will be presented as part of a new toolkit specifically targeted at both technical and non-technical users of data, within the public sector, businesses, and the data community. We will perform additional user research in government and non-governmental contexts, design and enhance user interfaces for non-technical users, and implement integrations of tooling with existing workflows as well as work towards new ones. The reports, research and tools produced will become practical assets that can be used and added to by others, to continue to explore how data can and should work in our societies and economies.

Innovate UK, the UK’s innovation agency, is providing £6 million over three years to the ODI to advance knowledge and expertise in how data can shape the next generation of public and private services, and create economic growth. The work on improving the conditions for data publishing is one of six projects chosen by the ODI in this first year of the funding. Olivier Thereaux, Head of Technology at the ODI, said:
‘Our goals in this project are to truly understand what barriers exist to publishing high quality data quickly and at reasonable cost. We’re happy to be working with OKI, and to be building on its Frictionless Data initiative to further the development of simpler, faster, higher quality open data publishing workflows.’

On announcing the funding on 17th October, Dr Jeni Tennison, CEO at the ODI said:
‘The work we are announcing today will find the best examples of things working well, so we can share and learn from them. We will take these learnings and help businesses and governments to use them and lead by example.’
A major focus for the Product Team at Open Knowledge International over the last two years has been data quality and automation of data processing. Data quality is arguably the greatest barrier to useful and usable open data, and we’ve been directly addressing this via specifications and tooling in Frictionless Data over the last two years. Our focus in this project will be to develop ways for non-technical users to employ tools for automation, reducing the potential for manual error and increasing productivity.

We see speed of publication and lowering costs of publication as two areas that are directly enhanced by having better tooling and workflows to address quality and automation, and this is something the development of this toolkit will directly address. People are fundamental to quality, curated, open data publication workflows. However, by automating more aspects of the “publication pipeline”, we not only reduce the need for manual intervention but can also increase the speed at which open data is published.

To keep up to date on our progress, join the Frictionless Data Discuss forum, or ask the team a direct question on the gitter channel.

Open data quality – the next shift in open data?

- May 31, 2017 in Data Quality, Global Open Data Index, GODI16, Open Data

This blog post is part of our Global Open Data Index blog series. It is a call to recalibrate our attention to the many different elements contributing to the ‘good quality’ of open data, the trade-offs between them, and how they support data usability (see, for instance, vital work by the World Wide Web Consortium). Focusing on these elements could help support governments in publishing data that can be easily used. The blog post was jointly written by Danny Lämmerhirt and Mor Rubinstein.

Some years ago, open data was heralded as unlocking information to the public that would otherwise remain closed. In the pre-digital age, information was locked away, and an array of mechanisms was necessary to bridge the knowledge gap between institutions and people. So when the open data movement demanded “Openness By Default”, many data publishers followed the call by releasing vast amounts of data in its existing form to bridge that gap.

To date, it seems that opening this data has not reduced but rather shifted and multiplied the barriers to the use of data, as Open Knowledge International’s research around the Global Open Data Index (GODI) 2016/17 shows. Together with data experts and a network of volunteers, our team searched, accessed, and verified more than 1400 government datasets around the world. We found that data is often stored in many different places on the web, sometimes split across documents, or hidden many pages deep on a website. Often data comes in various access modalities. It can be presented in various forms and file formats, sometimes using uncommon signs or codes that are, in the worst case, only understandable to their producer. As the Open Data Handbook states, these emerging open data infrastructures resemble the myth of the ‘Tower of Babel’: more information is produced, but it is encoded in different languages and forms, preventing data publishers and their publics from communicating with one another. What makes data usable under these circumstances?
How can we close the information chain loop? The short answer: by providing ‘good quality’ open data.  

Understanding data quality – from quality to qualities

The open data community needs to shift focus from mass data publication towards an understanding of good data quality. Yet there is no shared definition of what constitutes ‘good’ data quality. Research shows that there are many different interpretations and ways of measuring data quality, including data interpretability, data accuracy, timeliness of publication, reliability, trustworthiness, accessibility, discoverability, processability, and completeness. Since people use data for different purposes, certain data qualities matter more to one user group than to others. Some of these areas are covered by the Open Data Charter, but the Charter does not explicitly name them as ‘qualities’ which sum up to high quality.

Current quality indicators are not complete – and miss the opportunity to highlight quality trade-offs

Existing indicators also assess data quality very differently, potentially framing our language and thinking of data quality in opposite ways. Some indicators focus on the content of data portals (number of published datasets) or on access to data. A small fraction focuses on datasets, their content, structure, understandability, or processability. Even GODI and the Open Data Barometer from the World Wide Web Foundation do not share a common definition of data quality. Arguably, the diversity of existing quality indicators prevents a targeted and strategic approach to improving data quality.

At the moment GODI sets out the following indicators for measuring data quality:
  • Completeness of dataset content
  • Accessibility (access-controlled or public access?)
  • Findability of data
  • Processability (machine-readability and amount of effort needed to use data)
  • Timely publication
This leaves out other qualities. We could ask if data is actually understandable by people. For example, is there a description of what each part of the data content means (metadata)?

Improving quality by improving the way data is produced

Many data quality metrics are (rightfully so) user-focused. However, it is critical that governments as data producers better understand, monitor and improve the inherent quality of the data they produce. Measuring data quality can incentivise governments to design data for impact, by raising awareness of the quality issues that would otherwise make data files practically impossible to use. At Open Knowledge International, we target data producers and the quality issues of data files mostly via the Frictionless Data project. Notable projects include the Data Quality Spec, which defines some essential quality aspects for tabular data files; GoodTables, which provides structural and schema validation of government data; and the Data Quality Dashboard, which enables open data stakeholders to see data quality metrics for entire data collections “at a glance”, including the number of errors in a data file. These tools help to develop a more systematic assessment of the technical processability and usability of data.
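To illustrate the kind of schema-level validation GoodTables performs, here is a small sketch that checks cell values against declared field types. It is a hedged illustration of the idea, with only two types handled, not GoodTables' actual implementation:

```python
def check_types(rows, schema):
    """Check each cell against its declared field type, in the spirit of
    schema validation for tabular data. Simplified sketch: only 'integer'
    and 'string' types; each error records the row, column and bad value."""
    casts = {"integer": int, "string": str}
    errors = []
    fields = schema["fields"]
    for rownum, row in enumerate(rows, start=1):
        for col, (field, value) in enumerate(zip(fields, row), start=1):
            try:
                casts[field["type"]](value)
            except ValueError:
                errors.append({"row": rownum, "column": col, "value": value})
    return errors

# Illustrative schema and data: "n/a" is not a valid integer population.
schema = {"fields": [{"name": "country", "type": "string"},
                     {"name": "population", "type": "integer"}]}
errors = check_types([["fr", "67000000"], ["de", "n/a"]], schema)
# one error: row 2, column 2, value "n/a"
```

Reports like this are what make quality measurable per file: each error is attributable to a specific cell, so publishers can fix data at the source rather than guessing at what users find unusable.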

A call for joint work towards better data quality

We are aware that good data quality requires solutions jointly working together. Therefore, we would love to hear your feedback. What are your experiences with open data quality? Which quality issues hinder you from using open data? How do you define these data qualities? What could the GODI team improve?  Please let us know by joining the conversation about GODI on our forum.

CKAN extensions Archiver and QA upgraded

- January 27, 2016 in Data Quality, Extensions

Popular CKAN extensions ‘Archiver’ and ‘QA’ have recently been significantly upgraded. Now it is relatively simple to add automatic broken link checking and 5 stars of openness grading to any CKAN site. At a time when many open data portals suffer from quality problems, adding these reports makes it easy to identify the problems and get credit when they are resolved.

Whilst these extensions have been around for a few years, most of the development has been on forks, whilst the core has been languishing. In the past couple of months there has been a big push to merge all the efforts from the US (data.gov), Finland, Greece, Slovakia and the Netherlands, and particularly those from the UK (data.gov.uk), into core. It’s been a big leap forward in functionality. Now installers no longer need to customize templates – you get details of broken links and 5 stars shown on every dataset simply by installing and configuring the extensions. And now that we’re all on the same page, we can work together better from now on.

The Archiver Extension regularly tries out all datasets’ data links to see if they are still working. File URLs that do work are downloaded and the user is offered the ‘cached’ copy. URLs that are broken are marked in red and listed in a report. See more: ckanext-archiver repo, docs and demo images.

The QA Extension analyses the data files that Archiver has downloaded to reliably determine their format – CSV, XLS, PDF, etc. – rather than trusting the format that the publisher has declared. This information is combined with the data license and whether the data is currently accessible to give a rating out of 5 according to Tim Berners-Lee’s 5 Stars of Openness. A file that has no open licence, or is not available, gets 0 stars. If it passes those tests but is only a PDF then it gets 1 star. A machine-readable but proprietary format like XLS gets 2 stars, and an open format like CSV gets 3 stars.
4- and 5-star data uses standard schemas and references other datasets, which in practice tends to mean RDF. See the ckanext-qa repo, docs and demo images.
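The grading rules described above can be summarised in a few lines of code. This is an illustrative sketch of the scoring logic, with hypothetical format lists, not the QA extension's actual code:

```python
def openness_stars(fmt, open_license, available, uses_uris=False, linked=False):
    """Score a resource on Tim Berners-Lee's 5 Stars of Openness, following
    the rules described above. The format sets below are illustrative."""
    if not open_license or not available:
        return 0                     # no open licence, or data unavailable
    machine_readable = {"XLS", "XLSX", "CSV", "JSON", "XML", "RDF"}
    open_formats = {"CSV", "JSON", "XML", "RDF"}
    stars = 1                        # openly licensed, any format (e.g. PDF)
    if fmt.upper() in machine_readable:
        stars = 2                    # e.g. XLS: machine-readable but proprietary
    if fmt.upper() in open_formats:
        stars = 3                    # e.g. CSV: machine-readable and open
    if uses_uris:
        stars = 4                    # uses standard schemas / URIs for things
        if linked:
            stars = 5                # links out to other datasets
    return stars
```

For example, an openly licensed PDF scores 1, an XLS scores 2, and a CSV scores 3; anything without an open licence scores 0 regardless of format.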
