You are browsing the archive for Open Standards.

What data counts in Europe? Towards a public debate on Europe’s high value data and the PSI Directive

- January 16, 2019 in Open Government Data, Open Standards, Policy, research

This blogpost was co-authored by Danny Lämmerhirt and Pierre Chrzanowski (*author note at the bottom).

January 22 will mark a crucial moment for the future of open data in Europe. That day, the final trilogue between the European Commission, Parliament, and Council is planned to decide on the ratification of the updated PSI Directive. Among other things, the European institutions will decide what counts as ‘high value’ data. What essential information should be made available to the public, and how those data infrastructures should be funded and managed, are critical questions for the future of the EU. As we will discuss below, there are many ways one might envision the collective ‘value’ of those data. This is a democratic question, and we should not be satisfied with an ill-defined and overly broad proposal. We therefore propose to organise a public debate to collectively define what counts as high value data in Europe.

What does the PSI Directive say about high value datasets?

The European Commission provides several hints in the current revision of the PSI Directive on how it envisions high value datasets. They are determined by one of the following ‘value indicators’:
  • The potential to generate significant social, economic, or environmental benefits,
  • The potential to generate innovative services,
  • The number of users, in particular SMEs,  
  • The revenues they may help generate,  
  • The data’s potential for being combined with other datasets
  • The expected impact on the competitive situation of public undertakings.
Given the strategic role of open data for Europe’s Digital Single Market, these indicators are not surprising. But as we will discuss below, there are several challenges in defining them, and there are different ways of understanding the importance of data. The annex of the PSI Directive also includes a preliminary list of high value data, drawing primarily from the key datasets defined by Open Knowledge International’s (OKI’s) Global Open Data Index, as well as the G8 Open Data Charter Technical Annex. The proposed list of categories and high-value datasets is shown in the table below:
Category Description
1. Geospatial Data Postcodes, national and local maps (cadastral, topographic, marine, administrative boundaries).
2. Earth observation and environment Space and in situ data (monitoring of the weather and of the quality of land and water, seismicity, energy consumption, the energy performance of buildings and emission levels).
3. Meteorological data Weather forecasts, rain, wind and atmospheric pressure.
4. Statistics National, regional and local statistical data with main demographic and economic indicators (gross domestic product, age, unemployment, income, education).
5. Companies Company and business registers (list of registered companies, ownership and management data, registration identifiers).
6. Transport data Public transport timetables of all modes of transport, information on public works and the state of the transport network including traffic information.
  According to the proposal, regardless of who provides them, these datasets shall be available for free, machine-readable and accessible for download, and where appropriate, via APIs. The conditions for re-use shall be compatible with open standard licences.

Towards a public debate on high value datasets at EU level

There have been attempts by EU Member States to define what constitutes high-value data at the national level, with different results. In Denmark, basic data has been defined as the five core types of information that public authorities use in their day-to-day case processing and should release. In France, the law for a Digital Republic aims to make available reference datasets that have the greatest economic and social impact. In Estonia, the country relies on the X-Road infrastructure to connect core public information systems, but most of the data remains restricted.

Now is the time for a shared and common definition of what constitutes high-value datasets at the EU level. And this implies an agreement on how we should define them. However, as it stands, there are several issues with the value indicators that the European Commission proposes. For example, how does one define the data’s potential for innovative services? How can revenue gains be confidently attributed to the use of open data? How does one assess and compare the social, economic, and environmental benefits of opening up data? Anyone designing these indicators must be very cautious, as metrics to compare social, economic, and environmental benefits may come with methodological biases. Research has found, for example, that comparing economic and environmental benefits can unfairly favour data of economic value at the expense of fuzzier social benefits, as economic benefits are often more easily quantifiable and definable by default.

One form of debating high value datasets could be to discuss what data gets currently published by governments and why. For instance, with their Global Open Data Index, Open Knowledge International has long advocated for the publication of disaggregated, transactional spending figures. Another example is OKI’s Open Data For Tax Justice initiative, which sought to influence the requirements for multinational companies to report their activities in each country (so-called ‘Country-By-Country Reporting’) and to shape a standard for publicly accessible key data.

A public debate on high value data should critically examine the European Commission’s considerations regarding the distortion of competition. What market dynamics are engendered by opening up data? To what extent do existing markets rely on scarce and closed information? Does closed data bring about market failure, as some argue (Zinnbauer 2018)? Could it otherwise hamper fair price mechanisms (for a discussion of these dynamics in open access publishing, see Lawson and Gray 2016)? How would open data change existing market dynamics? Which actors claim that opening data could distort markets, and whose interests do they represent?

Lastly, the European Commission does not yet consider cases of government agencies generating revenue from selling particularly valuable data. The Dutch national company register has for a long time been such a case, as has the German Weather Service. Beyond considering competition, a public debate around high value data should take into account how marginal cost recovery regimes currently work.

What we want to achieve

For these reasons, we want to organise a public discussion to collectively define
  1. What should count as a high value dataset, and based on what criteria,
  2. What information high value datasets should include,
  3. What the conditions for access and re-use should be.
The PSI Directive will set the baseline for open data policies across the EU. We are therefore at a critical moment to define what European societies value as key public information. What is at stake is not only a question of economic impact, but the question of how to democratise European institutions, and the role the public can play in determining what data should be opened.

How you can participate

  1. We will use the Open Knowledge forum as the main channel for coordination, exchange of information and debate. To join the debate, please add your thoughts to this thread or feel free to start a new discussion for specific topics.
  2. We gather proposals for high value datasets in this spreadsheet. Please feel free to use it as a discussion document, where we can crowdsource alternative ways of valuing data.
  3. We use the PSI Directive Data Census to assess the openness of high value datasets.
We also welcome any references to scientific papers, blog posts, or other material discussing the issue of high-value datasets. Once we have gathered suggestions for high value datasets, we would like to assess how open the proposed datasets are. This will help to provide European countries with a diagnosis of the openness of key data.

Author note: Danny Lämmerhirt is a senior researcher on open data, data governance, data commons, and metrics to improve open governance. He formerly worked with Open Knowledge International, where he led its research activities, including the methodology development of the Global Open Data Index 2016/17. His work focuses, among other things, on the role of metrics for open government and the effects metrics have on the way institutions work and make decisions. He has supervised and edited several pieces on this topic, including the Open Data Charter’s Measurement Guide. Pierre Chrzanowski is a Data Specialist with the World Bank Group and a co-founder of the Open Knowledge France local group. As part of his work, he developed the Open Data for Resilience Initiative (OpenDRI) Index, a tool to assess the openness of key datasets for disaster risk management projects. He has also participated in the impact assessment prior to the new PSI Directive proposal and has contributed to the Global Open Data Index as well as the Web Foundation’s Open Data Barometer.

Europe’s proposed PSI Directive: A good baseline for future open data policies?

- June 21, 2018 in eu, licence, Open Data, Open Government Data, Open Standards, Policy, PSI, research

Some weeks ago, the European Commission proposed an update of the PSI Directive**. The PSI Directive regulates the reuse of public sector information (including administrative government data), and has important consequences for the development of Europe’s open data policies. Like every legislative proposal, the PSI Directive proposal is open for public feedback until July 13. In this blog post Open Knowledge International presents what we think are necessary improvements to make the PSI Directive fit for Europe’s Digital Single Market.

In a guest blogpost, Ton Zijlstra outlined the changes to the PSI Directive. Another blog post by Ton Zijlstra and Katleen Janssen helps to understand the historical background and puts the changes into context. Whilst improvements have been made, we think the current proposal is a missed opportunity: it does not support the creation of a Digital Single Market and can pose risks for open data. In what follows, we recommend changes to the European Parliament and the European Council. We also discuss actions civil society may take to engage with the directive in the future, and explain the reasoning behind our recommendations.

Recommendations to improve the PSI Directive

Based on our assessment, we urge the European Parliament and the Council to amend the proposed PSI Directive to ensure the following:
  • When defining high-value datasets, the PSI Directive should not rule out data generated under market conditions. A stronger requirement must be added to Article 13 to make assessments of economic costs transparent, and weigh them against broader societal benefits.
  • The public must have access to the methods, meeting notes, and consultations to define high value data. Article 13 must ensure that the public will be able to participate in this definition process to gather multiple viewpoints and limit the risks of biased value assessments.
  • Beyond tracking proposals for high-value datasets in the EU’s Interinstitutional Register of Delegated Acts, the public should be able to suggest new delegated acts for high-value datasets.  
  • The PSI Directive must make clear what “standard open licences” are, by referencing the Open Definition, and explicitly recommending the adoption of Open Definition compliant licences (from Creative Commons and Open Data Commons) when developing new open data policies. The directive should give preference to public domain dedication and attribution licences in accordance with the LAPSI 2.0 licensing guidelines.
  • Governments of EU member states that already have policies on specific licences in use should be required to add legal compatibility tests with other open licences to these policies. We suggest following the recommendations outlined in the LAPSI 2.0 resources to run such compatibility tests.
  • High-value datasets must be reusable with the least restrictions possible, subject at most to requirements that preserve provenance and openness. Currently, the European Commission risks creating use silos if governments are allowed to add “any restrictions on re-use” to the terms of use of high-value datasets.
  • Publicly funded undertakings should only be able to charge marginal costs.
  • Public undertakings, publicly funded research facilities and non-executive government branches should be required to publish data referenced in the PSI Directive.

Conformant licences according to the Open Definition, opendefinition.org/licenses

Our recommendations do not pose unworkable requirements or disproportionately high administrative burden, but are essential to realise the goals of the PSI directive with regards to:
  1. Increasing the amount of public sector data available to the public for re-use,
  2. Harmonising the conditions for non-discrimination, and re-use in the European market,
  3. Ensuring fair competition and easy access to markets based on public sector information,
  4. Enhancing cross-border innovation, and an internal market where Union-wide services can be created to support the European data economy.

Our recommendations, explained: What would the proposed PSI Directive mean for the future of open data?

Publication of high-value data

The European Commission proposes to define a list of ‘high value datasets’ that shall be published under the terms of the PSI Directive. This includes publishing datasets in machine-readable formats, under standard open licences, and in many cases free of charge, except when high-value datasets are collected by public undertakings in environments where free access to data would distort competition. “High value datasets” are defined as documents that bring socio-economic benefits, “notably because of their suitability for the creation of value-added services and applications, and the number of potential beneficiaries of the value-added services and applications based on these datasets”. The EC also makes reference to existing high value datasets, such as the list of key data defined by the G8 Open Data Charter. Identifying high-value data poses at least three problems:
  1. High-value datasets may be unusable in a digital Single Market: The EC may “define other applicable modalities”, such as “any conditions for re-use”. There is a risk that a list of EU-wide high value datasets also includes use restrictions violating the Open Definition. Given that a list of high value datasets will be transposed by all member states, adding “any conditions” may significantly hinder the reusability and ability to combine datasets.
  2. Defining the value of data is not straightforward. Recent papers from Oxford University, Open Data Watch and the Global Partnership for Sustainable Development Data demonstrate disagreement over what data’s “value” is. What counts as high value data should not only be based on quantitative indicators such as growth indicators, numbers of apps or numbers of beneficiaries, but should also draw on qualitative assessments and expert judgement from multiple disciplines.
  3. Public deliberation and participation is key to defining high value data and avoiding biased value assessments. Impact assessments and cost-benefit calculations come with their own methodological biases, and can unfairly favour data with economic value at the expense of fuzzier social benefits. Currently, the PSI Directive does not allow data created under market conditions to be considered high value data if this would distort market conditions. We recommend that the PSI Directive adds a stronger requirement to weigh economic costs against societal benefits, drawing on multiple assessment methods (see point 2). The criteria, methods, and processes used to determine high value must be transparent and accessible to the broader public, to enable the public to negotiate benefits and to reflect the viewpoints of many stakeholders.

Expansion of scope

The new PSI Directive takes into account data from “public undertakings”. This includes services of general interest entrusted to entities outside of the public sector over which government maintains a high degree of control. The PSI Directive also includes data from non-executive government branches (i.e. from the legislative and judiciary branches of government), as well as data from publicly funded research. Opportunities and challenges include:
  • None of the data holders that are planned to be included in the PSI Directive are obliged to publish data; publication is at their discretion. Only if they choose to publish data must they follow the guidelines of the proposed PSI Directive.
  • The PSI Directive aims to keep administrative costs low. All of the above-mentioned data sectors are exempt from data access requests.
  • In summary, the proposed PSI Directive leaves too much space for individual choice over whether to publish data and has no “teeth”. To accelerate the publication of general interest data, the PSI Directive should oblige data holders to publish data. Waiting several years to make the publication of this data mandatory, as happened with the first version of the PSI Directive, risks significantly hampering the availability of key data that is important for accelerating growth in Europe’s data economy.
  • For research data in particular, only data that is already published should fall under the new directive. Even though the PSI Directive will require member states to develop open access policies, the implementation thereof should be built upon the EU’s recommendations for open access.

Legal incompatibilities may jeopardise the Digital Single Market

Most notably, the proposed PSI Directive does not address problems around licensing, which are a major impediment for Europe’s Digital Single Market. Europe’s data economy can only benefit from open data if licence terms are standardised. This allows data from different member states to be combined without legal issues, and enables users to combine datasets, create cross-country applications, and spark innovation. Europe’s licensing ecosystem is a patchwork of many (possibly conflicting) terms, creating use silos and legal uncertainty. But the current proposal not only speaks vaguely about standard open licences and makes national policies responsible for adding “less restrictive terms than those outlined in the PSI Directive”. It also contradicts its aim of smoothing the Digital Single Market by encouraging the creation of bespoke licences, suggesting that governments may add new licence terms with regards to real-time data publication. Currently the PSI Directive would allow the European Commission to add “any conditions for re-use” to high-value datasets, thereby encouraging the creation of legal incompatibilities (see Article 13 (4.a)). We strongly recommend that the PSI Directive draws on the EU co-funded LAPSI 2.0 recommendations to understand licence incompatibilities and ensure a compatible open licence ecosystem.

I’d like to thank Pierre Chrzanowski, Mika Honkanen, Susanna Ånäs, and Sander van der Waal for their thoughtful comments while writing this blogpost.

Image adapted from Max Pixel

** Its official name is the Directive 2003/98/EC on the reuse of public sector information.


Git for Data Analysis – why version control is essential for collaboration and for gaining public trust.

- November 29, 2016 in Featured, Frictionless Data, Open Data, Open Research, Open Science, Open Software, Open Standards

Openness and collaboration go hand in hand. Scientists at PNNL are working with the Frictionless Data team at Open Knowledge International to ensure collaboration on data analysis is seamless and their data integrity is maintained. I’m a computational biologist at the Pacific Northwest National Laboratory (PNNL), where I work on environmental and biomedical research. In our scientific endeavors, the full data life cycle typically involves new algorithms, data analysis and data management. One of the unique aspects of PNNL as a U.S. Department of Energy National Laboratory is that part of our mission is to be a resource to the scientific community. In this highly collaborative atmosphere, we are continuously engaging research partners around the country and around the world.

Image credit: unsplash (public domain)

One of my recent research topics is how to make collaborative data analysis more efficient and more impactful. In most of my collaborations, I work with other scientists to analyze their data and look for evidence that supports or rejects a hypothesis. Because of my background in computer science, I saw many similarities between collaborative data analysis and collaborative software engineering. This led me to wonder, “We use version control for all our software products. Why don’t we use version control for data analysis?” This thought inspired my current project and has prompted other open data advocates like Open Knowledge International to propose source control for data.

Openness is a foundational principle of collaboration. To work effectively as a team, people need to be able to easily see and replicate each other’s work. In software engineering, this is facilitated by version control systems like Git or SVN. Version control has been around for decades and almost all best practices for collaborative software engineering explicitly require version control for complete sharing of source code within the development team. At the moment we don’t have a similarly ubiquitous framework for full sharing in data analysis or scientific investigation. To help create this resource, we started Active Data Biology. Although the tool is still in beta-release, it lays the groundwork for open collaboration.

The original use case for Active Data Biology is to facilitate data analysis of gene expression measurements of biological samples. For example, we use the tool to investigate the changing interaction of a bacterial community over time; another great example is the analysis of global protein abundance in a collection of ovarian tumors. In both of these experiments, the fundamental data consist of two tables: 1) a matrix of gene expression values for each sample; 2) a table of metadata describing each sample. Although the original instrument files used to generate these two simple tables are often hundreds of gigabytes, the actual tables are relatively small.
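To make the two-table structure above concrete, here is a minimal Python sketch; the file and column names are hypothetical and only illustrate the shape of the data, not the actual project files.

```python
# Minimal sketch of the two-table layout: an expression matrix plus sample metadata.
# File names and columns are hypothetical.
import pandas as pd

# 1) Gene expression matrix: one row per gene, one column per sample.
expression = pd.read_csv("expression_matrix.tsv", sep="\t", index_col="gene_id")

# 2) Sample metadata: one row per sample (e.g. age, tumor stage, treatment status).
metadata = pd.read_csv("metadata.tsv", sep="\t", index_col="sample_id")

# Keep only samples present in both tables so downstream analyses line up.
shared = expression.columns.intersection(metadata.index)
expression = expression[shared]
metadata = metadata.loc[shared]

print(expression.shape, metadata.shape)
```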

To work effectively as a team, people need to be able to easily see and replicate each other’s work.

After generating data, the real goal of the experiment is to discover something profoundly new and useful – for example how bacteria growth changes over time or what proteins are correlated with surviving cancer. Such broad questions typically involve a diverse team of scientists and a lengthy and rigorous investigation. Active Data Biology uses version control as an underlying technology to ease collaboration between these large and diverse groups.

Active Data Biology creates a repository for each data analysis project. Inside the repository live the data, analysis software, and derived insight. Just as in software engineering, the repository is shared by various team members and analyses are versioned and tracked over time. Although the framework we describe here was created for our specific biological data application, it is possible to generalize the idea and adapt it to many different domains.

An example repository can be found here. This dataset originates from a proteomics study of ovarian cancer. In total, 174 tumors were analyzed to identify the abundance of several thousand proteins. The protein abundance data is located in this repository. In order to more easily analyze this with our R based statistical code, we also store the data in an Rdata file (data.Rdata). Associated with this data file is a metadata table which describes the tumor samples, e.g. age of the patient, tumor stage, chemotherapy status, etc. It can be found at metadata.tsv. (For full disclosure, and to calm any worries, all of the samples have been de-identified and the data is approved for public release.)

Data analysis is an exploration of data, an attempt to uncover some nugget which confirms a hypothesis. Data analysis can take many forms. For me it often involves statistical tests which calculate the likelihood of an observation. For example, we might observe that a set of genes has a correlated expression pattern and is enriched in a biological process. What is the chance that this observation is random? To answer this, we use a statistical test (e.g. a Fisher’s exact test). As the specific implementation might vary from person to person, having access to the exact code is essential. There is no “half-way” sharing here. It does no good to describe analyses over the phone or through email; your collaborators need your actual data and code.

In Active Data Biology, analysis scripts are kept in the repository. This repository had a fairly simple scope for statistical analysis. The various code snippets handled data ingress, dealt with missing data (a very common occurrence in environmental or biomedical data), performed a standard test and returned the result. Over time, these scripts may evolve and change. This is exactly why we chose to use version control: to effortlessly track and share progress on the project.

We should note that we are not the only ones using version control in this manner. Open Knowledge International has a large number of GitHub repositories hosting public datasets, such as atmospheric carbon dioxide time series measurements. Vanessa Bailey and Ben Bond-Lamberty, environmental scientists at PNNL, used GitHub for an open experiment to store data, R code, a manuscript and various other aspects of analysis. The FiveThirtyEight group, led by Nate Silver, uses GitHub to share the data and code behind their stories and statistical exposés.
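As a concrete illustration of the kind of test described above, here is a minimal Python sketch of a Fisher’s exact test on gene-set enrichment; the contingency-table counts are made up for the example and do not come from the ovarian cancer study.

```python
# Sketch of the enrichment question: is membership in a correlated gene cluster
# associated with membership in a biological process? Counts below are invented.
from scipy.stats import fisher_exact

#                   in process   not in process
# in cluster             30             70
# not in cluster        120           1780
table = [[30, 70],
         [120, 1780]]

odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.2e}")
```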
We believe that sharing analysis in this way is critical both for helping your team work together productively and for gaining public trust. At PNNL, we typically work in a team that includes both computational and non-computational scientists, so we wanted to create an environment where data exploration does not necessarily require computational expertise. To achieve this, we created a web-based visual analytic which exposes the data and capabilities within a project’s GitHub repository. This gives non-computational researchers a more accessible interface to the data, while allowing them access to the full range of computational methods contributed by their teammates.

We first presented the Active Data Biology tool at Nature’s Publishing Better Science through Better Data conference. It was here that we met Open Knowledge International. Our shared passion for open and collaborative data through tools like Git led to a natural collaboration. We’re excited to be working with them on improving access to scientific data and results.

On the horizon, we are working together to integrate Frictionless Data and Good Tables into our tool to help validate and smooth our data access. One of the key aspects of data analysis is that it is fluid; over the course of an investigation your methods and/or data will change. For that reason, it is important that data integrity is always maintained. Good Tables is designed to enforce data quality; consistently verifying the accuracy of our data is essential in a project where many people can update the data.

One of the key aspects of data analysis is that it is fluid…For that reason, it is important that the data integrity is always maintained.

One of our real-world problems is that clinical data for biomedical projects is updated periodically as researchers re-examine patient records. Thus the metadata describing a patient’s survival status or current treatments will change. A second challenge, discovered through experience, is that there are a fair number of entry mistakes, typos and incorrect data formatting. Working with the Open Knowledge International team, we hope to reduce these errors at their origin by enforcing data standards on entry, and continuously throughout the project. I look forward to data analysis having the same culture as software engineering, where openness and sharing have become the norm. Getting there will take a bit of education, as well as working out some standard structures and platforms.
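To illustrate what enforcing data standards on entry could look like in practice, here is a minimal sketch using the goodtables Python library; the file name is hypothetical, and the report fields shown may differ between library versions.

```python
# Sketch: validate a clinical metadata file before it is committed to the repository.
# File name is hypothetical; API and report layout may differ between goodtables versions.
from goodtables import validate

report = validate("clinical_metadata.csv")

if report["valid"]:
    print("Table passed validation")
else:
    for table in report["tables"]:
        for error in table["errors"]:
            # e.g. blank headers, duplicate rows, badly formatted values
            print(error["message"])
```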

International Data Week: From Big Data to Open Data

- October 11, 2016 in Frictionless Data, Open Data, Open Knowledge, Open Research, Open Science, Open Standards, Small Data

Report from International Data Week: Research needs to be reproducible, data needs to be reusable and Data Packages are here to help. International Data Week has come and gone. The theme this year was ‘From Big Data to Open Data: Mobilising the Data Revolution’. Weeks later, I am still digesting all the conversations and presentations (not to mention, bagels) I consumed over its course. For a non-researcher like me, it proved to be one of the most enjoyable conferences I’ve attended with an exciting diversity of ideas on display. In this post, I will reflect on our motivations for attending, what we did, what we saw, and what we took back home.

Three conferences on research data

International Data Week (11-17 September) took place in Denver, Colorado and consisted of three co-located events: SciDataCon, International Data Forum, and the Research Data Alliance (RDA) 8th Plenary. Our main motivation for attending these events was to talk directly with researchers about Frictionless Data, our project oriented around tooling for working with “Data Packages”, an open specification for bundling related data together using a standardized JSON-based description format. The concepts behind Frictionless Data were developed through efforts at improving workflows for publishing open government data via CKAN. Thanks to a generous grant from the Sloan Foundation, we now have the ability to take what we’ve learned in civic tech and pilot this approach within various research communities. International Data Week provided one of the best chances we’ve had so far to meet researchers attempting to answer today’s most significant challenges in managing research data. It was time well spent: over the week I absorbed interesting user stories, heard clearly defined needs, and made connections which will help drive the work we do in the months to come.
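To give a feel for the JSON-based description format mentioned above, here is a minimal sketch of a datapackage.json descriptor written out from Python; the dataset name, file path and fields are invented for illustration.

```python
# Minimal sketch of a Data Package descriptor (invented example dataset).
import json

descriptor = {
    "name": "example-co2-measurements",
    "title": "Example CO2 time series",
    "resources": [
        {
            "name": "co2-monthly",
            "path": "data/co2-monthly.csv",
            "format": "csv",
            "schema": {
                "fields": [
                    {"name": "date", "type": "date"},
                    {"name": "co2_ppm", "type": "number"},
                ]
            },
        }
    ],
}

# The descriptor sits alongside the data files it describes.
with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```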

What are the barriers to sharing research data?

While our aim is to reshape how researchers share data through better tooling and specifications, we first needed to understand what non-technical factors might impede that sharing. On Monday, I had the honor of chairing the second half of a session co-organized by Peter Fitch, Massimo Craglia, and Simon Cox entitled Getting the incentives right: removing social, institutional and economic barriers to data sharing. During this second part, Wouter Haak, Heidi Laine, Fiona Murphy, and Jens Klump brought their own experiences to bear on the subject of what gets in the way of data sharing in research. Mr. Klump considered various models that could explain why and under what circumstances researchers might be keen to share their data—including research being a “gift culture” where materials like data are “precious gifts” to be paid back in kind—while Ms. Laine presented a case study directly addressing a key disincentive for sharing data: fears of being “scooped” by rival researchers. One common theme that emerged across talks was the idea that making it easier to credit researchers for their data via an enabling environment for data citation might be a key factor in increasing data sharing. An emerging infrastructure for citing datasets via DOIs (Digital Object Identifiers) might be part of this. More on this later.

“…making it easier to credit researchers for their data via an enabling environment for data citation might be a key factor in increasing data sharing”

What are the existing standards for research data?

For the rest of the week, I dove into the data details as I presented at sessions on topics like “semantic enrichment, metadata and data packaging”, “Data Type Registries”, and the “Research data needs of the Photon and Neutron Science community”. These sessions proved invaluable as they put me in direct contact with actual researchers, where I learned about the existence (or in some cases, non-existence) of community standards for working with data as well as some of the persistent challenges. For example, the Photon and Neutron Science community has a well-established standard, NeXus, for storing data; however, some researchers highlighted an unmet need for a lightweight solution for packaging CSVs in a standard way. Other researchers pointed out the frustrating inability of common statistical software packages like SPSS to export data into a high quality (e.g. with all relevant metadata) non-proprietary format, as encouraged by most data management plans. And, of course, a common complaint throughout was the amount of valuable research data locked away in Excel spreadsheets with no easy way to package and publish them. These are key areas we are addressing now and in the coming months with Data Packages.
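For the “lightweight solution for packaging CSVs” use case, a sketch along the lines below is roughly what the Frictionless Data tooling aims at; it uses the datapackage Python library, and the exact API may differ between versions.

```python
# Sketch: bundle a folder of CSVs into a Data Package (API may vary by library version).
from datapackage import Package

package = Package()
package.infer("data/*.csv")       # discover CSV resources and infer their table schemas
package.save("datapackage.json")  # write the descriptor alongside the data files
```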

Themes and take-home messages

The motivating factor behind much of the infrastructure and standardization work presented was the growing awareness of the need to make scientific research more reproducible, with the implicit requirement that research data itself be more reusable. Fields as diverse as psychology and archaeology have been experiencing a so-called “crisis” of reproducibility. For a variety of reasons, researchers are failing to reproduce findings from their own or others’ experiments. In an effort to resolve this, concepts like persistent identifiers, controlled vocabularies, and automation played a large role in much of the current conversation I heard.

…the growing awareness of the need to make scientific research more reproducible, with the implicit requirement that research data itself be more reusable”


Persistent Identifiers

Broadly speaking, persistent identifiers (PIDs) are an approach to creating a reference to a digital “object” that (a) stays valid over long periods of time and (b) is “actionable”, that is, machine-readable. DOIs, mentioned above and introduced in 2000, are a familiar approach to persistently identifying and citing research articles, but there is increasing interest in applying this approach at all levels of the research process from researchers themselves (through ORCID) to research artifacts and protocols, to (relevant to our interests) datasets. We are aware of the need to address this use case and, in coordination with our new Frictionless Data specs working group, we are working on an approach to identifiers on Data Packages.

Controlled Vocabularies

Throughout the conference, there was an emphasis on ensuring that records in published data incorporate some idea of semantic meaning, that is, making sure that two datasets that use the same term or measurement actually refer to the same thing by enforcing the use of a shared vocabulary. Medical Subject Headings (MeSH) from the United States National Library of Medicine is a good example of a standard vocabulary that many datasets use to consistently describe biomedical information. While Data Packages currently do not support specifying this type of semantic information in a dataset, the specification is not incompatible with this approach. As an intentionally lightweight publishing format, our aim is to keep the core of the specification as simple as possible while allowing for specialized profiles that could support semantics.

Automation

There was a lot of talk about increasing automation around data publishing workflows. For instance, there are efforts to create “actionable” Data Management Plans that help researchers walk through describing, publishing and archiving their data. A core aim of the Frictionless Data tooling is to automate as many elements of the data management process as possible. We are looking to develop simple tools and documentation for preparing datasets and defining schemas for different types of data so that the data can, for instance, be automatically validated according to defined schemas.
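As a rough illustration of the schema-driven validation described above, the sketch below uses the tableschema Python library; the schema, file name, and error handling are illustrative, and the exact API may differ between versions.

```python
# Sketch: cast and check a CSV against a declared Table Schema (illustrative schema and file).
from tableschema import Table

schema = {
    "fields": [
        {"name": "site_id", "type": "string", "constraints": {"required": True}},
        {"name": "measured_at", "type": "datetime"},
        {"name": "temperature_c", "type": "number"},
    ]
}

table = Table("measurements.csv", schema=schema)
try:
    rows = table.read(keyed=True)  # casting enforces the declared types and constraints
    print(f"{len(rows)} rows passed the schema checks")
except Exception as error:
    print(f"Validation failed: {error}")
```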

Making Connections

Of course, one of the major benefits of attending any conference is the chance to meet and interact with other research projects. For instance, we had great conversations with the Mackenzie DataStream project, an amazing initiative for sharing and exploring water data in the Mackenzie River Basin in Canada. The technology behind this project already uses the Data Packages specifications, so look for a case study on the work done here on the Frictionless Data site soon. There is never enough time in one conference to meet all the interesting people and explore all the potential opportunities for collaboration. If you are interested in learning more about our Frictionless Data project or would like to get involved, check out the links below. We’re always looking for new opportunities to pilot our approach. Together, hopefully, we can reduce the friction in managing research data.

OpenSpending collaborates with Mexico’s Ministry of Finance to standardise and visualise government budget data

- September 9, 2016 in Featured, News, Open Spending, Open Standards

On September 8, 2016, Mexico became the first country to formally adopt the Open Fiscal Data Package, an international open data standard promoted by the Global Initiative for Fiscal Transparency (GIFT), in collaboration with Open Knowledge International and the World Bank, with the support of Omidyar Network. This collaboration is a pioneering step for publishing fiscal information in open formats. Mexico set an example for the OpenSpending community, which intends to make use of the Open Fiscal Data Package and the new tools. The announcement was made during an event hosted by the Ministry of Finance of Mexico to present the Executive’s Budget Proposal for 2017. The Ministry also revealed that it published the 2008-2016 Federal Budget on its website. The data was prepared using the OpenSpending Viewer, a tool which allows users to upload and analyze data and create visualizations.

One of Open Knowledge International’s core projects is OpenSpending, a free and open platform looking to track and analyse public fiscal information globally. The OpenSpending community is made up of citizens, organisations and government partners interested in using and sharing public fiscal data like government budget and spending information. The OpenSpending project is also involved in the creation of tools and standards to ensure this public information is more comparable and useful for a wide range of users.

For the past few months, OpenSpending, in collaboration with the Global Initiative for Fiscal Transparency and the WB-BOOST initiative team, has been working with the Ministry of Finance of Mexico to pilot the OpenSpending tools and the Open Fiscal Data Package (OFDP). The OFDP powers the new version of the OpenSpending tools used to publish Mexico’s Federal Budget data, and helps make data releases more comparable and useful. The data package, embedded on the Ministry of Finance’s web page, enables users to analyse the 2008-2016 budget, create visualizations on all or selected spending sectors, and share their personalized visualizations. All data is available for download in open formats, while the API allows users to create their own apps based on this data. Explore the visualization here.

In the next few months, the OpenSpending team will pilot the OFDP specification in a number of other countries. The specification and the OpenSpending tools are free and available for any interested stakeholder to use. To find out more, get in touch with us on the discussion forum. Upload financial data, browse datasets and learn more about public finances from around the world by visiting OpenSpending – let’s work together to build the world’s largest fiscal data repository.
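As a rough sketch of what building on such a data package might involve, the example below loads a fiscal Data Package with the datapackage Python library and totals amounts by one dimension; the URL, resource name and field names are placeholders, not the Ministry’s actual published package.

```python
# Sketch: read a published fiscal Data Package and aggregate spending by one dimension.
# The URL and field names are placeholders, not the actual Mexican budget package.
from collections import defaultdict
from datapackage import Package

package = Package("https://example.org/mexico-budget/datapackage.json")
rows = package.get_resource("budget").read(keyed=True)

totals = defaultdict(float)
for row in rows:
    totals[row["functional_classification"]] += float(row["amount"])

for sector, amount in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{sector}: {amount:,.0f}")
```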

Half Day Seminar on Open Standards: Document Freedom Day 2015

- March 14, 2015 in DFD15, DFDNepal, Events, FOSS Nepal, Open Knowledge Nepal, Open Standards

Basic Information:
Date: 25th March 2015
Venue: CLASS Nepal, Maitighar, Kathmandu, Nepal
Time: 01:00 PM – 05:00 PM

On the occasion of Document Freedom Day 2015, Open Knowledge Nepal, the Free and Open Source Software (FOSS) Nepal Community and the Centre for Labour and Social Studies (CLASS) Nepal are jointly organizing a “Half Day Seminar on Open Standards”. The seminar will include presentation, training and discussion sessions on Open Standards, Open Data and Data Visualization led by members of Open Knowledge Nepal and the FOSS Nepal community. The target audiences of the event are government officers, labour union representatives and journalists.

What is Document Freedom Day?

Document Freedom Day (DFD) is a day when people come together to inform themselves about the ever-growing importance of Open Standards. With the rise of new technologies and hardware, more and more communication is transmitted as electronic data. At the same time, more and more information is provided in digital formats, or even created in digital form and never transferred to any analogue medium. Various stakeholders try to exploit this by offering communication or information services that use proprietary data formats to lock users into their software, hardware and services. But we do not have to go along with what they want. We can get rid of restrictions and vendor lock-in if we keep using Open Standards: data formats that can be freely implemented in any service, hardware or software.

What are Open Standards?

Open Standards are essential for interoperability and freedom of choice based on the merits of different software applications. They provide freedom from data lock-in and the subsequent supplier lock-in. This makes Open Standards essential for governments, companies, organisations and individual users of information technology.  

Definition

An Open Standard refers to a format or protocol that is:   – Subject to full public assessment and use without constraints in a manner equally available to all parties; – Without any components or extensions that have dependencies on formats or protocols that do not meet the definition of an Open Standard themselves; – Free from legal or technical clauses that limit its utilisation by any party or in any business model; – Managed and further developed independently of any single supplier in a process open to the equal participation of competitors and third parties; – Available in multiple complete implementations by competing suppliers, or as a complete implementation equally available to all parties.  

What Open Standards mean to you

Open Standards ensure that you can:   – Collaborate and communicate with others, regardless of which software they are using – Upgrade or replace your apps and still be able to open and edit your old files – Choose which phone / tablet / computer you want to use without worrying about compatibility   Open Standards ensure that society has:   – More competitive software and tech products – More efficient governmental systems and services – More accessible high-end software for innovation and experimentation

Newsflash! OKFestival Programme Launches

- June 4, 2014 in Events, Featured, Free Culture, Join us, network, News, OKFest, OKFestival, Open Access, Open Data, Open Development, Open Economics, Open GLAM, Open Government Data, Open Humanities, Open Knowledge Foundation, Open Knowledge Foundation Local Groups, Open Research, Open Science, Open Spending, Open Standards, open-education, Panton Fellows, privacy, Public Domain, training, Transparency, Working Groups

At last, it’s here! Check out the details of the OKFestival 2014 programme – including session descriptions, times and facilitator bios here!

We’re using a tool called Sched to display the programme this year and it has several great features. Firstly, it gives individual session organisers the ability to update the details on the session they’re organising; this includes the option to add slides or other useful material. If you’re one of the facilitators we’ll be emailing you to give you access this week.

Sched also enables every user to create their own personalised programme to include the sessions they’re planning to attend. We’ve also colour-coded the programme to help you when choosing which conversations you want to follow: the Knowledge stream is blue, the Tools stream is red and the Society stream is green. You’ll also notice that there are a bunch of sessions in purple which correspond to the opening evening of the festival when we’re hosting an Open Knowledge Fair. We’ll be providing more details on what to expect from that shortly!

Another way to search the programme is by the subject of the session – find these listed on the right hand side of the main schedule – just click on any of them to see a list of sessions relevant to that subject.

As you check out the individual session pages, you’ll see that we’ve created etherpads for each session where notes can be taken and shared, so don’t forget to keep an eye on those too. And finally; to make the conversations even easier to follow from afar using social media, we’re encouraging session organisers to create individual hashtags for their sessions. You’ll find these listed on each session page.

We received over 300 session suggestions this year – the most yet for any event we’ve organised – and we’ve done our best to fit in as many as we can. There are 66 sessions packed into 2.5 days, plus 4 keynotes and 2 fireside chats. We’ve also made space for an unconference over the 2 core days of the festival, so if you missed out on submitting a proposal, there’s still a chance to present your ideas at the event: come ready to pitch! Finally, the Open Knowledge Fair has added a further 20 demos – and counting – to the lineup and is a great opportunity to hear about more projects. The Programme is full to bursting, and while some time slots may still change a little, we hope you’ll dive right in and start getting excited about July!

We think you’ll agree that Open Knowledge Festival 2014 is shaping up to be an action-packed few days – so if you’ve not bought your ticket yet, do so now! Come join us for what will be a memorable 2014 Festival!

See you in Berlin! Your OKFestival 2014 Team

Draft Open Data Policy for Qatar

- April 24, 2014 in Open Government Data, Open Knowledge Foundation Local Groups, Open MENA, Open Standards, Policy

The following post was originally published on the blog of our Open MENA community (Middle East and North Africa). The Qatari Ministry of Information and Communication Technologies (generally referred to as ictQATAR) has launched a public consultation on its draft Open Data Policy. I thus decided to briefly present a (long overdue) outline of Qatar’s Open Data status before providing a few insights into the current Policy document.

Public sector Open Data in Qatar: current status

Due to time constraints, I did not get the chance to properly assess public sector openness for the 2013 edition of the Open Data Index (I served as the MENA editor). My general remarks are as follows (valid both at the end of October 2013 and today):
  • Transport timetables exist online and in digital form but are solely available through non-governmental channels and are in no way available as Open Data. The data is thus neither machine-readable, nor freely accessible as per the Open Definition, nor regularly updated.
  • Government budget, government spending and election results are nowhere to be found online. Although there are no elections in the country (hence no election results to be found; Qatar lacks an elected Parliament), government budget and spending data theoretically exist.
  • The company register is curated by the Qatar Financial Centre Authority, is available online for anyone to read and seems to be up-to-date. Yet the data is not available for download in anything other than PDF (not a machine-readable format) and is not openly licensed, which severely restricts any use one could make of it.
  • National statistics seem to be partly available online through the Qatar Information Exchange office. The data does not, however, seem to be up-to-date, is mostly enclosed in PDFs and is not openly licensed.
  • Legislation content is provided online by Al-Meezan, the Qatari Legal Portal. Although the data seems available in digital form, it does not seem to be up-to-date (no results for 2014, regardless of the query). The licensing of the website is not very clear, as it mentions both “copyright State of Qatar” and “CC-by 3.0 Unported”.
  • Postcodes/Zipcodes seem to be provided through the Qatar Postal Services, yet the service does not seem to provide a list of all postcodes or a bulk download. The data, if we assume it is available, is not openly licensed.
  • A national map at a scale of 1:250,000 or better (1cm = 2.5km) is nowhere to be found online, or at least I did not manage to find one (correct me if I am wrong).
  • Data on emissions of pollutants is not available through the Ministry of Environment. (Such data is defined as “aggregate data about the emission of air pollutants, especially those potentially harmful to human health. “Aggregate” means national-level or more detailed, and on an annual basis or more often. Standard examples of relevant pollutants would be carbon monoxides, nitrogen oxides, or particulate matter.”)
This assessment would produce an overall score of 160 (as per the Open Data Index criteria), which would put Qatar on a par with Bahrain, that is, much lower than other MENA states (e.g., Egypt and Tunisia); the sketch below illustrates how such a criteria-weighted score can be computed. A national portal exists, but its operators do not seem to have grasped what open formats and licensing mean: data is solely provided as PDFs and Excel sheets, and remains the property of the Government. (The portal basically redirects the user to the aforementioned national statistics website.) Lastly, information requests can be made through the portal. The 2013 edition of the Open Data Barometer provides a complementary insight and addresses the crucial questions of readiness and outreach:
[There is] strong government technology capacity, but much more limited civil society and private sector readiness to secure benefits from open data. Without strong foundations of civil society freedoms, the Right to Information and Data Protection, it is likely to be far harder for transparency and accountability benefits of open data to be secured. The region has also seen very little support for innovation with open data, suggesting the economic potential of open data will also be hard to realise. This raises questions about the motivation and drivers for the launch of open data portals and platforms.
Screenshot from the Open Data Barometer 2013.
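To make the scoring above more concrete, here is a minimal sketch of how an Open Data Index-style weighted score could be computed for a single dataset. The criteria names and weights are illustrative assumptions, not the official 2013 weighting, and the answers shown are hypothetical.

```python
# Illustrative sketch: weighted scoring of one dataset against Open Data
# Index-style criteria. Weights and criteria names are assumed, not official.
CRITERIA_WEIGHTS = {
    "exists": 5,
    "digital_form": 5,
    "publicly_available": 5,
    "available_online": 5,
    "free_of_charge": 15,
    "machine_readable": 15,
    "available_in_bulk": 10,
    "openly_licensed": 30,
    "up_to_date": 10,
}  # sums to 100 per dataset

def score_dataset(answers: dict) -> int:
    """Sum the weights of every criterion the dataset satisfies."""
    return sum(w for criterion, w in CRITERIA_WEIGHTS.items() if answers.get(criterion))

# Hypothetical answers for the company register described above: readable online
# and free of charge, but PDF-only, no bulk download, no open licence.
company_register = {
    "exists": True, "digital_form": True, "publicly_available": True,
    "available_online": True, "free_of_charge": True, "up_to_date": True,
    "machine_readable": False, "available_in_bulk": False, "openly_licensed": False,
}

print(score_dataset(company_register))  # 45 out of a possible 100 for this dataset
```

A country-level figure is then simply the sum of such per-dataset scores across all the datasets surveyed.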

2014 Open Data Policy draft

Given the above assessment, I was pleasantly surprised to discover that a draft Open Data Policy is being composed by ictQATAR. The document sets the record straight from the beginning:
Information collected by or for the government is a national resource which should be managed for public purposes. Such information should be freely available for anyone to use unless there are compelling privacy, confidentiality or security considerations by the government. [...] Opening up government data and information is a key foundation to creating a knowledge based economy and society. Releasing up government-held datasets and providing raw data to their citizens, will allow them to transform data and information into tools and applications that help individuals and communities; and to promote partnerships with government to create innovative solutions.
The draft Policy paper then outlines that “all Government Agencies will put in place measures to release information and data”. ictQATAR will be in charge of coordinating those efforts, and each agency will need to nominate a senior manager internally to handle the implementation of the Open Data policy through the identification and release of datasets as well as the follow-up on requests submitted by citizens. The Policy emphasizes that “each agency will have to announce its “Terms of Use” for the public to re-use the data, requirement is at no fees”.

The Policy paper also indicates how the national Open Data portal will operate. It will be “an index to serve as gateway to public for dataset discovery and search, and shall redirect to respective Government Agencies’ data source or webpage for download”. This clearly indicates that each individual Agency will need to create its own website where the data will be released and maintained. The proposed national Open Data portal is also suggested to operate as an aggregator of “all public feedback and requests, and the government agencies’ responses to the same”. Alongside this, the portal will continue to allow the public to submit information requests (as per the freedom of information framework in the country). This is an interesting de facto implementation of the Freedom of Information Act that Qatar still lacks. The draft Policy further states:
Where an Agency decides to make information available to the public on a routine basis, it should do so in a manner that makes the information available to a wide range of users with no requirement for registration, and in a non-proprietary, non-exclusive format.
This is an interesting remark and constitutes one of my main points of criticism of the proposed paper. The latter mentions neither what the recommended formats should be nor anything about licensing. Thus, one is left wondering whether the Agencies should just continue to stick to Microsoft Excel and PDF formats. If these were adopted as the default formats, the released data would not be truly open, as neither of these two formats is considered open and the files are not machine-readable (a prerequisite for data to be defined as open). Instead of going for a lengthy description of various formats, it would have been much more useful to specify a preferred format, e.g. CSV (a minimal sketch of what a machine-readable, openly licensed release could look like is given at the end of this post).

An additional concern is the lack of mention of a license. Even though the Policy paper does a great job emphasizing that the forthcoming data needs to be open for anyone to access, use, reuse and adapt, it makes no mention whatsoever of the envisioned licensing. Would the latter rely on existing Creative Commons licenses? Or would ictQATAR craft its own license, as other governments across the world have done?

An additional reason for concern is the unclear status of payment to access data. The Policy paper mentions at least three times (sections 4.2 (i); 4.4 (ii); Appendix 6, ‘Pricing Framework’ indicator) that the data has to be provided at no cost. Yet the Consultation formulates the question:
Open Data should be provided free of charge where appropriate, to encourage its widespread use. However, where is it not possible, should such data be chargeable and if so, what are such datasets and how should they be charged to ensure they are reasonable?
This question indicates that financial participation from potential users is considered probable. If such a situation materialized, it would be damaging for this promising Open Data Policy, as paying to access data is one of the greatest barriers to accessing information (regardless of how low the fee might be). If data is provided at a cost, it is not Open Data anymore: by definition, Open Data is accessible at no cost to everyone.

My personal impression is that the Policy draft is a step in the right direction. Yet the success of such a policy, if implemented, remains very much dependent on the willingness of the legislator to enable a shift towards increased transparency and accountability. My concerns stem from the fact that national legislation takes precedence over ictQATAR’s policy frameworks, which may make it very difficult to achieve a satisfactory Open Data shift. The Policy draft states:
Agencies may also develop criteria at their discretion for prioritizing the opening of data assets, accounting for a range of factors, such as the volume and quality of datasets, user demand, internal management priorities, and Agency mission relevance, usefulness to the public, etc.
There is a real possibility that an Agency might decide not to open up data because it is deemed potentially harmful to the country’s image or the like. Given that no Freedom of Information Act exists, there is no appeal mechanism allowing one to challenge a negative decision by arguing that the public interest outweighs the claimed security concerns. The real test of how committed to openness and transparency the government and its Agencies are will come at that point. Appendix 6 is thus very imprecise regarding the legal and security constraints that might prevent opening up public sector data. Furthermore, the precedence of the national legislation should not be neglected: for example, it prohibits any auditing or data release related to contracting and procurement; no tenders are published for public scrutiny. Although the country has recently established national anti-corruption institutions, there is a lack of oversight of the Emir’s decisions. According to the Transparency International Government Defence Anti-Corruption Index 2013, “the legislature is not informed of spending on secret items, nor does it view audit reports of defence spending and off-budget expenditure is difficult to measure”.

– Note: I have responded to the consultation in my personal capacity (not as OpenMENA). My response contains additional insights which I have chosen not to feature here.
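As promised above, here is a minimal sketch of what a machine-readable, openly licensed release could look like in practice. Everything in it is hypothetical: the dataset, the field names, the CC BY 4.0 licence choice, the URL and the portal-style metadata record are illustrative assumptions, not part of the draft Policy.

```python
# Hypothetical sketch: export a table as CSV (machine-readable, non-proprietary)
# and describe it in a small metadata record so an index-style portal could list it
# and redirect users to the agency's own download page. Illustrative only.
import csv
import json

rows = [  # made-up sample records
    {"company_id": "QFC-0001", "name": "Example Trading LLC", "status": "active"},
    {"company_id": "QFC-0002", "name": "Sample Holdings", "status": "dissolved"},
]

with open("company_register.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["company_id", "name", "status"])
    writer.writeheader()
    writer.writerows(rows)

metadata = {  # minimal, catalogue-style record; field names are assumptions
    "title": "Company register (sample extract)",
    "format": "CSV",
    "license": "CC-BY-4.0",  # an explicit open licence, rather than an unstated copyright
    "download_url": "https://example.gov.qa/data/company_register.csv",  # hypothetical URL
    "maintainer": "Example Agency",
}

with open("company_register.meta.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```

The point is simply that a CSV file plus an explicit open licence in the metadata removes both obstacles discussed above, whatever concrete schema or portal software the government eventually adopts.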

The Open Definition in context: putting open into practice

- October 16, 2013 in Featured, linked-open-data, Open Data, Open Definition, Open Knowledge Definition, Open Standards

We’ve seen how the Open Definition can apply to data and content of many types published by many different kinds of organisation. Here we set out how the Definition relates to specific principles of openness, and to definitions and guidelines for different kinds of open data.

Why we need more than a Definition

The Open Definition does only one thing: as clearly and concisely as possible it defines the conditions for a piece of information to be considered ‘open’. The Definition is broad and universal: it is a key unifying concept which provides a common understanding across the diverse groups and projects in the open knowledge movement. At the same time, the Open Definition doesn’t provide in-depth guidance for those publishing information in specific areas, so detailed advice and principles for opening specific types of information – from government data, to scientific research, to the digital holdings of cultural heritage institutions – is needed alongside it. For example, the Open Definition doesn’t specify whether data should be timely; and yet this is a great idea for many data types. It doesn’t make sense to ask whether census data from a century ago is “timely” or not though! Guidelines for how to open up information in one domain can’t always be straightforwardly reapplied in another, so principles and guidelines for openness targeted at particular kinds of data, written specifically for the types of organisation that might be publishing them, are important. These sit alongside the Open Definition and help people in all kinds of data fields to appreciate and share open information, and we explain some examples here.
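As a very rough illustration of how narrow that job is, here is a sketch of a naive “does this look open?” check. The licence identifiers, field names and format whitelist are assumptions made for this example; the Open Definition itself, not a code snippet, is the authority on what counts as open.

```python
# Naive screening check inspired by (but much narrower than) the Open Definition.
# Licence IDs, field names and the format whitelist below are assumptions.
CONFORMANT_LICENSES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "ODbL-1.0", "PDDL-1.0"}
OPEN_FORMATS = {"CSV", "JSON", "XML", "RDF"}

def looks_open(record: dict) -> bool:
    """Rough proxy: open licence + free access + open, machine-readable format."""
    return (
        record.get("license") in CONFORMANT_LICENSES
        and record.get("free_of_charge", False)
        and record.get("format", "").upper() in OPEN_FORMATS
    )

print(looks_open({"license": "CC-BY-4.0", "free_of_charge": True, "format": "csv"}))            # True
print(looks_open({"license": "All rights reserved", "free_of_charge": True, "format": "PDF"}))  # False
```

Notice how much such a check leaves out: it says nothing about timeliness, completeness, or domain-specific publishing practice, which is exactly why the principles described below are needed alongside the Definition.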

Principles for Open Government Data

In 2007 a group of open government advocates met to develop a set of principles for open government data, which became the “8 Principles of Open Government Data”. In 2010, the Sunlight Foundation revised and built upon this initial set with their Ten Principles for Opening up Government Information, which have set the standard for open government information around the world. These principles may apply to other kinds of data publisher too, but they are specifically designed for open government, and implementation guidance and support is focused on this domain. The principles share many of the key aspects of the Open Definition, but include additional requirements and guidance specific to government information and the ways it is published and used. The Sunlight principles cover the following areas: completeness, primacy, timeliness, ease of physical and electronic access, machine readability, non-discrimination, use of commonly owned standards, licensing, permanence, and usage costs.

Tim Berners-Lee’s 5 Stars for Linked Data

In 2010, Web Inventor Tim Berners-Lee created his 5 Stars for Linked Data, a scheme which aims to encourage more people to publish Linked Data – that is, to use a particular set of technical standards and technologies for making information interoperable and interlinked. The first three stars (legal openness, machine readability, and non-proprietary format) are covered by the Open Definition, and the two additional stars add the Linked Data components (in the form of RDF, a technical specification). The 5 stars have been influential in various parts of the open data community, especially those interested in the semantic web and the vision of a web of data, although there are many other ways to connect data together.
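As a rough sketch of the jump from three stars to five, the snippet below expresses a single, invented statistic as RDF and links it to an external identifier so others can connect to it. It assumes the third-party rdflib library (version 6 or later, where serialize returns a string); the namespaces, URIs and the figure itself are made up for illustration.

```python
# 4th/5th-star sketch: describe a made-up data point as RDF triples, link it to
# an external resource (DBpedia), and serialise it as Turtle.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/stats/")   # hypothetical dataset namespace
VOC = Namespace("http://example.org/vocab#")  # stand-in for a real statistics vocabulary

g = Graph()
g.bind("ex", EX)
g.bind("voc", VOC)

obs = EX["unemployment-2013"]
g.add((obs, RDF.type, VOC["Observation"]))
g.add((obs, VOC["refArea"], URIRef("http://dbpedia.org/resource/Qatar")))  # 5th star: link to others' data
g.add((obs, VOC["value"], Literal("0.3", datatype=XSD.decimal)))           # invented figure

print(g.serialize(format="turtle"))  # non-proprietary, machine-readable, interlinked
```

The first three stars are largely a matter of licensing and format choices; the extra effort of the last two lies in modelling and linking, which is why the scheme resonates most with the semantic web community.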

Principles for specific kinds of information

At the Open Knowledge Foundation many of our Working Groups have been involved with others in creating principles for various types of open data and fields of work with an open element. Such principles frame the work of their communities, set out best practice as well as legal, regulatory and technical standards for openness and data, and have been endorsed by many leading people and organisations in each field. These include:

The Open Definition: the key principle powering the Global Open Knowledge Movement

All kinds of individuals and organisations can open up information: government, public sector bodies, researchers, corporations, universities, NGOs, startups, charities, community groups, individuals and more. That information can be in many formats – it may be spreadsheets, databases, images, texts, linked data, and more; and it can be information from any field imaginable – such as transport, science, products, education, sustainability, maps, legislation, libraries, economics, culture, development, business, design, finance and more. Each of these organisations, kinds of information, and the people who are involved in preparing and publishing the information, has its own unique requirements, challenges, and questions. Principles and guidelines (plus training materials, technical standards and so on!) to support open data activities in each area are essential, so those involved can understand and respond to the specific obstacles, challenges and opportunities for opening up information. Creating and maintaining these is a major activity for many of the Open Knowledge Foundation’s Working Groups as well as other groups and communities. At the same time, those working on openness in many different areas – whether open government, open access, open science, open design, or open culture – have shared interests and goals, and the principles and guidelines for some different data types can and do share many common elements, whilst being tailored to the specific requirements of their communities. The Open Definition provides the key principle which connects all these groups in the global open knowledge movement.

More about openness coming soon

Don’t miss our other posts about Defining Open Data, exploring the Open Definition, why having a shared and agreed definition of open data is so important, and how one can go about “doing open data”.