You are browsing the archive for Open Knowledge.

Podcast: Pavel Richter on the value of open data

Open Knowledge International - August 25, 2017 in Interviews, Open Knowledge, podcasts

This month Pavel Richter, CEO of Open Knowledge International, was interviewed by Stephen Ladek of Aidpreneur for the 161st episode of his Terms of Reference podcast. Aidpreneur is an online community focused on social enterprise, humanitarian aid and international development that runs this podcast to cover important topics in the social impact sector. Under the title ‘Supporting The Open Data Movement’, Stephen Ladek and Pavel Richter discuss a range of topics surrounding open data, such as what open data means, how open data can improve people’s lives (including the role it can play in aid and development work) and the current state of openness in the world. As Pavel phrases it: “There are limitless ways where open data is part of your life already, or at least should be”.

Pavel Richter joined Open Knowledge International as CEO in April 2015, following five years of experience as Executive Director of Wikimedia Deutschland. He explains how Open Knowledge International has set its focus on bridging the gap between the people who could make the best use of open data (civil society organisations and activists in areas such as human rights, health or the fight against corruption) and the people who have the technical knowledge of how to work with data. OKI can make an impact by bridging this gap, empowering these organisations to use open data to improve people’s lives.

The podcast goes into several examples that demonstrate the value of open data in our everyday life, from how OpenStreetMap was used by volunteers following the Nepal earthquake to map where roads were destroyed or still accessible, to governments opening up financial data on tax returns or on how foreign aid money is spent, to projects such as OpenTrials opening up clinical trial data, so that people are able to get information on what kind of drugs are being tested for effectiveness against viruses such as Ebola or Zika. In addition, Stephen Ladek and Pavel Richter discuss questions surrounding potential misuse of open data, the role of the cultural context in open data, and the current state of open data around the world, as measured in recent initiatives such as the Open Data Barometer and the Global Open Data Index.

Listen to the full podcast below, or visit the Aidpreneur website for more information.

Fostering open, inclusive, and respectful participation

Sander van der Waal - August 21, 2017 in community, network, Open Knowledge, Open Knowledge international Local Groups

At Open Knowledge International we have been involved with various projects with other civil society organisations aiming for the release of public interest data, so that anyone can use it for any purpose. More importantly, we focus on putting this data to use, to help it fulfil its potential of working towards fairer and more just societies. Over the last year, we started the first phase of the project Open Data for Tax Justice, because we and our partners believe the time is right to demand that more data be made openly available to scrutinise the activities of businesses. In an increasingly globalised world, multinational corporations have tools and techniques at their disposal to minimise their overall tax bill, and many believe that this gives them an unfair advantage over ordinary citizens. Furthermore, the extent to which these practices take place is unknown, because the taxes that multinational corporations pay in all jurisdictions in which they operate are not reported publicly. By changing that we can have a proper debate about whether the rules are fair, or whether changes will need to be made to share the tax bill in a different way.

For us at Open Knowledge International, this is an entry into a new domain. We are not tax experts, but instead we rely on the expertise of our partners. We are open to engaging all experts to help shape and define together how data should be made available, and how it can be put to use to work towards tax systems that enjoy more trust from citizens. Unsurprisingly, in such a complex and continuously developing field, debates can get very heated. People are obviously very passionate about this, and being passionate open data advocates ourselves, we sympathise. However, we think it is crucial that the passion to strive for a better world should never escalate to personal insults, ad-hominem attacks, or violate basic norms in any other way. Unfortunately, this happened recently with a collaborator on a project. While they made clear they were not affiliated with Open Knowledge International, their actions nevertheless reflected very badly on the overall project, and we deeply condemn them.

Moving forward, we want to make more explicitly clear what behaviour is and is not acceptable within the context of the projects we are part of. To that end, we are publishing project participation guidelines that make clear how we define acceptable and unacceptable behaviour, and what you can do if you feel any of these guidelines are being violated. We invite your feedback on these guidelines, as it is important that these norms are shared among our community. So please let us know on our Open Knowledge forum what you think and where you think these guidelines could be improved.

Furthermore, we would like to make clear what the communities we are part of, like the one around tax justice, can expect from Open Knowledge International beyond enforcing the basic behavioural norms that we set out in the guidelines linked above. Being in the business of open data, we love facts and aim to record many facts in the databases we build. However, facts can be used to reach different and sometimes even conflicting conclusions. Some partners engage heavily on social media channels like Twitter to debate conflicting interpretations, and other partners choose different channels for their work. Open Knowledge International is not, and will never be, in a position to be the arbiter of all interpretations that partners make of the data that we publish.
Our expertise is in building open databases, helping put the data to use, and convening communities around the work that we do. On the subject matter of, for example, tax justice, we are more like interested members of the public who care about the topic, and we rely on the debate being led by experts in the field. Where we spot abuse of the data published in databases we run, or obvious misrepresentation of the data, we will speak out. But we will not monitor or take a stance on all issues that are being debated by our partners and the wider communities around our projects.

Finally, we strongly believe that the open knowledge movement is best served by open and diverse participation. We aim for the project participation guidelines to spell out our expectations and hope these will help us move towards developing more inclusive and diverse communities, where everyone who wants to participate respectfully feels welcome to do so. Do you think these guidelines are a step in the right direction? What else do you feel we should be doing at Open Knowledge International? We look forward to hearing from you in our forum.

OpenSpending platform update

Paul Walsh - August 16, 2017 in Open Knowledge, Open Spending

Introduction

OpenSpending is a free, open and global platform to search, visualise, and analyse fiscal data in the public sphere. This week, we soft launched an updated technical platform, with a newly designed landing page. Until now dubbed “OpenSpending Next”, this is a completely new iteration on the previous version of OpenSpending, which has been in use since 2011. At the core of the updated platform is Fiscal Data Package. This is an open specification for describing and modelling fiscal data, and has been developed in collaboration with GIFT. Fiscal Data Package affords a flexible approach to standardising fiscal data, minimising constraints on publishers and source data via a modelling concept, and enabling progressive enhancement of data description over time. We’ll discuss this in more detail below. From today:
  • Publishers can get started publishing fiscal data with the interactive Packager, and explore the possibilities of the platform’s rich API, advanced visualisations, and options for integration.
  • Hackers can work on a modern stack designed to liberate fiscal data for good! Start with the docs, chat with us, or just start hacking.
  • Civil society can access a powerful suite of visualisation and analysis tools, running on top of a huge database of open fiscal data. Discover facts, generate insights, and develop stories. Talk with us to get started.
All the work that went into this new version of OpenSpending was only made possible by our funders along the way. We want to thank Hewlett, Adessium, GIFT, and the OpenBudgets.eu consortium for helping fund this work. As this is now completely public, replacing the old OpenSpending platform, we do expect some bugs and issues. If you see anything, please help us by opening a ticket on our issue tracker.

Features

The updated platform has been designed primarily around the concept of centralised data, decentralised views: we aim to create a large, and comprehensive, database of fiscal data, and provide various ways to access that data for others to build localised, context-specific applications on top. The major features of relevance to this approach are described below.

Fiscal Data Package

As mentioned above, Fiscal Data Package affords a flexible approach to standardising fiscal data. Fiscal Data Package is not a prescriptive standard, and imposes no strict requirements on source data files. Instead, users “map” source data columns to “fiscal concepts”, such as amount, date, functional classification, and so on, so that systems that implement Fiscal Data Package can process a wide variety of sources without requiring change to the source data formats directly. A minimal Fiscal Data Package only requires mapping an amount and a date concept. There are a range of additional concepts that make fiscal data usable and useful, and we encourage the mapping of these, but do not require them for a valid package. Based on this general approach to specifying fiscal data with Fiscal Data Package, the updated OpenSpending likewise imposes no strict requirements on naming of columns, or the presence of columns, in the source data. Instead, users (of the graphical user interface, and also of the application programming interfaces) can provide any source data, and iteratively create a model on top of that data that declares the fiscal measures and dimensions.
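To make the modelling idea concrete, the sketch below shows roughly what a descriptor mapping a source CSV's columns to the amount and date concepts might look like, written as a Python dictionary. The file name, column names and exact property layout are illustrative (the specification was still pre-v1 at the time of writing), so read this as a hedged example of the approach rather than a normative descriptor.

```python
# Rough sketch of a Fiscal Data Package descriptor: a plain tabular
# resource plus a "model" that maps source columns to fiscal concepts.
# Column names and model properties are illustrative, not normative.
import json

descriptor = {
    "name": "example-municipal-budget",
    "resources": [
        {
            "name": "budget-2017",
            "path": "budget-2017.csv",  # hypothetical source file
            "schema": {
                "fields": [
                    {"name": "expenditure", "type": "number"},
                    {"name": "fiscal_year", "type": "date"},
                    {"name": "function_code", "type": "string"},
                ]
            },
        }
    ],
    "model": {
        # Required concepts: an amount (measure) and a date dimension.
        "measures": {
            "amount": {"source": "expenditure", "currency": "EUR"}
        },
        "dimensions": {
            "date": {
                "attributes": {"year": {"source": "fiscal_year"}}
            },
            # Optional but encouraged: a functional classification.
            "functional-classification": {
                "attributes": {"code": {"source": "function_code"}}
            },
        },
    },
}

print(json.dumps(descriptor, indent=2))
```

The source CSV itself is untouched; only the descriptor records how its columns map to fiscal concepts, which is what lets the platform accept data in whatever shape publishers already have.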

GUIs

Packager

The Packager is the user-facing app that is used to model source data into Fiscal Data Packages. Using the Packager, users first get structural and schematic validation of the source files, ensuring that data entering the platform is validly formed, and then they can model the fiscal concepts in the file in order to publish the data. After initial modelling of data, users can also remodel their data sources for a progressive enhancement approach to improving data added to the platform.
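The Packager performs this validation in the browser, but the same kind of structural check can be run locally with goodtables, Open Knowledge International's tabular validation library. A minimal sketch follows; the file name is a placeholder and the exact report fields depend on the library version installed.

```python
# Minimal structural/schematic check of a source file before modelling.
# Requires: pip install goodtables
from goodtables import validate

report = validate('budget-2017.csv')  # hypothetical source file

if report['valid']:
    print('File is well formed and ready to be modelled.')
else:
    # Report layout as per goodtables' documented output: a list of
    # tables, each with a list of errors.
    for table in report['tables']:
        for error in table['errors']:
            print(error.get('code'), error.get('message', ''))
```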

Explorer

The Explorer is the user-facing app for exploration and discovery of data available on the platform.

Viewer

The Viewer is the user-facing app for building visualisations around a dataset, with a range of options for presentation and for embedding views into third-party websites.

DataMine

The DataMine is a custom query interface powered by Re:dash for deep investigative work over the database. We’ve included the DataMine as part of the suite of applications as it has proved incredibly useful when working in conjunction with data journalists and domain experts, and also for doing quick prototype views on the data, without the limits of API access, as one can use SQL directly.

APIs

Datastore

The Datastore is a flat file datastore with source data stored in Fiscal Data Packages, providing direct access to the raw data. All other databases are built from this raw data storage, providing us with a clear mechanism for progressively enhancing the database as a whole, as well as building on this to provide such features directly to users.

Analytics and Search

The Analytics API provides a rich query interface for datasets, and the search API provides exploration and discovery capabilities across the entire database. At present, search only goes over metadata, but we have plans to iterate towards full search over all fiscal data lines.

Data Importers

Data Importers are based on a generic data pipelining framework developed at Open Knowledge International called Data Package Pipelines. Data Importers enable us to do automated ETL to get new data into OpenSpending, including the ability to update data from the source at specified intervals. We see Data Importers as key functionality of the updated platform, allowing OpenSpending to grow well beyond the one thousand plus datasets that have been uploaded manually over the last five or so years, towards tens of thousands of datasets. A great example of how we’ve put Data Importers to use is in the EU Structural Funds data that is part of the Subsidy Stories project.
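The pipelines themselves are declared in Data Package Pipelines configuration files, but the core extract-describe-publish step that a Data Importer automates can be sketched in a few lines of Python with the datapackage library. The source URL and file names below are placeholders, and a real importer would add the fiscal modelling and scheduling on top.

```python
# Conceptual sketch of the ETL step a Data Importer automates:
# fetch source data, describe it as a Data Package, save the result.
# Requires: pip install requests datapackage
import requests
from datapackage import Package

SOURCE_URL = 'https://example.org/spending-2017.csv'  # hypothetical source

# 1. Extract: download the raw file.
response = requests.get(SOURCE_URL)
response.raise_for_status()
with open('spending-2017.csv', 'wb') as f:
    f.write(response.content)

# 2. Describe: infer a schema and wrap the file in a Data Package.
package = Package()
package.infer('spending-2017.csv')

# 3. Load: write the datapackage.json descriptor for downstream processing.
package.save('datapackage.json')
```

Running this on a schedule (and re-running it when the source changes) is essentially what keeps an imported dataset up to date.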

Iterations

It is slightly misleading to announce the launch today, when we’ve in fact been using and iterating on OpenSpending Next for almost 2 years. Some highlights from that process that have led to the platform we have today are as follows.

SubsidyStories.eu with Adessium

Adessium provided Open Knowledge International with funding towards fiscal transparency in Europe, which enabled us to build out significant parts of the technical platform, commission work with J++ on Agricultural Subsidies, and engage in a productive collaboration with Open Knowledge Germany on what became SubsidyStories.eu, which even led to another initiative from Open Knowledge Germany called The Story Hunt. This work directly contributed to the technical platform by providing an excellent use case for the processing of a large, messy amount of source data into a normalised database for analysis, and doing so while maintaining data provenance and the reproducibility of the process. There is much to do in streamlining this workflow, but the benefits, in terms of new use cases for the data, are extensive. We are particularly excited by this work, and the potential to continue in this direction, by building out a deep, open database as a potential tool for investigation and telling stories with data.

OpenBudgets.eu via Horizon 2020

As part of the OpenBudgets.eu consortium, we were able to both build out parts of the technical platform, and have a live use case for the modularity of the general architecture we followed. A number of components from the core OpenSpending platform have been deployed into the OpenBudgets.eu platform with little to no modification, and the analytical API from OpenSpending was directly ported to run on top of a triple store implementation of the OpenBudgets.eu data model. An excellent outcome of this project has been the close and fruitful work with both Open Knowledge Germany and Open Knowledge Greece on technical, community, and journalistic opportunities around OpenSpending, and we plan for continuing such collaborations in the future.

Work on Fiscal Data Package with GIFT

Over three phases of work since 2015 (the third phase is currently running), we’ve been developing Fiscal Data Package as a specification to publish fiscal data against. Over this time, we’ve done extensive testing of the specification against a wide variety of data in the wild, and we are iterating towards a v1 release of the specification later this year. We’ve also been piloting the specification, and OpenSpending, with national governments. This has enabled extensive testing of both the manual modeling of data to the specification using the OpenSpending Packager, and automated ETL of data into the platform using the Data Package Pipelines framework. This work has provided the opportunity for direct use by governments of a platform we initially designed with civil society and civic tech actors in mind. We’ve identified difficulties and opportunities in this arena at both the implementation and the specification level, and we look forward to continuing this work and solving use cases for users inside government.

Credits

Many people have been involved in building the updated technical platform. Work started back in 2014 with an initial architectural vision articulated by our peers Tryggvi Björgvinsson and Rufus Pollock. The initial vision was adapted and iterated on by Adam Kariv (Technical Lead) and Sam Smith (UI/X), with Levko Kravets, Vitor Baptista, and Paul Walsh. We reused and enhanced code from Friedrich Lindenberg. Lazaros Ioannidis and Steve Bennett made important contributions to the code and the specification respectively. Diana Krebs, Cecile Le Guen, Vitoria Vlad and Anna Alberts have all contributed with project management, and feature and design input.

What’s next?

There is always more work to do. In terms of technical work, we have a long list of enhancements.
However, while the work we’ve done in the last years has been very collaborative with our specific partners, and always towards identified use cases and user stories in the partnerships we’ve been engaged in, it has not, in general, been community facing. In fact, a noted lack of community engagement goes back to before we started on the new platform we are launching today. This has to change, and it will be an important focus moving forward. Please drop by at our forum for any feedback, questions, and comments.

Using the Global Open Data Index to strengthen open data policies: Best practices from Mexico

Oscar Montiel - August 16, 2017 in Global Open Data Index, Open Data Index, Open Government Data, Open Knowledge

This is a blog post coauthored with Enrique Zapata, of the Mexican National Digital Strategy. As part of the last Global Open Data Index (GODI), Open Knowledge International (OKI) decided to have a dialogue phase, where we invited individuals, CSOs, and national governments to exchange different points of view and knowledge about the data, and to understand data publication in a more useful way. In this process, we had a number of valuable exchanges that we tried to capture in our report about the state of open government data in 2017, as well as in the records of the forum. Additionally, we decided to highlight the dialogue process between the government and civil society in Mexico and its results towards improving data publication in the executive authority, as well as funding to expand this work to other authorities and improve the GODI process. Here is what we learned from the Mexican dialogue:

The submission process

During this stage, GODI tries to directly evaluate how easy it is to find the requested datasets and what their data quality is in general. To achieve this, civil society and government actors discussed how best to submit and agreed to submit together, based on the actual data availability. Besides creating an open space to discuss open data in Mexico and agreeing on a joint submission process, this exercise showed some room for improvement in the characteristics that GODI measured in 2016:
  • Open licenses: In Mexico and many other countries, licenses are linked to datasets through open data platforms. This showed some discrepancies with the sources referenced by the reviewers, since the data could be found on different sites where it was not clear which license applied.
  • Data findability: Most of the requested datasets assessed in GODI are the responsibility of the federal government and are available on datos.gob.mx. Nevertheless, the titles used to identify the datasets are based on technical regulation needs, which makes it difficult for data users to easily reach the data.
  • Differences of government levels and authorities: GODI assesses national governments but some of these datasets – such as land rights or national laws – are in the hands of other authorities or local governments. This meant that some datasets can’t be published by the federal government since it’s not in their jurisdiction and they can’t make publication of these data mandatory.
 

Open dialogue and the review process

During the review stage, taking the feedback into account, the Open Data Office of the National Digital Strategy worked on some of the issues raised. They convened a new session with civil society, including representatives from the Open Data Charter and OKI, in order to:
  • Agree on the state of the data in Mexico according to GODI characteristics;
  • Show the updates and publication of data requested by GODI;
  • Discuss paths to publish data that is not the responsibility of the federal government;
  • Converse about how they could continue to strengthen the Mexican Open Data Policy.
The results

As a result of this dialogue, we agreed on six actions that could be implemented internationally beyond the Mexican context, both by governments with centralised open data repositories and by those which don't centralise their data, as well as a way to improve the GODI methodology:
  1. Open dialogue during the GODI process: Mexico was the first country to develop a structured dialogue to agree with open data experts from civil society about submissions to GODI. The Mexican government will seek to replicate this process in future evaluations and include new groups to promote open data use in the country. OKI will take this experience into account to improve the GODI processes in the future.
  2. Open licenses by default: The Mexican government is reviewing and modifying their regulations to implement the terms of Libre Uso MX for every website, platform and online tool of the national government. This is an example of good practice which OKI have highlighted in our ongoing Open Licensing research.
  3. “GODI” data group in CKAN: Most data repositories allow users to create thematic groups. In the case of GODI, the Mexican government created the “Global Open Data Index” group in datos.gob.mx. This will allow users to access these datasets based on their specific needs.
  4. Create a link between government built visualization tools and datos.gob.mx: The visualisations and reference tools tend to be the first point of contact for citizens. For this reason, the Mexican government will have new regulations in their upcoming Open Data Policy so that any new development includes visible links to the open data they use.
  5. Multiple access points for data: In August 2018, the Mexican government will launch a new section on datos.gob.mx to provide non-technical users easy access to valuable data. These data, called ‘Infraestructura de Datos Abiertos MX’, will be divided into five easy-to-explore and easy-to-understand categories.
  6. Common language for data sets: Government naming conventions aren’t the easiest to understand and can make it difficult to access data. The Mexican government has agreed to change the names to use more colloquial language, which can help with data findability and promote reuse. In case this is not possible for some datasets, the government will go for an option similar to the one established in point 5.
We hope these changes will be useful for data users as well as other governments who are looking to improve their publication policies. Got any other ideas? Share them with us on Twitter by messaging @OKFN or send an email to index@okfn.org.

Data-cards – a design pattern

Sam Smith - August 15, 2017 in Frictionless Data, Open Knowledge

Cross-posted on smth.uk
It can be useful to recognise patterns in the challenges we face, and in our responses to those challenges. In doing this, we can build a library of solutions, a useful resource when similar challenges arise in the future. When working on innovative projects, as is often the case at Open Knowledge International, creating brand new challenges is inevitable. With little or no historical reference material on how best to tackle these challenges, paying attention to your own repeatable solutions becomes even more valuable. From a user interface design point of view, these solutions come in the form of design patterns – reusable solutions to commonly occurring problems. Identifying and using design patterns can help create familiar processes for users; and by not reinventing the wheel, you can save time in production too.

In our work on Data Packages, we are introducing a new task into the world – creating those data packages. This task can be quite simple, and it will ultimately be time saving for people working with data. That said, there is no escaping the fact that this is a task that has never before been asked of people, one that will need to be done repeatedly, and potentially, from within any number of interfaces. It has been my task of late to design some of these interfaces; I’d like to highlight one pattern that is starting to emerge – the process of describing, or adding metadata to, the columns of a data table.

I was first faced with this challenge when working on OS Packager. The objective was to present a recognisable representation of the columns, and facilitate the addition of metadata for each of those columns. The adding of data would be relatively straightforward, a few form fields. The challenge lay in helping the user to recognise those columns from the tables they originated in. As anyone who works with spreadsheets on a regular basis will know, they aren’t often predictably or uniformly structured, meaning it is not always obvious what you’re looking at. Take them out of the familiar context of the application they were created in, and this problem could get worse. For this reason, just pulling a table header is probably not sufficient to identify a column. We wanted to provide a preview of the data, to give the best chance of it being recognisable. In addition to this, I felt it important to keep the layout as close as possible to that of, say, Excel. The simplest solution would be to take the first few rows of the table, and put a form under each column, for the user to add their metadata.

This is a good start, about as recognisable and familiar as you’re going to get. There is one obvious problem though: this could extend well beyond the edge of the user’s screen, leading to an awkward navigating experience. For an app aimed at desktop users, horizontal scrolling, in any of its forms, would be problematic. So, in the spirit of the good ol’ webpage, let’s make this thing wrap. That is to say that when an element cannot fit on the screen, it moves to a new “line”. When doing this we’ll need some vertical spacing where this new line occurs, to make it clear that one column is separate from the one above it. We then need horizontal spacing to prevent the false impression of grouping created by the rows.

The data-card was born. At the time of writing it is utilised in OS Packager, pretty closely resembling the above sketch.

Data Packagist is another application that creates data packages, and it faces the same challenges as described above.
When I got involved in this project there was already a working prototype, and in this prototype I saw data cards beginning to emerge. It struck me that if these elements followed the same data-card pattern created for OS Packager, they could benefit in two significant ways. The layout and data preview would again allow the user to more easily recognise the columns from their spreadsheet; plus the grid layout would lend itself well to drag and drop, which would mean avoiding multiple clicks (of the arrows in the screenshot above) when reordering. I incorporated this pattern into the design.

Before building this new front-end, I extracted what I believe to be the essence of the data-card from the OS Packager code, to reuse in Data Packagist, and potentially future projects. While doing so I thought about the current and potential future uses, and the other functions it would be useful to perform at the same time as adding metadata. Many of these will be unique to each app, but there are a couple that I believe are likely to be recurring:
  • Reorder the columns
  • Remove / ignore a column
These features combine with those of the previous iteration to create this stand-alone data-card project. Time will tell how useful this code will be for future work, but as I was able to use it wholesale (changing little more than a colour variable) in the implementation of the Data Packagist front-end, it came at virtually no additional cost. More important than the code, however, is having this design pattern as a template, to solve this problem when it arises again in the future.

New research: Understanding the drivers of license proliferation

Danny Lämmerhirt - August 8, 2017 in Open Knowledge

Open licensing is still a major challenge for open data publication. In a recent blog post on the state of open licensing in 2017, Open Knowledge International identified that governments often decide to create custom licenses instead of using standard open licenses such as Creative Commons Attribution 4.0. This so-called license proliferation is problematic for a variety of reasons. Custom licenses necessitate that data users know all the legal arrangements of these licenses – a problem that standard licenses are intended to avoid by clearly and easily stating use rights. Custom licenses can also exacerbate legal compatibility issues across licenses, which makes it hard (or impossible) to combine and distribute data coming from different sources. Because of legal uncertainties and compatibility issues, license proliferation can have chilling effects on the reuse of data and in the worst case prevent data reuse entirely. When investigating this topic further we noticed a dearth of knowledge about the drivers of license proliferation: neither academia nor grey literature seem to give systematic answers, but there are some great first analyses, as well as explanations why license proliferation is bad. Why do governments create custom licenses? Who within government decides that standard licenses are not the best solution to make data and content legally open? How do governments organise the licensing process, and how can license recommendations be applied across government agencies?

Exploring the drivers of license proliferation

In order to address these questions Open Knowledge International started a research project into license proliferation. Using the findings of the Global Open Data Index (GODI) 2016/17 as a starting point, we first mapped out how many different licenses are used in a selection of 20 countries. This includes the following countries, which either rank high in GODI or where the Open Knowledge community is present: Taiwan, Australia, Great Britain, France, Finland, Canada, Norway, New Zealand, Brazil, Denmark, Colombia, Mexico, Japan, Argentina, Belgium, Germany, Netherlands, Greece, Nepal, Singapore. Now we want to explore how governments decide what kind of license to use for data publication. We intend to publish the results in a narrative report to help the open data community understand the licensing process better, to inform license stewardship, and to advocate for the use of standard licenses.

Get in touch!

We are planning to run interviews with government officials who are involved in licensing. Please don’t hesitate to get in touch with us by sending an email to research@okfn.org. Feedback from government officials working on licensing is much appreciated. Also do reach out if you have background knowledge about the licensing situation in the countries listed above, or if you have contacts in government. We hope to hear from you soon!

Why MyData 2017?

Open Knowledge Finland - August 2, 2017 in community, network, OK Finland, Open Knowledge

This is a guest post explaining the focus of the MyData conference in Tallinn and Helsinki later this month.

According to a famous writing tip, you should always start texts with ‘why?’. Here we are taking that tip, and we actually find many ways to answer the big Why. So, Why MyData 2017? Did you get your data after the MyData 2016 conference? No, you did not. There is lots of work to be done, and we need all the companies, governments, individuals and NGOs on board on Aug 31-Sep 1 in Tallinn and Helsinki. When else would you meet the other over 800 friends at once?

Because no. 1: The work did not stop after MyData 2016

The organizers Fing, Aalto University, Open Knowledge Finland, and Tallinn University have been working on the topic also after the conference. Fing continues their MesInfos project, started in 2012, which goes into its second phase in 2017: implementing the MyData approach in France with a long-term pilot involving big corporations, public actors, testers and a platform. Aalto University is the home base of human-centric personal data research in Finland. Many Helsinki-based researchers contribute their academic skills to the conference’s Academic workshops. Open Knowledge Finland, apart from giving the conference an organizational kick, also fosters a project researching MyData implementation in the Finnish public sector, of which we will hear in the conference too. Tallinn University, as the newest addition to the group of organizers, will host the conference day in Tallinn to set the base for and inspire MyData initiatives in Estonian companies, the public sector, and the academic domain. In addition to the obvious ones, multiple MyData-inspired companies continue the work on their own. Work continues for example in Alliance meetings, and in some cases, there are people working from the bottom up and acting as change makers in their organization.

MyData 2016 went extremely well: 95% of the feedback was positive, and the complaints were related to organizational issues like the positioning of the knives during lunch time. The total individual visitor count was 670 from 24 countries. All this was for what was, at the time, a niche conference, organized for the first time by a team mainly of part-time workers. The key to success was the people who came in offering their insights as presenters or their talents in customer care as volunteers. MyData 2017 is even more community-driven than the year before – again a big bunch of devoted presenters, and the volunteers have been working already since March in weekly meetings, talkoot.

Because no. 2: The Community did not stop existing – it started to grow

MyData gained momentum in 2016 – the MyData White paper is mentioned in a ‘Staff Working Document on the free flow of data and emerging issues of the European data economy’, on pages 24-25. The white paper is also now translated from Finnish to English and Portuguese. Internationally, multiple Local Hubs have been founded this year – of which you will hear more in the Global track of the conference – and a MyData Symposium was held in Japan earlier this year. The PIMS (Personal Information Management Systems) community, who met for the fourth time during the 2016 conference, has been requesting a more established community around the topic. “Building a global community and sharing ideas” is one goal of MyData 2017, and as a very concrete action, the conference organizing team and the PIMS community have agreed to merge their efforts under the umbrella name of MyData.
The MyData Global Network Founding Members are reviewing the Declaration of MyData Principles, to be presented during MyData 2017. The next round table meeting for the MyData Global Network will be held in Aarhus on November 23–24, 2017. Open Knowledge Estonia was founded after last year’s conference. Since MyData was nurtured into its current form inside the Open Knowledge movement, where Open Knowledge Finland still plays the biggest role, MyData people feel very close to other Open Knowledge chapters. See for yourself how nicely Rufus Pollock explains in this video from MyData 2016 how Open Data and MyData are related.

Because no. 3: Estonians are estonishing

“Why Tallinn then?” is a question we hear a lot. The closeness of the two cities, also sometimes jointly called Talsinki, makes the choice very natural to the Finns and Estonians, but it might seem weird looking from outside. Estonia holds the Presidency of the Council of the EU in the second part of 2017. In e-Estonia, home of the famous e-residency, MyData fits naturally into the pool of ideas to be tossed around during that period. Now, having the ‘Free movement of data’ as the fifth freedom within the European Union, in addition to goods, capital, services, and people, has been suggested by Estonians, and the MyData way of thinking is a crucial part of advancing that. Estonia and Finland co-operate in developing X-road, a data exchange layer for national information systems, between the two countries. In 2017, the Nordic Institute for Interoperability Solutions (NIIS) was founded to advance X-road in other countries as well. The Finnish population registry center and their digitalized services at esuomi.fi are the main partner of the conference in 2017. As small countries, Estonia and Finland are both very good places to test new ideas. Both in Helsinki and Tallinn, we now have ongoing ‘MyData Alliance’ meetups for companies and public organizations who want to advance MyData in their organizations. A goal of MyData in general, “we want to make Finland the Moomin Valley of personal data”, will be expanded to “we want to make Finland and Estonia the Moomin Valley of personal data”.

Open Data for Tax Justice design sprint: building a pilot database of public country-by-country reporting

Stephen Abbott Pugh - July 27, 2017 in Open Knowledge

Tax justice advocates, global campaigners and open data specialists came together this week from across the world to work with Open Knowledge International on the first stages of creating a pilot country-by-country reporting database. Such a database may make it possible to understand the activities of multinational corporations and to uncover potential tax avoidance schemes. This design sprint event was part of our Open Data for Tax Justice project to create a global network of people and organisations using open data to improve advocacy, journalism and public policy around tax justice, in line with our mission to empower civil society organisations to use open data to improve people’s lives. In this post my colleague Serah Rono and I share our experiences and learnings from the sprint.

What is country-by-country reporting?

Image: Financial Transparency Coalition

Country-by-country reporting (CBCR) is a transparency mechanism which requires multinational corporations to publish information about their economic activities in all of the countries where they operate. This includes information on the taxes they pay, the number of people they employ and the profits they report. Publishing this information can bring to light structures or techniques multinational corporations might be using to avoid paying tax in certain jurisdictions by shifting their profits or activities elsewhere.

In February 2017, Open Knowledge International published a white paper co-authored by Alex Cobham, Jonathan Gray and Richard Murphy which examined the prospects for creating a global public database on the tax contributions and economic activities of multinational companies as measured by CBCR. The authors found that such a public database was possible and concluded that a pilot database could be created by bringing together the best existing source of public CBCR information – disclosures made by European Union banking institutions in line with the Capital Requirements Directive IV (CRD IV) passed in 2013. The aim of our design sprint was to take the first steps towards the creation of this pilot database.

What did we achieve?

From left to right: Tim Davies (Open Data Services), Jonathan Gray (University of Bath/Public Data Lab), Tommaso Faccio (University of Nottingham/BEPS Monitoring Group), Oliver Pearce (Oxfam GB), Elena Gaita (Transparency International EU), Dorcas Mensah (University of Edinburgh/Tax Justice Network – Africa) and Serah Rono (Open Knowledge International). Photo: Stephen Abbott Pugh

A design sprint is intended to be a short and sharp process bringing together a multidisciplinary team in order to quickly prototype and iterate on a technical product.

On Monday 24th and Tuesday 25th July 2017, Open Knowledge International convened a team of tax justice, advocacy, research and open data experts at Friends House in London to work alongside developers and a developer advocate from our product team. This followed three days of pre-sprint planning and work on the part of our developers. All the outputs of this event are public on Google Drive, Github and hackmd.io. To learn from those who had experience of trying to find and understand CRD IV data, we heard expert presentations from George Turner of Tax Justice Network on the scale of international tax avoidance, Jason Braganza of Tax Justice Network – Africa and Financial Transparency Coalition on why developing countries need public CBCR (see report for more details) and Oliver Pearce of Oxfam Great Britain on the lessons learned from using CRD IV data for the Opening the vaults and Following the money reports. These were followed by a presentation from Adam Kariv and Vitor Baptista of Open Knowledge International on how they would be reusing open-source tech products developed for our OpenSpending and OpenTrials projects to help with Open Data for Tax Justice. Next we discussed the problems and challenges the attendees had experienced when trying to access or use public CBCR information before proposing solutions to these issues. This led into a conversation about the precise questions and hypotheses which attendees would like to be able to answer using either CRD IV data or public CBCR data more generally.

From left to right: Georgiana Bere (Open Knowledge International), Adam Kariv (Open Knowledge International), Vitor Baptista (Open Knowledge International).

As quickly as possible, the Open Knowledge International team wanted to give attendees the knowledge and tools they needed to be able to answer these questions. So our developers Georgiana Bere and Vitor Baptista demonstrated how anyone could take unstructured CRD IV information from tables published in the PDF version of banks’ annual reports and follow a process set out on the Github repo for the pilot database to contribute this data into a pipeline created by the Open Knowledge International team. Datapackage-pipelines is a framework – developed as part of the Frictionless Data toolchain – for defining data processing steps to generate self-describing Data Packages. Once attendees had contributed data into the pipeline via Github issues, Vitor demonstrated how to write queries against this data using Redash in order to get answers to the questions they had posed earlier in the day.

Storytelling with CRD IV data

Evidence-based, data-driven storytelling is an increasingly important mechanism used to inform and empower audiences, and encourage them to take action and push for positive change in the communities they live in. So our sprint focus on day two shifted to researching and drafting thematic stories using this data. Discussions around data quality are commonplace in working with open data. George Turner and Oliver Pearce noticed a recurring issue in the available data: the use of hyphens to denote both nil and unrecorded values. The two spent part of the day thinking about ways to highlight the issue and guidelines that can help overcome this challenge so as to avoid incorrect interpretations. Open data from a single source often has gaps, so combining it with data from additional sources often helps with verification and with building a stronger narrative around it. In light of this, Elena Gaita, Dorcas Mensah and Jason Braganza narrowed their focus to examine a single organisation to see whether or not this bank changed its policy towards using tax havens following a 2012 investigative exposé by a British newspaper. They achieved this by comparing data from the investigation with the bank’s 2014 CRD IV disclosures. In the coming days, they hope to publish a blogpost detailing their findings on the extent to which the new transparency requirements have changed the bank’s tax behaviour.
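To give a flavour of the kind of query attendees wrote, the sketch below aggregates reported profit by jurisdiction using pandas rather than Redash. The file and column names are hypothetical, since the pilot database schema was still taking shape during the sprint.

```python
# Hedged sketch: total reported profit per jurisdiction from a CSV of
# CRD IV disclosures. File and column names are hypothetical placeholders.
# Requires: pip install pandas
import pandas as pd

disclosures = pd.read_csv('crd_iv_disclosures.csv')

# Hyphens are used in the source tables for both nil and unrecorded values
# (a data quality issue noted during the sprint); coercing to numeric turns
# them into missing values rather than silently misreading them.
disclosures['profit_before_tax'] = pd.to_numeric(
    disclosures['profit_before_tax'], errors='coerce'
)

profit_by_jurisdiction = (
    disclosures.groupby('jurisdiction')['profit_before_tax']
    .sum()
    .sort_values()
)
print(profit_by_jurisdiction)
```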

Visual network showing relation between top 50 banks and financial institutions who comply with Capital Requirements Directive IV (CRD IV) and countries in which they report profits. Image: Public Data Lab

To complement these story ideas, we explored visualisation tools which could help draw insights and revelations from the assembled CRD IV data. Visualisations often help to draw attention to aspects of the data that would have otherwise gone unnoticed. Oliver Pearce and George Turner studied the exploratory visual network of CRD IV data for the EU’s top 50 banks created by our friends at Density Design and the Public Data Lab (see screengrab above) to learn where banks were recording most profits and losses. Pearce and Turner quickly realised that one bank in particular recorded losses in all but one of its jurisdictions. In just a few minutes, the finding from this visual network sparked their interest and encouraged them to ask more questions. Was the lone profit-recording jurisdiction a tax haven? How did other banks operating in the same jurisdiction fare on the profit/loss scale in the same period? We look forward to reading their findings as soon as they are published.

What happens next?

The Open Data for Tax Justice network team are now exploring opportunities for collaborations to collect and process all available CRD IV data via the pipeline and tools developed during our sprint. We are also examining options to resolve some of the data challenges experienced during the sprint, like the perceived lack of an established codelist of tax jurisdictions, and are searching for a standard exchange rate source which could be used across all recorded payments data. In light of the European Union Parliament’s recent vote in favour of requiring all large multinational corporations to publish public CBCR information as open data, we will be working with advocacy partners to join the ongoing discussion about the “common template” and “open data format” for future public CBCR disclosures which will be mandated by the EU. Having identified extractives industry data as another potential source of public CBCR to connect to our future database, we are also heartened to see the ongoing project between the Natural Resource Governance Institute and Publish What You Pay Canada, so we will liaise further with the team working on extracting data from these new disclosures. Please email contact@datafortaxjustice.net if you’d like to be added to the project mailing list or want to join the Open Data for Tax Justice network. You can also follow the #OD4TJ hashtag on Twitter for updates.

Thanks to our partners at Open Data for Development, Tax Justice Network, Financial Transparency Coalition and Public Data Lab for the funding and support which made this design sprint possible.

The state of open licensing in 2017

Danny Lämmerhirt - June 8, 2017 in Global Open Data Index, Open Definition, Open Government Data, Open Knowledge

This blog post is part of our Global Open Data Index (GODI) blog series. It first discusses what open licensing is and why it is crucial for opening up data. It then outlines the most urgent issues around open licensing as identified in the latest edition of the Global Open Data Index, and concludes with 10 recommendations for how open data advocates can unlock this data. The blog post was jointly written by Danny Lämmerhirt and Freyja van den Boom.

Open data must be reusable by anyone and users need the right to access and use data freely, for any purpose. But legal conditions often block the effective use of data. Whoever wants to use existing data needs to know whether they have the right to do so. Researchers cannot use others’ data if they are unsure whether they would be violating intellectual property rights. For example, a developer wanting to locate multinational companies in different countries and visualize their paid taxes can’t do so unless they can find out how this business information is licensed. Having clear and open licenses attached to the data, which allow for use with the least restrictions possible, is necessary to make this happen.

Yet, open licenses still have a long way to go. The Global Open Data Index (GODI) 2016/17 shows that only a small portion of government data can be used without legal restrictions. This blog post discusses the status of ‘legal’ openness. We start by explaining what open licenses are and discussing GODI’s most recent findings around open licensing. We conclude by offering policy- and decisionmakers practical recommendations to improve open licensing.

What is an open license?

As the Open Definition states, data is legally open “if the legal conditions under which data is provided allow for free use”. For a license to be an open license it must comply with the conditions set out under the Open Definition 2.1. These legal conditions include specific requirements on use, non-discrimination, redistribution, modification, and no charge.

Why do we need open licenses?

Data may fall under copyright protection. Copyright grants the author of an original work exclusive rights over that work. If you want to use a work under copyright protection you need to have permission. There are exceptions and limitations to copyright where permission is not needed: for example, when the data is in the ‘public domain’ it is not, or is no longer, protected by copyright, or when your use is permitted under an exception. Be aware that some countries also allow legal protection for databases, which limits what use can be made of the data and the database. It is important to check what the national requirements are, as they may differ.

Because some types of data (papers, images) can fall under the scope of copyright protection, we need data licensing. Data licensing helps solve problems in practice, including not knowing whether the data is indeed copyright protected and how to get permission. Governments should therefore clearly state whether their data is in the public domain or, when the data falls under the scope of copyright protection, what the license is:
  • When data is in the public domain, it is recommended to use the CC0 Public Domain Dedication for clarity.
  • When the data falls under the scope of copyright it is recommended to use an existing Open license such as CC-BY to improve interoperability.
Using Creative Commons or Open Data Commons licenses is best practice. Many governments already apply one of the Creative Commons licenses (see this wiki). Some governments have chosen, however, to write their own licenses or formulate ‘terms of use’ which grant use rights similar to widely acknowledged open licenses. This is problematic from the perspective of the user because of interoperability. The proliferation of ever more open government licenses has been criticized for a long time. By creating their own versions, governments may add unnecessary information for users, cause incompatibility and significantly reduce the reusability of data. Creative Commons licenses are designed to reduce these problems by clearly communicating use rights and by making the sharing and reuse of works possible.

The state of open licensing in 2017

Initial results from GODI 2016/17 show that only roughly 38 percent of the eligible datasets were openly licensed (this value may change slightly after the final publication on June 15). The other licenses impose many use restrictions, including limitations to non-commercial purposes and restrictions on reuse and/or modification of the data.

Where data is openly licensed, best practices are hardly ever followed

In the majority of cases, our research team found governments apply general terms of use instead of specific licenses for the data. Open government licenses and Creative Commons licenses were seldom used. As outlined above, this is problematic. Using customized licenses or terms of use may impose additional requirements such as:
  • Require specific attribution statements desired by the publisher
  • Add clauses that make it unclear how data can be reused and modified.
  • Adapt licenses to local legislation
Throughout our assessment, we encountered unnecessary or ambivalent clauses, which in turn may cause legal concerns, especially when people consider using data commercially. Sometimes we came across redundant clauses that cause more confusion than clarity. For example, clauses may forbid using data in an unlawful way (see also the discussion here). Standard open licenses are intended to reduce legal ambiguity and enable everyone to understand use rights. Yet many licenses and terms contain unclear clauses, or it is not obvious to what data they refer. This can, for instance, mean that governments restrict the use of substantial parts of a database (and only allow the use of insignificant parts of it). We recommend that governments give clear examples of which use cases are acceptable and which ones are not.

Licenses do not make clear enough to what data they apply

Data should include a link to the license, but this is not commonly done. For instance, in Mexico, we found out that procurement information available via Compranet, the procurement platform for the Federal Government, was openly licensed, but the website does not state this clearly. Mexico hosts the same procurement data on datos.gob.mx and applies an open license to this data. As a government official told us, the procurement data is therefore openly licensed, regardless of where it is hosted. But again this is not clear to the user, who may find this data on a different website. Therefore we recommend always accompanying the data with a link to the license. We also recommend having a license notice attached to, or ‘in’, the data, and keeping the links updated to avoid ‘link rot’.

The absence of links between data and legal terms makes an assessment of open licenses impossible

Users may need to consult legal texts and check whether the rights granted comply with the Open Definition. Problems arise if there is no clear explanation or translation available of what specific licenses entail for the end user. One problem is that users need to translate the text, and when the text is not in a machine-readable format they cannot use translation services. Our experience shows that this was a significant source of error in our assessment. If open data experts struggle to assess public domain status, this problem is even more pronounced for open data users. Assessing public domain status requires substantial knowledge of copyright – something the use of open licenses explicitly wants to avoid.

Copyright notices on websites can confuse users

In several cases, submitters and reviewers were unable to find any terms or conditions. In the absence of any other legal terms, submitters sometimes referred to copyright notices that they found in website footers. These copyright details, however, do not necessarily refer to the actual data. Often they are simply a standard copyright notice referring to the website.
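One way to act on the ‘license notice in the data’ recommendation above, for publishers who distribute data as Frictionless Data Packages (the format used in other projects covered on this blog), is the descriptor’s licenses property. A minimal sketch follows, with the dataset name and file path as placeholders.

```python
# Hedged sketch: attaching a machine-readable open license notice to a
# Data Package descriptor. The dataset name and file path are placeholders.
import json

descriptor = {
    "name": "example-government-dataset",
    "licenses": [
        {
            "name": "CC-BY-4.0",
            "title": "Creative Commons Attribution 4.0",
            "path": "https://creativecommons.org/licenses/by/4.0/",
        }
    ],
    "resources": [{"name": "data", "path": "data.csv"}],
}

# The descriptor travels with the data, so the license link stays attached
# even when the files are copied to another site.
with open('datapackage.json', 'w') as f:
    json.dump(descriptor, f, indent=2)
```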

Recommendations for data publishers

Based on our findings, we prepared 10 recommendations that policymakers and other government officials should take into account:
  1. Does the data and/or dataset fall under the scope of IP protection? Often government data does not fall under copyright protection and should not be presented as such. Governments should be aware and clear about the scope of intellectual property (IP) protection.
  2. Use standardized open licenses. Open licenses are easily understandable and should be the first choice. The Open Definition provides conformant licenses that are interoperable with one another.
  3. In some cases, governments might want to use a customized open government license. These should be as open as possible, with the least restrictions necessary, and compatible with standard licenses (see point 2). To guarantee a license is compatible, the best practice is to submit the license for approval under the Open Definition.
  4. Exactly pinpoint within the license what data it refers to and provide a timestamp when the data has been provided.
  5. Clearly publish open licensing details next to the data. The license should be clearly attached to the data and be both human- and machine-readable. It also helps to have a license notice ‘in’ the data, as in the sketch above.
  6. Maintain the links to licenses so that users can access license terms at all times.
  7. Highlight the license version and provide context on how data can be used.
  8. Whenever possible, avoid restrictive clauses that are not included in standard licenses.
  9. Re-evaluate the web design and avoid confusing and contradictory copyright notices in website footers, as well as disclaimers and terms of use.
  10. When government data is in the public domain by default, make clear to end users what that means for them.