
Making remote working work for you and your organisation

- March 19, 2020 in Open Knowledge, Open Knowledge Foundation

The coronavirus outbreak means that up to 20 per cent of the UK workforce could be off sick or self-isolating during the peak of an epidemic.

Millions of people may not be ill, but they will be following expert advice to stay away from their workplace to help prevent the spread of the virus.

There are clearly hundreds of roles where working from home simply isn’t possible, and questions are rightly being asked about ensuring people’s entitlement to sick pay.

But for a huge number of workers who are usually based in an office environment, remote working is a possibility – and is therefore likely to become the norm for millions.

With the economy in major trouble as evidenced by yesterday’s stock market falls, ensuring those who are fit and able can continue to work is important.

So employers should start today to prepare for efficient remote working as part of their coronavirus contingency planning.

Giant companies such as Twitter are already prepared. But this may be an entirely new concept for some firms.

The Open Knowledge Foundation, which I lead, has been successfully operating remote working for several years.

Our staff are based in their homes in countries across the world, including the UK, Portugal, Zimbabwe and Australia.

Remote working was new to me a year ago when I joined the organisation.

I had been based in the European Parliament for 20 years as an MEP for Scotland. I had a large office on the 13th floor of the Parliament in Brussels, with space for my staff, as well as an office in Strasbourg when we were based there. For most of my time as a politician, I also had an office in Fife where my team would deal with constituents’ queries.

Things couldn’t be more different today. I work from my home in Dunfermline, in front of my desktop computer, with two screens so that I can type on one and keep an eye on real-time alerts on another.

The most obvious advantage is being able to see more of my family. Being a politician meant a lot of time away from my husband and children, and I very much sympathise with MSPs such as Gail Ross and Aileen Campbell who have decided to stand down from Holyrood to see more of their loved ones. If we want our parliaments to reflect society, we need to address the existing barriers to public office.

Now in charge of a team spread around the world and using a number of technology tools to communicate with them, I have found remote working to be a revelation.

Why couldn’t I have used those tools in the European Parliament and even voted remotely?

In the same way that Gail Ross has questioned why there wasn’t a way for her to vote remotely from Wick, hundreds of miles from Edinburgh, the same question must be asked of the European Parliament.

But for companies now planning remote working, it is vital to adopt effective methods.

Access to reliable Wi-Fi is key, but effective communication is critical. Without physical interaction, a virtual space with video calling is essential.

It is important to see the person you are talking to when remote working, and to be able to interact as closely as you would face-to-face. This also avoids distraction and allows people to check in with each other.

We tend to do staff calls through our Slack channel, and our weekly all-staff call is through Google Hangouts.

All-staff calls – or all-hands calls, as we call them – are important if people are forced to work remotely. We do this once a week, but for some organisations morning calls will also become an essential part of the day.

Our monthly global network call is on an open source tool called Jitsi and I use Zoom for diary meetings.

If all else fails, we resort to Skype and WhatsApp.

In terms of how we share documents between the team, we use Google Drive. That means participants in conference calls can see and update an agenda and add action points in real-time, and make alterations or comments on documents such as letters which need to be checked by multiple people.

In the same way that our staff work and collaborate remotely, using technology to co-operate on a wider scale also goes to the heart of our vision for a future that is fair, free and open.

We live in a time when technological advances offer incredible opportunities for us all.

Open knowledge will lead to enlightened societies around the world, where everyone has access to key information and the ability to use it to understand and shape their lives; where powerful institutions are comprehensible and accountable; and where vital research information that can help us tackle challenges such as poverty and climate change is available to all.

Campaigning for this openness in society is what our day job entails.

But to achieve that we have first worked hard to bring our own people together using various technological options.

Different organisations will find different ways of making it work.

But what is important is to have a plan in place today.

This post was originally published by the Herald newspaper.

Frictionless Public Utility Data: A Pilot Study

- March 18, 2020 in Open Knowledge

This blog post describes a Frictionless Data Pilot with the Public Utility Data Liberation project. Pilot projects are part of the Frictionless Data for Reproducible Research project. Written by Zane Selvans, Christina Gosnell, and Lilly Winfree.

The Public Utility Data Liberation project, PUDL, aims to make US energy data easier to access and use. Much of this data, including information about the cost of electricity, how much fuel is being burned, power plant usage, and emissions, is poorly documented or is in difficult-to-use formats. Last year, PUDL joined forces with the Frictionless Data for Reproducible Research team as a Pilot project to release this public utility data. PUDL takes the original spreadsheets, CSV files, and databases and turns them into unified Frictionless tabular data packages that can be used to populate a database, or read in directly with Python, R, Microsoft Access, and many other tools.
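As a quick, hedged illustration of that last point (the descriptor path and resource name below are placeholders, not values taken from the PUDL release), one of these data packages can be read directly in Python with the `datapackage` library:

```python
from datapackage import Package

# Open a downloaded, unzipped data package via its descriptor file.
package = Package("pudl-data/datapackage.json")

# List the tabular resources (tables) bundled in the package.
print([resource.name for resource in package.resources])

# Read one table as keyed rows, with types applied from the package schema.
rows = package.get_resource("plants_ferc1").read(keyed=True)
print(rows[0])
```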

What is PUDL?

The PUDL project, which is coordinated by Catalyst Cooperative, is focused on creating an energy utility data product that can serve a wide range of users. PUDL was inspired to make this data more accessible because the current US utility data ecosystem is fragmented, and commercial products are expensive. There are hundreds of gigabytes of information available from government agencies, but they are often difficult to work with, and different sources can be hard to combine.

PUDL users include researchers, activists, journalists, and policy makers. They have a wide range of technical backgrounds, from grassroots organizers who might only feel comfortable with spreadsheets to PhDs with cloud computing resources, so it was important to provide data that would work for all users. Before PUDL, much of this data was freely available to download from various sources, but it was typically messy and not well documented. This led to a lack of uniformity and reproducibility amongst projects that were using this data: users were scraping the data together in their own way, making it hard to compare analyses or understand outcomes. Therefore, one of the goals for PUDL was to minimize these duplicated efforts and enable the creation of lasting, cumulative outputs.

What were the main Pilot goals?

The main focus of this Pilot was to create a way to openly share the utility data in a reproducible way that would be understandable to PUDL's many potential users. The first change Catalyst identified they wanted to make during the Pilot was to their data storage medium. PUDL was previously creating a PostgreSQL database as the main data output. However, many users, even those with technical experience, found setting up the separate database software a major hurdle that prevented them from accessing and using the processed data. They also desired a static, archivable, platform-independent format. Therefore, Catalyst decided to transition PUDL away from PostgreSQL and instead try Frictionless Tabular Data Packages.

They also wanted a way to share the processed data without needing to commit to long-term maintenance and curation, meaning they needed the outputs to continue being useful to users even if Catalyst only had minimal resources to dedicate to maintenance and updates. The team decided to package their data into Tabular Data Packages and identified Zenodo as a good option for openly hosting that packaged data. Catalyst also recognized that most users only want to download the outputs and use them directly, and do not care about reproducing the data processing pipeline themselves; it was still important, though, to publish the processing pipeline code to support transparency and reproducibility.

Therefore, in this Pilot, they focused on transitioning their existing ETL pipeline from outputting a PostgreSQL database, which was defined using SQLAlchemy, to outputting data packages which could then be archived publicly on Zenodo. Importantly, they needed this pipeline to maintain the metadata, data type information, and database structural information that had already been accumulated. This rich metadata needed to be stored alongside the data itself, so future users could understand where the data came from and understand its meaning. The Catalyst team used Tabular Data Packages to record and store this metadata (see the code here: https://github.com/catalyst-cooperative/pudl/blob/master/src/pudl/load/metadata.py).

Another complicating factor is that many of the PUDL datasets are fairly entangled with each other. The PUDL team ideally wanted users to be able to pick and choose which datasets they actually wanted to download and use without requiring them to download it all (currently about 100GB of data when uncompressed). However, they were worried that if single datasets were downloaded, the users might miss that some of the datasets were meant to be used together. So, the PUDL team created information, which they call "glue", that shows which datasets are linked together and should ideally be used in tandem.

The culmination of this Pilot was a release of the PUDL data (access it here: https://zenodo.org/record/3672068 and read the corresponding documentation here: https://catalystcoop-pudl.readthedocs.io/en/v0.3.2/), which includes integrated data from EIA Form 860, EIA Form 923, the EPA Continuous Emissions Monitoring System (CEMS), the EPA Integrated Planning Model (IPM), and FERC Form 1.

What problems were encountered during this Pilot?

One issue that the group encountered during the Pilot was that the data types available in Postgres are substantially richer than those available natively in the Tabular Data Package standard. However, this issue is an endemic problem of wanting to work with several different platforms, so the team compromised and worked with the least common denominator. In the future, PUDL might store several different sets of data types for use in different contexts: for example, one for freezing the data out into data packages, one for SQLite, and one for Pandas.

Another problem encountered during the Pilot resulted from testing the limits of the draft Tabular Data Package specifications. There were aspects of the specifications that the Catalyst team assumed were fully implemented in the reference (Python) implementation of the Frictionless toolset, but were in fact still works in progress. This experience led the Frictionless team to start a documentation improvement project, including a revision of the specifications website to incorporate this feedback.

Through the Pilot, the teams worked to implement new Frictionless features, including the specification of composite primary keys and foreign key references that point to external data packages. Other new Frictionless functionality created during this Pilot included partitioning of large resources into resource groups, in which all resources use identical table schemas, and gzip compression of resources. The Pilot also focused on implementing more complete validation through goodtables, including bytes/hash checks, foreign key checks, and primary key checks, though there is still more work to be done here.
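For readers unfamiliar with goodtables, a minimal sketch of that kind of validation looks like this in Python (the descriptor path is a placeholder; integrity details such as byte counts and hashes are declared in the descriptor itself):

```python
from goodtables import validate

# Validate every table in a data package against its schema, including
# structural checks on the underlying CSV files.
report = validate("datapackage.json")
print(report["valid"], report["error-count"])
```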

Future Directions

A common problem with using publicly available energy data is that the federal agencies creating the data do not use version control or maintain change logs for the data they publish, but they do frequently go back years after the fact to revise or alter previously published data, with no notification. To combat this problem, Catalyst is using data packages to encapsulate the raw inputs to the ETL process. They are setting up a process which will periodically check whether the federal agencies' posted data has been updated or changed, create an archive, and upload it to Zenodo (a rough sketch of this check appears at the end of this post). They will also store metadata in non-tabular data packages, indicating which information is stored in each file (year, state, month, etc.) so that there can be a uniform process of querying those raw input data packages. This will mean the raw inputs won't have to be archived alongside every data release; instead, one can simply refer to these other versioned archives of the inputs. Catalyst hopes these version-controlled raw archives will also be useful to other researchers.

Another next step for Catalyst will be to make the ETL and new dataset integration more modular, to hopefully make it easier for others to integrate new datasets. For instance, they are planning on integrating the EIA 861 and the ISO/RTO LMP data next. Other future plans include simplifying metadata storage, using Docker to containerize the ETL process for better reproducibility, and setting up a Pangeo instance for live interactive data access without requiring anyone to download any data at all.

The team would also like to build visualizations that sit on top of the database, making an interactive, regularly updated map of US coal plants and their operating costs, compared to new renewable energy in the same area. They would also like to visualize power plant operational attributes from EPA CEMS (e.g., ramp rates, min/max operating loads, relationship between load factor and heat rate, marginal additional fuel required for a startup event…).

Have you used PUDL? The team would love to hear feedback from users of the published data so that they can understand how to improve it, based on real user experiences. If you are integrating other US energy/electricity data of interest, please talk to the PUDL team about whether they might want to integrate it into PUDL to help ensure that it's all more standardized and can be maintained long term. Also let them know what other datasets you would find useful (e.g. FERC EQR, FERC 714, PHMSA Pipelines, MSHA mines…). If you have questions, please ask them on GitHub (https://github.com/catalyst-cooperative/pudl) so that the answers will be public for others to find as well.
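Coming back to the raw-archive plans mentioned above, a rough sketch of that kind of change detection could be as simple as comparing a hash of the currently posted file against the hash stored with the previous archive. The URL and digest below are placeholders, and the real process will presumably be more involved:

```python
import hashlib

import requests


def sha256_of(url: str) -> str:
    """Download a file and return its SHA-256 digest."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return hashlib.sha256(response.content).hexdigest()


# Digest recorded alongside the last archived raw data package (placeholder).
previous_digest = "0000000000000000000000000000000000000000000000000000000000000000"

# Illustrative agency URL, not a real PUDL input.
if sha256_of("https://example.gov/form-923/f923_2019.zip") != previous_digest:
    print("Upstream file changed silently: archive a new raw data package.")
```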

Tracking the Trade of Octopus (and Packaging the Data)

- March 13, 2020 in Frictionless Data, Open Knowledge

This blog is the second in a series by the Frictionless Data Fellows, discussing how they created Frictionless Data Packages with their research data. Learn more about the Fellows Programme here: http://fellows.frictionlessdata.io/. By Lily Zhao

Introduction

When I started graduate school, I was shocked to learn that seafood is actually the most internationally traded food commodity in the world. In fact, the global trade in fish is worth more than the trades of tea, coffee and sugar combined (Fisheries FAO, 2006). However, for many developing countries, being connected to the global seafood market can be a double-edged sword. It is true that global trade has the potential to redistribute some wealth and improve the livelihoods of fishers and traders in these countries. But it can also promote illegal trade and overfishing, which can harm the future sustainability of a local food source. Over the course of my master's degree, I developed a passion for studying these issues, which is why I am excited to share with you my experience turning some of the data my collaborators and I gathered into a packaged dataset using the Open Knowledge Foundation's Datapackage tool.

These data provide a snapshot into the global market for octopus and how it is traded throughout and between Kenya, Tanzania and Mozambique before heading to European markets. This research project was an international collaboration between the Stockholm Resilience Centre in Sweden, the National Institute for Medical Research of Tanzania, Pwani University in Kilifi, Kenya and the School of Marine and Environmental Affairs at the University of Washington. These data eventually became my master's thesis, and this data package will complement a forthcoming publication of our findings. Specifically, these data are the prices and quantities at which middlemen in Tanzania and Kenya reported buying and selling octopus.

These data are exciting because they not only inform our understanding of who is benefiting from the trade of octopus but also could assist in improving the market price of octopus in Tanzania. This is because value chain information can help Tanzania's octopus fishery along its path to Marine Stewardship Council seafood certification. Seafood that gets the Marine Stewardship Council label gains a certain amount of credibility, which in turn can increase profit. For developing countries, this seafood label can provide a monetary incentive for improving fisheries management. But before Tanzania's octopus fishery can get certified, they will need to prove they can trace the flow of their octopus supply chain and manage it sustainably. We hope that this packaged dataset will ultimately inform this effort.

Getting the data

To gather the data, my field partner Chris and I went to 10 different fishing communities like this one.

Middlemen buy and sell seafood in Mtwara, Tanzania.

We went on to interview all the major exporters of octopus in both Tanzania and Kenya and spoke with company agents and octopus traders who bought their octopus from 570 different fishermen. With these interviews, we were able to account for about 95% of East Africa's international octopus market share.

My research partner, Chris Cheupe, and I at an octopus collection point.

Creating the Data Package

The Datapackage tool was created by the Open Knowledge Foundation to compile data and metadata in a compact unit, making it easier and more efficient for others to access. You can create a data package using the online platform or using the Python or R programming libraries. I initially had some issues using the R package instead of the online tool, which may have been related to the fact that the original data file was not utf-8 encoded. But stay tuned! For now, I made my data package using the Data Package Creator online tool.

The tool helped me create a schema that outlines the data's structure, including a description of each column. It also helps you outline the metadata for the dataset as a whole, such as the license and author. Our dataset has a lot of complicated columns, and the tool gave me a streamlined way to describe each column via the schema. Afterwards, I added the metadata using the left-hand side of the browser tool and checked to make sure that the data package was valid!

The green bar at the top of the screenshot indicates validity

If the information you provide for each column does not match the data within the columns, the package will not validate and instead you will get an error like this:

The red bar at the top of the screenshot indicates invalidity
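For anyone who prefers scripting this workflow, here is a minimal sketch of the same steps using the Python datapackage library (the CSV name and metadata values are placeholders, not my actual files):

```python
from datapackage import Package

package = Package()
# Infer a schema (column names and types) from the raw CSV.
package.infer("octopus_trade_prices.csv")

# Add the dataset-level metadata the web tool collects in its left-hand panel.
package.descriptor["title"] = "East African octopus trade prices"
package.descriptor["licenses"] = [{"name": "CC-BY-4.0"}]
package.commit()  # apply the descriptor changes

print(package.valid)  # True once the descriptor and schema validate
package.save("datapackage.zip")  # bundle the descriptor and data together
```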

Check out my final data package by visiting my GitHub repository!

Reference:

Fisheries, FAO (2006). The State of World Fisheries and Aquaculture 2006. Rome: Food and Agriculture Organization of the United Nations.

Celebrating the tenth Open Data Day on Saturday 7th March 2020

- March 6, 2020 in Open Data Day, Open Data Day 2020, Open Knowledge

In Ghana, satellite and drone imagery is being used to track deforestation and water pollution in West Africa. In South Africa, the first map of minibus taxi routes in a township in Pretoria is being created. In the Philippines, a map is being designed to highlight HIV facilities and LGBT-friendly spaces, while a similar project is underway in Granada to assess the housing situation of migrant women. And in Mexico, construction projects are being analysed to check their impact on the local environment.

All these community-led projects, and many more like them, are improving lives for people in some of the world's most deprived areas. They are all linked by one thing: open data.

This Saturday is the tenth annual Open Data Day, which celebrates its transformational impact around the globe. Open data is data that can be freely accessed, used, modified and shared by anyone. It is the opposite of personal data, which must be kept private, and about which concerns have rightly been raised over its use by giant technology firms. Open data is altogether different: this is non-personal information, and it can and should be used for the public good. It is the building block of what is called 'open knowledge', which is what data can become if it is useful, usable and used. The key features of openness are availability and access, reuse and redistribution, and universal participation.

Open Data Day is an opportunity to show its benefits and encourage the adoption of open data policies in government, business and civil society. The Open Knowledge Foundation operates a mini-grants scheme for community projects every year, and in 2020 we are supporting 65 events taking place all over the world including in Argentina, Bolivia, Brazil, Cameroon, Colombia, Costa Rica, Germany, Ghana, Guatemala, Indonesia, Kenya, Malawi, Mexico, Nigeria, Somalia, South Africa, Tanzania, Togo and Venezuela.

With the climate crisis now an emergency, open data can help tackle deforestation and monitor air pollution levels on our streets. It is being used in places such as the Democratic Republic of the Congo to increase young people's knowledge of free local HIV-related services. In Nepal, streetlights data for Kathmandu has been collected by digital volunteers to influence policy for the maintenance of streetlights. The possibilities are endless. Open data can track the flow of public money, expanding budget transparency, examining tax data and raising issues around public finance management. And it can be used by communities to highlight pressing issues on a local, national or global level, such as progress towards the United Nations' Sustainable Development Goals.

I know that phrases like 'open data' and 'open knowledge' are not widely understood. With partners across the world, we are working to change that. This decade and the decades beyond are not to be feared. We live in a time when technological advances offer incredible opportunities for us all. This is a time to be hopeful about the future, and to inspire those who want to build a better society. Open knowledge will lead to enlightened societies around the world, where everyone has access to key information and the ability to use it to understand and shape their lives; where powerful institutions are comprehensible and accountable; and where vital research information that can help us tackle challenges such as poverty and climate change is available to all: a fair, free and open future.
• The tenth Open Data Day will take place on Saturday 7th March 2020 with celebrations happening all over the world. Find out more at opendataday.org, discover events taking place near you and follow the conversation online via the hashtags #OpenDataDay and #ODD2020.

Breaking up big tech isn’t enough. We need to break them open

- February 27, 2020 in Open Knowledge, Open Knowledge Foundation, personal-data

From advocates, politicians and technologists, calls for doing something about big tech grow louder by the day. Yet concrete ideas are few, or fail to reach the mainstream. This post covers what breaking up big tech would mean and why it's not enough. I propose an open intervention that will give people a real choice and a way out of controlled walled gardens. Google, Facebook, Amazon and Apple are not natural monopolies, and we need to regulate them to support competition and alternative business models.

What’s the problem?

As a social species, our social and digital infrastructure is of vital importance. Just think of the postal service that, even in the most extreme circumstances, would deliver letters to soldiers fighting on the front lines. There's a complicated and not-risk-free system that makes this work, and we make it work because it matters. It is so important for us to keep in touch with our loved ones, stay connected with news and what's happening in our communities, countries and the planet. Our ability to easily and instantly collaborate and work with people halfway across the world is one of the wonders of the Information Age. The data we collect can help us make better decisions about our environment, transport, healthcare, education, governance and planning. It should be used to support the flourishing of all people and our planet.

But right now, so much of this data, so much of our social digital infrastructure, is owned, designed and controlled by a tiny elite of companies, driven by profit. We're witnessing the unaccountable corporate capture of essential services, reliance on exploitative business models and the increasing dominance of big tech monopolies. Amazon, Facebook, Google, Apple and Microsoft use their amassed power to subvert the markets within which they operate, stifling competition and denying us real choice.

Amazon has put thousands of companies out of business, leaving them the option to sell on its controlled platform or not sell at all. Once just a digital bookstore, Amazon now controls over 49% of the US digital commerce market (and growing fast), selling everything from sex toys to cupcakes. Facebook (who, remember, also own Instagram and WhatsApp) dominates social media, isolating people who don't want to use its services. About a fifth of the population of the entire planet (1.6 billion people) log in daily. Facebook controls a vast honeypot of personal data, vulnerable to data breaches, influencing elections and enabling the spread of misinformation. And it's tough to imagine a digital industry Google doesn't operate in.

These companies are too big, too powerful and too unaccountable. We can't get them to change their behaviour. We can't even get them to pay their taxes. And it's way past time to do something about this.

Plans to break up monopolies

Several politicians are calling for breaking up big tech. In the USA, presidential candidate Elizabeth Warren wants two key interventions. One is to reverse some of the bigger controversial mergers and acquisitions which have happened over the last few years, such as Facebook with WhatsApp and Instagram, while going for a stricter interpretation and enforcement of anti-trust law.

The other intervention is even more interesting, and an acknowledgement of how much harm comes from monopolies which are themselves intermediaries between producers and consumers. Elizabeth Warren wants to pass "legislation that requires large tech platforms to be designated as 'Platform Utilities' and broken apart from any participant on that platform". This would mean that Amazon, Facebook or Google could not both be the platform provider and sell their own services and products through the platform.

The EU has also taken aim at such platform power abuse. Google was fined €2.4 billion by the European Commission for denying "consumers a genuine choice by using its search engine to unfairly steer them to its own shopping platform". Likewise, Amazon is currently under formal investigation for using its privileged access to platform data to put out competing products and outcompete other companies' products. Meanwhile, in India, a foreign-owned company like Amazon is already prohibited from being a vendor on its own electronic marketplace.

Breaking up big tech is not enough

While break up plans will go some way to address the unhealthy centralisation of data and power, the two biggest problems with big tech monopolies will remain:
  1. It won’t give us better privacy or change the surveillance business models used by tech platforms; and
  2. It won’t provide genuine choice or accountability, leaving essential digital services under the control of big tech.
The first point relates to the toxic and anti-competitive business models increasingly known as 'surveillance capitalism'. Smarter people than me have written about the dangers and dark patterns that emerge from this practice. When the commodity these companies profit from is your time and attention, these multi-billion companies are incentivised to hook you, manipulate you and keep dialling up the rampant consumerism which is destroying our planet. Our privacy and time are constantly exploited for profit. The break-ups Warren proposes won't change this.

The second point means it still wouldn't become easier for other companies to compete or to experiment with alternative business models. Right now, it's near impossible to compete with Facebook and Amazon since their dominance is built on 'network effects'. Both companies strictly police their user network and data. People aren't choosing these platforms because they are better; they default to them because that's where everyone else is. Connectivity and reach are vital for people to communicate, share, organise and sell, so there's no option but to go where most people already are. We're increasingly locked in. We need to make it possible for other providers and services to thrive.

Breaking big tech open

Facebook's numerous would-be competitors don't fail through not being good enough or failing to get traction, or even funding. Path was beautiful and had many advantages over Facebook. Privacy-preserving Diaspora got a huge amount of initial attention. Scuttlebutt has fantastic communities. Alternatives do exist. None of them have reduced the dominance of Facebook.

The problem is not a lack of alternatives; the problem is closed design, business model and network effects. What Facebook has, that no rival has, is all your friends. And where it keeps them is in a walled-off garden which Facebook controls. No one can interact with Facebook users without having a Facebook account and agreeing to Facebook's terms and conditions (aka surveillance and advertising). Essentially, Facebook owns my social graph and decides on what terms I can interact with my friends.

The same goes for other big social platforms: to talk to people on LinkedIn, I have to have a LinkedIn account; to follow people on Twitter, I must first sign up to Twitter; and so on. As users we take on the burden of maintaining numerous accounts, numerous passwords, sharing our data and content with all of these companies, on their terms.

It doesn't have to be this way. These monopolies are not natural; they are monopolies by design, choosing to run on closed protocols and walling off their users in silos. We need to regulate Facebook and others to force them to open up their application programming interfaces (APIs) to make it possible for users to have access to each other across platforms and services.

Technically, interoperability is possible

There are already examples of digital social systems which don't operate as walled gardens: email, for example. We don't expect Google to refuse to deliver an email simply because we use an alternative email provider. If I send an email to a Gmail account from my Protonmail, FastMail or even Hotmail account, it goes through. It just works. No message about how I first have to get a Gmail account. This, fundamentally, is the reason email has been so successful for so long. Email uses an open protocol, supported by Google, Microsoft and others (probably due to being early enough, coming about in the heady open days of the web, before data mining and advertising became the dominant forces they are today… although email is increasingly centralised and dominated by Google).

While email just works, a very similar technology, instant messaging, doesn't. We have no interoperability, which means many of us have upward of four different chat apps on our phones and have to remember which of our friends are on Twitter, Facebook, WhatsApp (owned by Facebook), Signal, Wire, Telegram, etc. We don't carry around five phones, so why do we maintain accounts with so many providers, each storing our personal details, each with a different account and password to remember? This is because these messaging apps use their own closed, proprietary protocols, and harm usability and accessibility in the process. This is not in the interests of most people.

Interoperability and the use of open protocols would transform this, offering us a better experience and control over our data while reducing our reliance on any one platform. Open protocols can form the basis of a shared digital infrastructure that's more resilient and would help us keep companies that provide digital services accountable. It would make it possible to leave, and to choose whose services we use.

What would this look like in practice?

Say I choose to use a privacy-preserving service for instant messaging, photo sharing and events — possibly one of the many currently available today, or even something I’ve built or host myself. I create an event and I want to invite all my friends, wherever they are. This is where the open protocol and interoperability come in. I have some friends using the same service as me, many more scattered across Twitter, Facebook and other social services, maybe a few just on email. If these services allow interconnections with other services, then every person, wherever they are, will get my event invite and be able to RSVP, receive updates and possibly even comment (depending on what functionality the platforms support). No more getting left out as the cost of caring about privacy. Interoperability would be transformational. It would mean that:
  1. I can choose to keep my photos and data where I have better access, security and portability. This gives us greater control over our data and means that…
  2. Surveillance is harder and more expensive to do. My data will not all be conveniently centralised for corporations or governments to use in unaccountable ways I haven’t agreed to. Privacy ❤
  3. I won’t lose contact with, leave out, or forget friends who aren’t on the same platform as me. I can choose services which serve my needs better, not based on the fear of social exclusion or missing out. Hooray for inclusion and staying friends!
  4. I’ll be less stressed trying to remember and contact people across different platforms with different passwords and accounts (e.g. this currently requires a Facebook event, email, tweets, WhatsApp group reminders and Mastodon, Diaspora and Scuttlebutt posts for siloed communities…)
  5. Alternative services, and their alternative business models and privacy policies become much more viable! Suddenly, a whole ecosystem of innovation and experimentation is possible which is out of reach for us today. (I’m not saying it will be easy. Finding sustainable funding and non-advertising-based business models will still be hard and will require more effort and systemic interventions, but this is a key ingredient).
Especially this last point, the viability of creating alternatives, would start shifting the power imbalance between Facebook and its users (and regulators), making Facebook more accountable and incentivising them to be responsive to user wants and needs. Right now Facebook acts as it pleases because it can — it knows its users are trapped. As soon as people have meaningful choice, exploitation and abuse become much harder and more expensive to maintain.

So, how do we get there?

In the first instance, the first milestone would be regulating Facebook, Twitter and others to make them open up their APIs so that other services can read/write to Facebook events, groups, messages etc. Yes, this isn't trivial and there are questions to work out, but it can be done.

Looking ahead, investing now in developing open standards for our social digital infrastructure is a must. Funders and governments should be supporting the work and adoption of open protocols and standards, working with open software and services to refine, test and use these standards and see how they work in practice over time. We'll need governance mechanisms for evolving and investing in our open digital infrastructure that include diverse stakeholders and account for power imbalances between them.

We use platforms which have not been co-designed by us, on terms and conditions we have little say over. Investment into alternatives has largely failed outside of more authoritarian countries that have banned or blocked the likes of Google and Facebook. We need to do more to ensure our data and essential services are not in the hands of one or two companies too big to keep accountable. And after many years of work and discussions on this, I believe openness and decentralisation must play a central role.

Redecentralize.org and friends are working on a campaign to figure out how to make this a reality. Is this something you're working on already, or do you want to contribute and get invited to future workshops and calls? Then ping me on hello@redecentralize.org. The opportunity is huge. By breaking big tech open, we can build a fairer digital future for all, so come get involved!

• This blogpost is a reposted version of a post originally published on the Redecentralize blog

Combating other people’s data

- February 18, 2020 in Frictionless Data, Open Knowledge

Frictionless Data Pipelines for Ocean Science

- February 10, 2020 in Frictionless Data, Open Knowledge

This blog post describes a Frictionless Data Pilot with the Biological and Chemical Oceanography Data Management Office (BCO-DMO). Pilot projects are part of the Frictionless Data for Reproducible Research project. Written by the BCO-DMO team members Adam Shepherd, Amber York, Danie Kinkade, and development by Conrad Schloer.

Scientific research is implicitly reliant upon the creation, management, analysis, synthesis, and interpretation of data. When properly stewarded, data hold great potential to demonstrate the reproducibility of scientific results and accelerate scientific discovery. The Biological and Chemical Oceanography Data Management Office (BCO-DMO) is a publicly accessible earth science data repository established by the National Science Foundation (NSF) for the curation of biological, chemical, and biogeochemical oceanographic data from research in coastal, marine, and laboratory environments. With the groundswell surrounding the FAIR data principles, BCO-DMO recognized an opportunity to improve its curation services to better support reproducibility of results, while increasing process efficiencies for incoming data submissions. In 2019, BCO-DMO worked with the Frictionless Data team at Open Knowledge Foundation to develop a web application called Laminar for creating Frictionless Data Package Pipelines that help data managers process data efficiently while recording the provenance of their activities to support reproducibility of results.
The mission of BCO-DMO is to provide investigators with data management services that span the full data lifecycle from data management planning, to data publication, and archiving.

BCO-DMO provides free access to oceanographic data through a web-based catalog with tools and features facilitating assessment of fitness for purpose. The result of this effort is a database containing over 9,000 datasets from a variety of oceanographic and limnological measurements including those from: in situ sampling, moorings, floats and gliders, sediment traps; laboratory and mesocosm experiments; satellite images; derived parameters and model output; and synthesis products from data integration efforts. The project has worked with over 2,600 data contributors representing over 1,000 funded projects.

As the catalog of data holdings continued to grow in both size and the variety of data types it curates, BCO-DMO needed to retool its data infrastructure with three goals. First, to improve the transportation of data to, from, and within BCO-DMO's ecosystem. Second, to support reproducibility of research by making all curation activities of the office completely transparent and traceable. Finally, to improve the efficiency and consistency across data management staff.

Until recently, data curation activities in the office were largely dependent on the individual capabilities of each data manager. While some of the staff were fluent in Python and other scripting languages, others were dependent on in-house custom developed tools. These in-house tools were extremely useful and flexible, but they were developed for an aging computing paradigm grounded in physical hardware accessing local data resources on disk. While locally stored data is still the convention at BCO-DMO, the distributed nature of the web coupled with the challenges of big data stretched this toolset beyond its original intention.
In 2015, we were introduced to the idea of data containerization and the Frictionless Data project in a Data Packages BoF at the Research Data Alliance conference in Paris, France. After evaluating the Frictionless Data specifications and tools, BCO-DMO developed a strategy to underpin its new data infrastructure on the ideas behind this project.
While the concept of data packaging is not new, the simplicity and extendibility of the Frictionless Data implementation made it easy to adopt within an existing infrastructure. BCO-DMO identified the Data Package Pipelines (DPP) project in the Frictionless Data toolset as key to achieving its data curation goals. DPP implements the philosophy of declarative workflows, which trade imperative code in a specific programming language that tells a computer how a task should be completed for declarative, structured statements that detail what should be done. These structured statements abstract the user writing the statements from the actual code executing them, and are useful for reproducibility over long periods of time where programming languages age and change or algorithms improve. This flexibility was appealing because it meant the intent of the data manager could be translated into many varying programming (and data) languages over time without having to refactor older workflows. In data management, that means that one of the languages a DPP workflow captures is provenance, a common need across oceanographic datasets for reproducibility. DPP workflows translated into records of provenance explicitly communicate to data submitters and future data users what BCO-DMO has done during the curation phase.

Secondly, because workflow steps need to be interpreted by computers into code that carries out the instructions, it helped data management staff converge on a declarative language they could all share. This convergence meant cohesiveness, consistency, and efficiency across the team if we could implement DPP in a way they could all use.

In 2018, BCO-DMO formed a partnership with Open Knowledge Foundation (OKF) to develop a web application that would help any BCO-DMO data manager use the declarative language they had developed in a consistent way. Why develop a web application for DPP? As the data management staff evaluated DPP and Frictionless Data, they found that there was a learning curve to setting up the DPP environment, and a deep understanding of the Frictionless Data 'Data Package' specification was required. The web application abstracted this required knowledge to achieve two main goals: 1) consistently structured Data Packages (datapackage.json) with all the required metadata employed at BCO-DMO, and 2) efficiencies of time by eliminating typos and syntax errors made by data managers. Thus, the partnership with OKF focused on making the needs of scientific research data a possibility within the Frictionless Data ecosystem of specs and tools.
Data Package Pipelines is implemented in Python and comes with some built-in processors that can be used in a workflow. BCO-DMO took its own declarative language and identified gaps in the built-in processors. For these gaps, BCO-DMO and OKF developed Python implementations for the missing declarations to support the curation of oceanographic data, and the result was a new set of processors made available on Github.
Some notable BCO-DMO processors are:
  • boolean_add_computed_field – Computes a new field to add to the data when a particular row satisfies a certain set of criteria. Example: where Cruise_ID = 'AT39-05' and Station = 6, set Latitude to 22.1645.
  • convert_date – Converts any number of fields containing date information into a single date field with display format and timezone options. Often date information is reported in multiple columns such as `year`, `month`, `day`, `hours_local_time`, `minutes_local_time`, `seconds_local_time`. For spatio-temporal datasets, it's important to know the UTC date and time of the recorded data to ensure that searches for data with a time range are accurate. Here, these columns are combined to form an ISO 8601-compliant UTC datetime value.
  • convert_to_decimal_degrees – Converts a single field containing coordinate information from degrees-minutes-seconds or degrees-decimal_minutes to decimal_degrees. The standard representation at BCO-DMO for spatial data conforms to the decimal degrees specification.
  • reorder_fields – Changes the order of columns within the data. This is a convention within the oceanographic data community to put certain columns at the beginning of tabular data to help contextualize the following columns. Examples of columns that are typically moved to the beginning are: dates, locations, instrument or vessel identifiers, and depth at collection.
The remaining processors used by BCO-DMO can be found at https://github.com/BCODMO/bcodmo_processors
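For readers curious what one of these custom processors looks like under the hood, here is a hedged, minimal sketch using Data Package Pipelines' higher-level processor API. The parameter names and the flagging criterion are illustrative only; the real implementations live in the bcodmo_processors repository linked above:

```python
# Minimal sketch of a custom DPP processor in the spirit of
# boolean_add_computed_field; illustrative, not the BCO-DMO code.
from datapackage_pipelines.wrapper import process


def modify_datapackage(datapackage, parameters, stats):
    # Declare the computed field on the target resource's schema.
    for resource in datapackage["resources"]:
        if resource["name"] == parameters["resource"]:
            resource["schema"]["fields"].append(
                {"name": parameters["field"], "type": "boolean"}
            )
    return datapackage


def process_row(row, row_index, spec, resource_index, parameters, stats):
    # Flag rows matching an example criterion (hypothetical column names).
    if spec["name"] == parameters["resource"]:
        row[parameters["field"]] = (
            row.get("Cruise_ID") == "AT39-05" and row.get("Station") == 6
        )
    return row


process(modify_datapackage=modify_datapackage, process_row=process_row)
```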

How can I use Laminar?

In our collaboration with OKF, BCO-DMO developed use cases based on real-world data submissions. One such example is a recent Arctic Nitrogen Fixation Rates dataset. The original dataset needed the following curation steps to make the data more interoperable and reusable:
  • Convert lat/lon to decimal degrees
  • Add timestamp (UTC) in ISO format
  • ‘Collection Depth’ with value “surface” should be changed to 0
  • Remove parenthesis and units from column names (field descriptions and units captured in metadata).
  • Remove spaces from column names
The web application, named Laminar, built on top of DPP, helps data managers at BCO-DMO perform these operations in a consistent way. First, Laminar prompts us to name and describe the current pipeline being developed, assumes that the data manager wants to load some data in to start the pipeline, and prompts for a source location. After providing a name and description of our DPP workflow, we provide a data source to load and give it the name 'nfix'. In subsequent pipeline steps, we refer to 'nfix' as the resource we want to transform.

For example, to convert the latitude and longitude into decimal degrees, we add a new step to the pipeline, select the 'Convert to decimal degrees' processor (a proxy for our custom processor convert_to_decimal_degrees), select the 'nfix' resource, select a field from that 'nfix' data source, and specify the Python regex pattern identifying where the values for the degrees, minutes and seconds can be found in each value of the latitude column. Similarly, in step 7 of this pipeline, we want to generate an ISO 8601-compliant UTC datetime value by combining the pre-existing 'Date' and 'Local Time' columns.

After the pipeline is completed, the interface displays all steps and lets the data manager execute the pipeline by clicking the green 'play' button at the bottom. This button generates the pipeline-spec.yaml file, executes the pipeline, and can display the resulting dataset. The resulting DPP workflow contained 223 lines across this 12-step operation; for a data manager, the web application reduces the chance of error that would come from writing such a pipeline by hand. Ultimately, our work with OKF helped us develop processors that follow the DPP conventions.
Our goal for the pilot project with OKF was to have BCO-DMO data managers using Laminar to process 80% of the data submissions we receive. The pilot was so successful that data managers have processed 95% of new data submissions to the repository using the application.
This is exciting from a data management processing perspective because the use of Laminar is more sustainable, and it brought the team together to determine best strategies for processing, documentation and more. This increase in consistency and efficiency is welcome from an administrative perspective and helps with the training of any new data managers coming to the team. The OKF team are excellent partners, who were the catalysts of a successful project. The next steps for BCO-DMO are to build on the success of the Frictionless Data Package Pipelines by implementing Frictionless Data's Goodtables data validation to help us develop submission guidelines for common data types. Special thanks to the OKF team – Lilly Winfree, Evgeny Karev, and Jo Barrett.

Frictionless Data Tool Fund update: Shelby Switzer and Greg Bloom, Open Referral

- January 15, 2020 in Data Package, Frictionless Data, Open Knowledge

This blogpost is part of a series showcasing projects developed during the 2019 Frictionless Data Tool Fund. The 2019 Frictionless Data Tool Fund provided four mini-grants of $5,000 to support individuals or organisations in developing an open tool for reproducible research built using the Frictionless Data specifications and software. This fund is part of the Frictionless Data for Reproducible Research project, which is funded by the Sloan Foundation. This project applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate reproducible data workflows in research contexts.

Open Referral creates standards for health, human, and social services data – the data found in community resource directories used to help find resources for people in need. In many organisations, this data lives in a multitude of formats, from handwritten notes to Excel files on a laptop to Microsoft SQL databases in the cloud. For community resource directories to be maximally useful to the public, this disparate data must be converted into an interoperable format. Many organisations have decided to use Open Referral's Human Services Data Specification (HSDS) as that format. However, to accurately represent this data, HSDS uses multiple linked tables, which can be challenging to work with. To make this process easier, Greg Bloom and Shelby Switzer from Open Referral decided to implement datapackage bundling of their CSV files through the Frictionless Data Tool Fund.

In order to accurately represent the relationships between organisations, the services they provide, and the locations where they are offered, the Human Services Data Specification (HSDS) makes sense of disparate data by linking multiple CSV files together with foreign keys. Open Referral used Frictionless Data's datapackage to specify the tables' contents and relationships in a single machine-readable file, so that this standardised format could transport HSDS-compliant data in a way that all of the teams who work with this data can use: CSVs of linked data.

In the Tool Fund, Open Referral worked on their HSDS Transformer tool, which enables a group or person to transform data into an HSDS-compliant data package, so that it can then be combined with other data or used in any number of applications. The HSDS Transformer is a Ruby library that can be used during the extract, transform, load (ETL) workflow of raw community resource data. This library extracts the community resource data, transforms that data into HSDS-compliant CSVs, and generates a datapackage.json that describes the data output. The Transformer can also output the datapackage as a zip file, called HSDS Zip, enabling systems to send and receive a single compressed file rather than multiple files.

The Transformer can be spun up in a Docker container, and once it's live, the API can deliver a payload that includes links to the source data and to the configuration file that maps the source data to HSDS fields. The Transformer then grabs the source data and uses the configuration file to transform the data and return a zip file of the HSDS-compliant datapackage.

Example of a demo app consuming the API generated from the HSDS Zip
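To make the linked-tables idea above concrete, here is a trimmed, hypothetical fragment of such a descriptor expressed in Python. The file, field and resource names are assumptions for illustration, not the actual HSDS table definitions:

```python
from datapackage import Package

# A data package with two linked tables: each service row points at its
# parent organization through a foreign key.
descriptor = {
    "name": "hsds-example",
    "resources": [
        {
            "name": "organizations",
            "path": "organizations.csv",
            "schema": {
                "fields": [
                    {"name": "id", "type": "string"},
                    {"name": "name", "type": "string"},
                ],
                "primaryKey": "id",
            },
        },
        {
            "name": "services",
            "path": "services.csv",
            "schema": {
                "fields": [
                    {"name": "id", "type": "string"},
                    {"name": "organization_id", "type": "string"},
                ],
                "foreignKeys": [
                    {
                        "fields": "organization_id",
                        "reference": {"resource": "organizations", "fields": "id"},
                    }
                ],
            },
        },
    ],
}

print(Package(descriptor).valid)  # descriptor-level validation
```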

The Open Referral team has also been working on projects related to the HSDS Transformer and HSDS Zip. For example, the HSDS Validator checks that a given datapackage of community service data is HSDS-compliant. Additionally, they have used these tools in the field with a project in Miami. For this project, the HSDS Transformer was used to transform data from a Microsoft SQL Server into an HSDS Zip. Then that zipped datapackage was used to populate a Human Services Data API with a generated developer portal and OpenAPI Specification.

Further, as part of this work, the team also contributed to the original source code for the datapackage-rb Ruby gem. They added a new feature to infer a datapackage.json schema from a given set of CSVs, so that you can generate the JSON file automatically from your dataset.

Greg and Shelby are eager for the Open Referral community to use these new tools and provide feedback. To use these tools currently, users should either be Ruby developers who can use the gem as part of another Ruby project, or be familiar enough with Docker and HTTP APIs to start a Docker container and make an HTTP request to it. You can use the HSDS Transformer as a Ruby gem in another project or as a standalone API. In the future, the project might expand to include hosting the HSDS Transformer as a cloud service that anyone can use to transform their data, eliminating many of these technical requirements.

Interested in using these new tools? Open Referral wants to hear your feedback. For example, would it be useful to develop an extract-transform-load API, hosted in the cloud, that enables recurring transformation of nonstandardised human service directory data sources into an HSDS-compliant datapackage? You can reach them via their GitHub repos.

Further reading: openreferral.org
Repository: https://github.com/openreferral/hsds-transformer
HSDS Transformer: https://openreferral.github.io/hsds-transformer/

Neuroscience Experiments System Frictionless Tool

- December 16, 2019 in Frictionless Data, Open Knowledge

This blog is part of a series showcasing projects developed during the 2019 Frictionless Data Tool Fund. The 2019 Frictionless Data Tool Fund provided four mini-grants of $5,000 to support individuals or organisations in developing an open tool for reproducible research built using the Frictionless Data specifications and software. This fund is part of the Frictionless Data for Reproducible Research project, which is funded by the Sloan Foundation. This project applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate reproducible data workflows in research contexts.


Neuroscience Experiments System Frictionless Data Incorporation, by the Technology Transfer team of the Research, Innovation and Dissemination Center for Neuromathematics.

The Research, Innovation and Dissemination Center for Neuromathematics (RIDC NeuroMat) is a research center established in 2013 by the São Paulo Research Foundation (FAPESP) at the University of São Paulo, in Brazil. A core mission of NeuroMat is the development of open-source computational tools to aid in scientific dissemination and advance open knowledge and open science. To this end, the team has created the Neuroscience Experiments System (NES), an open-source tool to assist neuroscience research laboratories in routine procedures for data collection. To more effectively understand the function and treatment of brain pathologies, NES aids in recording data and metadata from various experiments, including clinical data, electrophysiological data, and fundamental provenance information. NES then stores that data in a structured way, allowing researchers to seek and share data and metadata from those neuroscience experiments. For the 2019 Tool Fund, the NES team, particularly João Alexandre Peschanski, Cassiano dos Santos and Carlos Eduardo Ribas, proposed to adapt their existing export component to conform to the Frictionless Data specifications.

Public databases are seen as crucial by many members of the neuroscience community as a means of moving science forward. However, simply opening up data is not enough; it should be created in a way that can be easily shared and used. For example, data and metadata should be readable by both researchers and machines, yet they typically are not. When the NES team learned about Frictionless Data, they were interested in trying to implement the specifications to help make the data and metadata in NES machine readable. For them, the advantage of the Frictionless Data approach was being able to standardize data opening and sharing within the neuroscience community.

Before the Tool Fund, NES had an export component that set up a file with folders and documents with information on an entire experiment (including data collected from participants, device metadata, questionnaires, etc.), but they wanted to improve this export to be more structured and open. By implementing the Frictionless Data specifications, the resulting export component includes the Data Package (datapackage.json) and the folders/files inside the archive, with a root folder called data. With this new "frictionless" export component, researchers can transport and share their exported data with other researchers in a recognized open standard format (the Data Package), facilitating the understanding of that exported data. They have also implemented Goodtables into the unit tests to check data structure.

The RIDC NeuroMat team's expectation is that many researchers, particularly neuroscientists and experimentalists, will have an interest in using the freely available NES tool. With the anonymization of sensitive information, the data collected using NES can be made publicly available through the NeuroMat Open Database, allowing any researcher to reproduce the experiment or simply use the data in a different study. In addition to storing collected experimental data and being a tool for guiding and documenting all the steps involved in a neuroscience experiment, NES has an integration with the Neuroscience Experiment Database, another NeuroMat project, based on a REST API, where NES users can send their experiments to become publicly available for other researchers to reproduce them or to use as inspiration for further experiments.
Screenshot of the export of an experiment.

Screenshot of the export of data on participants.

Picture of a hypothetical export file tree of type Per Experiment after the Frictionless Data implementation.

Further reading:
Repository: https://github.com/neuromat/nes
User manual: https://nes.readthedocs.io/en/latest/
NeuroMat blog: https://neuromat.numec.prp.usp.br/
Post on NES at the NeuroMat blog: https://neuromat.numec.prp.usp.br/content/a-pathway-to-reproducible-science-the-neuroscience-experiments-system/

Announcing Frictionless Data Joint Stewardship

- December 12, 2019 in Frictionless Data, Open Knowledge

We are pleased to announce joint stewardship of Frictionless Data between the Open Knowledge Foundation and Datopian. While this collaboration already exists informally, we are solidifying how we are leading together on future Frictionless Data projects and goals.

What does this mean for users of Frictionless Data software and specifications?

First, you will continue to see a consistent level of activity and support from Open Knowledge Foundation, with a particular focus on the application of Frictionless Data for reproducible research, as part of our three-year project funded by the Sloan Foundation. This also includes specific contributions in the development of the Frictionless Data specifications under the leadership of Rufus Pollock, Datopian President and Frictionless Data creator, and Paul Walsh, Datopian CEO and long-time contributor to the specifications and software.

Second, there will be increased activity in software development around the specifications, with a larger team across both organisations contributing to key codebases such as Goodtables, the various integrations with backend storage systems such as Elasticsearch, BigQuery, and PostgreSQL, and data science tooling such as Pandas. Additionally, based on their CKAN commercial services work and co-stewardship of the CKAN project, Datopian look forward to providing more integrations of Frictionless Data with CKAN, building on existing work done at the Open Knowledge Foundation.

Our first joint project is redesigning the Frictionless Data website. Our goal is to make the project more understandable, usable, and user-focused. At this point, we are actively seeking user input, and are requesting interviews to help inform the new design. Have you used our website and are interested in having your opinion heard? Please get in touch to give us your ideas and feedback on the site. Focusing on user needs is a top goal for this project.

Ultimately, we are focused on leading the project openly and transparently, and are excited by the opportunities that clarification of the leadership of the project will provide. We want to emphasize that the Frictionless Data project is community focused, meaning that we really value the input and participation of our community of users. We encourage you to reach out to us on Discuss, in Gitter, or open issues in GitHub with your ideas or problems.