
Lessons learned from organising the first ever virtual csv,conf

- June 17, 2020 in #CSVconf, Events, Open Knowledge, Open Knowledge Foundation

This blogpost was collaboratively written by the csv,conf organising team which includes Lilly Winfree and Jo Barratt from the Open Knowledge Foundation. csv,conf is supported by the Sloan Foundation as part of our Frictionless Data for Reproducible Research grant.

A brief history

csv,conf is a community conference that brings diverse groups together to discuss data topics, and features stories about data sharing and data analysis from science, journalism, government, and open source. Over the years we have had over a hundred different talks from a huge range of speakers, most of which you can still watch back on our YouTube Channel.

csv,conf,v1 took place in Berlin in 2014 and we were there again for v2 in 2016, before we moved across the Atlantic for v3 and v4, which were held in Portland, Oregon in the United States in 2017 and 2019. For csv,conf,v5, we were looking forward to our first conference in Washington DC, but unfortunately, like many other in-person events, this was not going to be possible in 2020. People have asked us about our experience moving from a planned in-person event to one online, in a very short space of time, so we are sharing our story with the hope that it will be helpful to others as we move into a world where online events and conferences are going to be more prevalent than ever.

The decision to take the conference online was not an easy one. Until quite late on, the question csv,conf organisers kept asking each other was not “how will we run the conference virtually?” but “will we need to cancel?”. As the pandemic intensified, this decision was taken out of our hands and it quickly became clear that cancelling our event in Washington D.C. was not only the responsible thing to do, but the only thing we could do.

Weighing the decision to hold csv,conf,v5 online

Once it was clear that we would not hold an in-person event, we deliberated on whether we would hold an online event, postpone, or cancel.

Moving online – The challenge

One of our main concerns was whether we would be able to encapsulate everything good about csv,conf in a virtual setting – the warmth you feel when you walk into the room, the interesting side conversations, the feeling of being reunited with old friends and of naturally meeting new ones. We didn’t know whether we could pull that off. And if we couldn’t, did we want to do this at all?

We were conscious of our commitment to speakers, who had made a commitment themselves, but at the same time we were worried that speakers might not be interested in delivering something virtually, or that it would not have the same appeal. It was important to us that there was value for the speakers, and from the start of this process we were committed to making that happen. Many of us have experience running events both in person and online, but this was bigger. We had some great advice and drew heavily on the experience of others in similar positions to us, but it still felt different. We were starting from scratch, and for all of our preparation, right up to the moment we pressed ‘go live’ inside Crowdcast, we simply didn’t know whether it was going to work. What we found was that hard work, lots of planning and the support of the community made it work. There were so many great things about the format that surprised and delighted us. We now find ourselves asking whether an online format is in fact a better fit for our community, and exploring what a hybrid conference might look like in the future.

Moving online – The opportunity

There were a great many reasons to embrace a virtual conference. Once we made the decision and started to plan, this became ever clearer. Not least was the fact that an online conference would give many more people the opportunity to attend. We work hard every year to reduce the barriers to attendance where possible and we’re grateful to our supporters here, but our ability to support conference speakers financially is limited, and that support is probably our biggest cost year-on-year. We are conscious that barriers to entry still apply to a virtual conference, but they are different, and it is clear that for csv,conf,v5 more of the people who wanted to join could be part of it. csv,conf is normally attended by around 250 people; the in-person conferences usually fill up to just a few attendees under capacity, and that feels the right size for our community. But this year we had over 1,000 registrations. More new people could attend, and there were also more returning faces.


Attendees joined csv,conf,v5’s opening session from around the world

Planning an online conference

Despite the obvious differences, much about organising a conference remains the same whether it is virtual or not. Indeed, by the time we made the shift to an online conference, much of this work had already been done.

Organising team

From about September 2019, the organising team met every few weeks on a virtual call. We reviewed our task list and assigned actions, and we used a private channel on Slack for core organisers to keep updated during the week.

We had a good mix of skills and interests on the organising team, from community wranglers to writers and social media aces. We would like to give a shout out to the team of local volunteers we had on board to help with DC-specific things; in the end this knowledge just wasn’t needed for the virtual conf. We recruited a group of people from the organising team to act as the programme committee. This group was responsible for running the call for proposals (CFP) and selecting the talks. We relied on our committed team of organisers for the conference, and we found it helpful to have very clear roles and responsibilities to help manage the different aspects of the ‘live’ conference. We had a host who introduced speakers, a Q&A/chat monitor, a technical helper and a Safety Officer/Code of Conduct enforcer at all times. It was also helpful to have “floaters” who were unassigned to a specific task, but could help with urgent needs.

Selecting talks

We were keen on making it easy for people to complete the call for proposals. We set up a Google form and asked just a few simple questions. All talks were independently reviewed and scored by members of the committee, and we had a final meeting to review our scores and come up with a final list. We were true to the scoring system, but there were other things to consider. Some speakers had submitted several talks, and we had decided that even if several talks by the same person scored highly, only one could go into the final schedule. We value diversity of speakers, so we reached out to diverse communities to advertise the call for proposals and also considered diversity when selecting talks. Also, where talks scored equally, we wanted to ensure we were giving priority to speakers who were new to the conference. We asked all speakers to post their slides onto the csv,conf Zenodo repository. This was really nice to have because attendees asked multiple times for links to slides, so we could simply send them to the Zenodo collection. Though it proved not to be relevant for the 2020 virtual event, it’s worth mentioning that the process of granting travel or accommodation support to speakers was entirely separate from the selection criteria. Although we asked people to flag a request for support, this did not factor into the decision-making process.

Creating a schedule

Before we could decide on a schedule, we needed to decide on the hours and timezones we would hold the conference. csv,conf is usually a two-day event with three concurrently run sessions, and we eventually decided to have the virtual event remain two days, but have one main talk session with limited concurrent talks.

Since the in-person conference was supposed to take place in Washington, D.C., many of our speakers were in US timezones, so we focused on timezones that would work best for those speakers. We also wanted to ensure that our conference organisers would be awake during the conference. We started at 10am Eastern, which was early for the West Coast (7am) and afternoon for European attendees (3pm UK; 5pm Eastern Europe). We decided on seven hours of programming each day, meaning the conference ended in late afternoon for US attendees and late evening for Europe. Unfortunately, these timezones did not work for everyone (notably the Asia-Pacific region), and we recommend that you pick timezones that work for your speakers and your conference organisers, while stretching the hours as far as possible if equal accessibility is important to you. We also found it was important to clearly list the conference times in multiple timezones on our schedule so that it was easier for attendees to know what time the talks were happening.

Tickets and registration

Although most of what makes csv,conf successful is human passion and attention (and time!), we also found that the costs involved in running a virtual conference are minimal. Except for some extra costs for upgrading our communication platforms, and making funds available to support speakers in getting online, running the conference remotely saved us several thousand dollars.

We have always used an honour system for ticket pricing. We ask people to pay what they can afford, with some suggested amounts depending on the attendee’s situation. But we needed to make some subtle changes for the online event, as it was a different proposition. We first made it clear that tickets were free, and refunded those who had already purchased tickets. Eventbrite is the platform we have always used for registering attendees for the conference, and it does the job: it’s easy to use and straightforward. We kept it running this year for consistency and to keep our data organised, even though it involved importing the data into another platform. We were able to make the conference donation-based thanks to the support of the Sloan Foundation and individual contributors and donations. Perhaps because overall registrations went up, donations went up too. In future – and with more planning and promotion – it would be feasible to consider a virtual event of the scale of csv,conf funded entirely by contributions from the community it serves.

Code of Conduct

We spent significant time enhancing our Code of Conduct for the virtual conference. We took in feedback from last year’s conference and reviewed other organisations’ Code of Conduct. The main changes were to consider how a Code of Conduct needed to relate to the specifics of something happening online. We also wanted to create more transparency in the enforcement and decision-making processes.

One new aspect was the ability to report incidents via Slack. We designated two event organisers as “Safety Officers”, and they were responsible for responding to any incident reports and were available for direct messaging via Slack (see the Code of Conduct for full details). We also provided a neutral party to receive incident reports if there were any conflicts of interest.

Communication via Slack

We used Slack for communication during the conference, and received positive feedback about this choice. We added everyone that registered to the Slack channel to ensure that everyone would receive important messages.

We had a Slack session bot, built with the Google Calendar for Team Events app, that would announce the beginning of each session with a link to join, and it attracted a lot of positive feedback. For people not on Slack, we also published the schedule in a Google spreadsheet and on the website, and everyone that registered with an email address received the talk links via email too.

Another popular Slack channel created for this conference was a dedicated Q&A channel, allowing speakers to interact with session attendees, provide more context around their talks, link to resources and chat about possible collaborations. At the end of each talk, one organiser would copy all of the questions and post them into this Q&A channel so that the conversations could continue. We received a lot of positive feedback about this, and it was pleasing to see the conversations continue.

We also had a dedicated speakers channel, where speakers could ask questions and offer mutual support and encouragement both before and during the event. Another important channel was a backchannel for organisers, which we used mainly to coordinate and cheer each other on during the conf, and to ask for technical help behind the scenes to ensure everything ran as smoothly as possible. After talks, one organiser would use Slack private messaging to collate and send positive feedback for speakers, as articulated by attendees during the session. This was absolutely worth it and we were really pleased to see the effort was appreciated.

Slack is of course free, but its premium service offers upgrades for charities and we were lucky enough to make use of this. The application process is very easy and takes less than 10 minutes, so it is worth considering.

We also made good use of Twitter, with active #commallama and #csvconf hashtags running throughout the event. The organisers had joint responsibility for this and it seemed to work: we simply announced the hashtags at the beginning of each day and people picked them up easily. We had a philosophy of ‘over-communicating’ – offering updates as soon as we had them, and candidly – and we used Twitter to share updates and calls-to-action, and to amplify people’s thoughts, questions and feedback.
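We didn’t need custom code for the session announcements – the Google Calendar for Team Events app handled that – but if you wanted to roll your own announcer, it only takes a few lines. The sketch below is hypothetical rather than anything we ran: the webhook URL, schedule file and column names are placeholders you would replace with your own.

```python
import csv
from datetime import datetime, timezone

import requests

# Hypothetical values: replace with your own Slack incoming-webhook URL and schedule file.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"
SCHEDULE_CSV = "schedule.csv"  # assumed columns: start_utc, title, speaker, session_url
                               # start_utc in ISO 8601 with offset, e.g. 2020-05-13T14:00:00+00:00


def announce(title, speaker, session_url):
    """Post a session announcement to a Slack channel via an incoming webhook."""
    text = f":tv: Starting now: *{title}* by {speaker}\nJoin here: {session_url}"
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)


def announce_due_sessions():
    """Announce every session whose start time passed in the last minute (run from cron)."""
    now = datetime.now(timezone.utc)
    with open(SCHEDULE_CSV, newline="") as f:
        for row in csv.DictReader(f):
            start = datetime.fromisoformat(row["start_utc"])
            if 0 <= (now - start).total_seconds() < 60:
                announce(row["title"], row["speaker"], row["session_url"])


if __name__ == "__main__":
    announce_due_sessions()
```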

Picking a video conference platform

Zoom concerns

One of the biggest decisions we had to make was picking a video conferencing platform for the conference. We originally considered using Zoom, but were concerned about a few things. The first was reports of rampant “zoombombing”, where trolls join Zoom meetings with the intent to disrupt them. The second concern was that we are a small team of organisers and there would be great overhead in moderating a Zoom room with hundreds of attendees – muting, unmuting, etc. We also worried that a giant Zoom room would feel very impersonal. Many of us now spend what is probably an unnecessary amount of our daily lives on Zoom, and we felt that stepping away from it would help mark the occasion as something special. So we made the decision to move away from Zoom and looked at options that were more of a broadcast tool than a meeting tool.

Crowdcast benefits

We saw another virtual conference that used Crowdcast and were impressed with how it felt to participate, so we started to investigate it as a platform and, despite some reservations, committed to it enthusiastically.

The best part of Crowdcast for us was the friendly user interface, which includes a speaker video screen, a dedicated chat section with a prompt bar reading “say something nice”, and a separate box for questions. It felt really intuitive, the features were well considered and useful, and we incorporated most of them. From the speaker, participant and host side, the experience felt good and appropriate; the consideration given to the different user types was clear in the design, and appreciated. One great function was the green room, which is akin to a speakers’ couch backstage at an in-person conference, helping to calm speakers’ nerves, check their audio and visual settings, discuss cues, etc. before stepping out onto the stage. Another benefit of Crowdcast is that the talks are immediately available for viewing, complete with chat messages, for people to revisit after the conference. This was great as it allowed people who missed something on the day to catch up in almost real time and to feel part of the conference discussions as they developed. We also released all talk videos on YouTube and tweeted the links to each talk.

Crowdcast challenges

But Crowdcast was not without its limitations. Everything went very well, and the following issues were not deal breakers, but acknowledging them can help future organisers plan and manage expectations.

Top of the list of concerns was our complete inexperience with it and the likely inexperience of our speakers. To ensure that our speakers were comfortable using Crowdcast, we held many practice sessions with them before the conference, and also ran an attendee AMA beforehand to get attendees acquainted with the platform. These sessions were vital for us to practice all together, and the time and effort absolutely paid off! If there is one piece of advice you should take away from reading this guide it is this: practice, practice, practice – and give others the opportunity and space to practice as well.

One challenge we faced was hosting – only one account has host privileges, but we learned that many people can log into that account at the same time to share those privileges. Hosts can allow other people to share their screen and unmute, and they can also elevate questions from the chat to the questions box. They can also kick people out if they are being disruptive (which didn’t happen for us, but we wanted to be prepared). This felt a bit weird, honestly, and we had to be careful to be aware of the power we had when in the host’s position. Weird, but also incredibly useful, and a key control feature which was essential for an event run by a group rather than an individual.

With Crowdcast, you can only share four screens at a time (so that would be two people sharing two screens). Our usual setup was a host, with one speaker sharing their screen at a time. We could add a speaker for the talks that had only a single other speaker, but any more than this and we would have had problems. It was easy enough for the host to chop and change who was on screen at any time, and there’s no limit on the total number of speakers in a session. So there is some flexibility and, ultimately, we were OK. But this should be a big consideration if you are running an event with different forms of presentation.

Crowdcast was also not without its technical hiccups and frustrations. Speakers sometimes fell off the call or had mysterious problems sharing their screens. We received multiple comments/questions on the day about the video lagging or buffering. We often had to resort to the ol’ refresh refresh refresh approach which, to be fair, mostly worked. And on the few occasions we were stumped, there’s quite a lot of support available online and directly from Crowdcast. But honestly, there were very few technical issues for a two-day online conference.

Some attendees wanted info on the speakers (e.g. name, Twitter handle) during the presentation, and we agree it would have been a nice touch to have a button or link in Crowdcast for this. There is the “call to action” feature, but we were using that to link to the code of conduct.

Crowdcast was new to us, and new to many people in the conference community. As well as the practice sessions, we found it helpful to set up an FAQ page with content about how to use Crowdcast and what to expect from an online conference in general. Overall, it was a good decision and a platform we would recommend for consideration.

#Commallama

Finally, it would not be csv,conf if it had not been for the #commallama. The comma llama first joined us for csv,conf,v3 in Portland and joined us again for csv,conf,v4. The experience of being around a llama is both relaxing and energising at the same time, and a good way to get people mixing.

Taking the llama online was something we had to do, and we were very pleased with how it worked. It was amazing to see how much joy people got out of the experience, and also interesting to notice how well people naturally adapted to the online environment. People naturally organised into a virtual queue and took turns coming on to the screen to screengrab a selfie. Thanks to our friends at Mtn Peaks Therapy Llamas & Alpacas for being so accommodating and helping us to make this possible.

A big thank you to our community and supporters

As we reflect on the experience this year, one thing is very clear to us: the conference was only possible because of the community who spoke, attended and supported us. It was a success because the community showed up, was kind and welcoming, and was extremely generous with their knowledge, ideas and time. The local people in D.C. who stepped up to offer knowledge and support on the ground were a great example of this, and we are incredibly grateful for that support, even though it turned out not to be needed.

We were lucky to have a community of developers, journalists, scientists and civic activists who intrinsically know how to interact and support one another online, and who adapted well to the realities of an online conference. From the moment speakers attended our practice sessions on the platform and started to support one another, we knew that things were going to work out. We knew things would not all run to plan, but we trusted that the community would be understanding and actively support us in solving problems. It’s something we are grateful for. We were also thankful to the Alfred P. Sloan Foundation and our 100+ individual supporters for making the decision to support us financially. It is worth noting that none of this would have been possible without the venue, hotel and caterers we had contracted being very understanding and letting us void our contracts without any penalties.

Looking ahead – the future of csv,conf

Many people have been asking us about the future of csv,conf. Firstly, it’s clear that csv,conf,v5 has given us renewed love for the conference and made the need for a conference like this in the world abundantly clear to us. It’s also probably the case that the momentum generated by running the conference this year will secure enthusiasm amongst organisers for putting something together next year.

So the question will be: “what should a future csv,conf look like?”. We will certainly be considering our experience of running this year’s event online. It was such a success that there is an argument for keeping it online going forward, or putting together something of a hybrid. Time will tell. We hope that this has been useful for others. If you are organising an event and have suggestions or further questions that could improve this resource, please let us know. Our Slack remains open and is the best place to get in touch with us.

• The original version of this blogpost was published on csvconf.com and republished here with kind permission.

Mapping HIV facilities and LGBT-friendly spaces in the Philippines: Open Data Day 2020 report

- June 8, 2020 in Open Data Day, Open Data Day 2020, Open Knowledge, philippines

On Saturday 7th March 2020, the tenth Open Data Day took place with people around the world organising over 300 events to celebrate, promote and spread the use of open data. Thanks to generous support from key funders, the Open Knowledge Foundation was able to support the running of more than 60 of these events via our mini-grants scheme. This blogpost is a report by Mikko Tamura from MapBeks in the Philippines who received funding from the Foreign and Commonwealth Office to organise a mapping party to highlight HIV facilities and LGBT-friendly spaces on OpenStreetMap.

The first Open Data Day celebration of Pilipinas Chubs X Chasers (PCC) was held at the Fahrenheit Club on 7th March 2020. The event started at 9pm and 16 participants joined the datathon to improve the largest open HIV database in the country. PCC, the largest online community of LGBTQIA+ chubs, chasers and bears in the Philippines, dedicated its celebration of Open Data Day to teaching fellow community members (bears, chubs, chasers, supporters) the importance of emancipating the data and making it more accessible to everyone who needs it. Mr Papu Torres, chairperson of PCC, and Mr Mikko Tamura, lead advocate of MapBeks, spearheaded the event, as they believe that members of their growing community can make a bigger impact by encouraging people to contribute simply by researching and validating information.

Currently, MapBeks, with the full support of PCC, has been developing maps and databases highlighting and representing the LGBTQ community. You may contribute to their working HIV database and maps here:

HIV Facilities email/website working database: https://docs.google.com/spreadsheets/d/1jdlDJw3eue0e6YK9mKgYfaTw5962e2H1g_tWZQscbcE/edit?usp=sharing

HIV Facilities map: http://tinyurl.com/mapbekshivmap

PCC’s objective for the Open Data Day event was to collect and complete as much information as possible on the websites and email addresses of all HIV facilities in the country and to contribute the information to OpenStreetMap for everyone to access.

Figure 1: Papu Torres, chairperson of Pilipinas Chubs X Chasers, teaching how to contribute and research websites and emails for the open database of HIV facilities.

According to MapBeks, there are a total of 659 identified HIV facilities in the country, but the data varies from one organisation to another. It is part of the group’s advocacy to make such information more accessible, downloadable and usable, especially for people living with HIV (PLHIV).
The SlumBEAR Datathon was able to contribute 250 email addresses and 383 websites to the database. This constitutes more than 20% of the needed information that will be added to the geotagged locations of clinics, hospitals and health centres in the country.

Lastly, Mr Leonard Kodie Macayan III, Mr Fahrenheit 2020 and our country’s representative to the Mr Gay World competition, visited the crowd and showed support for our open data movement. Mr Fahrenheit is an annual male pageant, exclusive only to gay men and bisexuals. It is the first and the longest running of its kind, having been set up in 2003. Its mission is to support advocacy around mental health and to tackle issues like clinical depression, suicidal tendencies and the stigma of HIV/AIDS.

We would especially like to thank the Open Knowledge Foundation for its support of and trust in our small community, and Papu Torres, John Mojica and Ryan Sotto for making things possible. The Pilipinas Chubs X Chasers community will continue to support activities such as this in the future, as it deems them necessary and empowering for smaller communities to be part of something bigger. See you next Open Data Day! For more information and partnerships, contact Mikko Tamura at mikko.tamura@gmail.com.

New opinion poll – UK contact-tracing app must take account of human rights

- May 4, 2020 in COVID-19, News, Open Knowledge, Open Knowledge Foundation

A new opinion poll has revealed that an overwhelming majority of Brits want any coronavirus contact-tracing app to take account of civil liberties and people’s privacy. The Survation poll for the Open Knowledge Foundation comes ahead of today’s evidence session at Parliament’s Joint Committee on Human Rights on the human rights implications of COVID-19 tracing apps.

The poll has found widespread support for the introduction of a contact-tracing app in the UK at 65 per cent, but 90 per cent of respondents said it is important that any app takes account of civil liberties and protects people’s privacy. A total of 49 per cent of people in the poll of over 1,000 people in the UK said this was ‘very important’.

An NHS contact-tracing app designed to alert users when they have come into contact with someone who has coronavirus symptoms and should seek a COVID-19 test will be trialled on the Isle of Wight this week. Human rights campaigners have raised questions about how the data will be processed, who will own the information, and how long it will be kept for. The UK is understood to be working towards a centralised model, but this approach has been abandoned in Germany due to privacy concerns. Other countries, including Ireland, are using a decentralised model, where information is only held on individual smartphones, not a server. Today, a series of experts will be giving oral evidence to the Joint Committee on Human Rights, including the UK Information Commissioner.

Catherine Stihler, chief executive of the Open Knowledge Foundation, said: “Technology will rightly play a key role in the global response to the coronavirus pandemic, and there is clear support in the UK for a contact-tracing app in the UK.

“But what is even clearer is that people want the app to take account of civil liberties and ensure that people’s privacy is protected.

“We must not lose sight of ethical responsibilities in the rush to develop these tools.

“It is vital to balance the needs of individuals and the benefit to society, ensuring that human rights are protected to secure public trust and confidence in the system.”

Poll results

Opinion poll conducted by Survation on behalf of the Open Knowledge Foundation. Fieldwork conducted 27-28th April 2020, all residents aged 18+ living in UK, sample size 1,006 respondents.

Q) Smartphone software called ‘contact-tracing’ is being developed to alert users when someone they were recently close to becomes infected with COVID-19. Contact-tracing apps log every instance a person is close to another smartphone-owner for a significant period of time. It has not been announced how your data will be processed, who will own the information, and how long it will be kept for. To what extent do you support or oppose the introduction of a contact-tracing app in the UK during the coronavirus pandemic?

Strongly support: 28%
Somewhat support: 37%
Neither support nor oppose: 18%
Somewhat oppose: 6%
Strongly oppose: 6%
Don’t know: 4%

Q) How important is it to you that any contact-tracing app in the UK takes account of civil liberties and protects people’s privacy?

Very important: 49%
Quite important: 29%
Somewhat important: 13%
Not so important: 5%
Not at all important: 1%
Don’t know: 4%

Making remote working work for you and your organisation

- March 19, 2020 in Open Knowledge, Open Knowledge Foundation

The coronavirus outbreak means that up to 20 per cent of the UK workforce could be off sick or self-isolating during the peak of an epidemic.

Millions of people may not be ill, but they will be following expert advice to stay away from their workplace to help prevent the spread of the virus.

There are clearly hundreds of roles where working from home simply isn’t possible, and questions are rightly being asked about ensuring people’s entitlement to sick pay.

But for a huge number of workers who are usually based in an office environment, remote working is a possibility – and is therefore likely to become the norm for millions.

With the economy in major trouble as evidenced by yesterday’s stock market falls, ensuring those who are fit and able can continue to work is important.

So employers should start today to prepare for efficient remote working as part of their coronavirus contingency planning.

Giant companies such as Twitter are already prepared. But this may be an entirely new concept for some firms.

The Open Knowledge Foundation which I lead has been successfully operating remote working for several years.

Our staff are based in their homes in countries across the world, including the UK, Portugal, Zimbabwe and Australia.

Remote working was new to me a year ago when I joined the organisation.

I had been based in the European Parliament for 20 years as an MEP for Scotland. I had a large office on the 13th floor of the Parliament in Brussels, with space for my staff, as well as an office in Strasbourg when we were based there. For most of my time as a politician, I also had an office in Fife where my team would deal with constituents’ queries.

Things couldn’t be more different today. I work from my home in Dunfermline, in front of my desktop computer, with two screens so that I can type on one and keep an eye on real-time alerts on another.

The most obvious advantage is being able to see more of my family. Being a politician meant a lot of time away from my husband and children, and I very much sympathise with MSPs such as Gail Ross and Aileen Campbell who have decided to stand down from Holyrood to see more of their loved ones. If we want our parliaments to reflect society, we need to address the existing barriers to public office.

Now in charge of a team spread around the world, using a number of technology tools to communicate with them, remote working has been a revelation for me.

Why couldn’t I have used those tools in the European Parliament and even voted remotely?

In the same way that Gail Ross has questioned why there wasn’t a way for her to vote remotely from Wick, hundreds of miles from Edinburgh, the same question must be asked of the European Parliament.

But for companies now planning remote working, it is vital to adopt effective methods.

Access to reliable Wi-Fi is key, but effective communication is critical. Without physical interaction, a virtual space with video calling is essential.

It is important to see the person when working remotely and to be able to interact as closely as you would face-to-face. This also avoids distraction and allows people to check in with each other.

We tend to do staff calls through our Slack channel and our weekly all-staff call is through Google Hangouts.

All-staff calls – or all-hands call as we call them – are important if people are forced to work remotely. We do this once a week, but for some organisations morning calls will also become an essential part of the day.

Our monthly global network call is on an open source tool called Jitsi and I use Zoom for diary meetings.

If all else fails, we resort to Skype and WhatsApp.

In terms of how we share documents between the team, we use Google Drive. That means participants in conference calls can see and update an agenda and add action points in real-time, and make alterations or comments on documents such as letters which need to be checked by multiple people.

In the same way that our staff work and collaborate remotely, using technology to co-operate on a wider scale also goes to the heart of our vision for a future that is fair, free and open.

We live in a time when technological advances offer incredible opportunities for us all.

Open knowledge will lead to enlightened societies around the world, where everyone has access to key information and the ability to use it to understand and shape their lives; where powerful institutions are comprehensible and accountable; and where vital research information that can help us tackle challenges such as poverty and climate change is available to all.

Campaigning for this openness in society is what our day job entails.

But to achieve that we have first worked hard to bring our own people together using various technological options.

Different organisations will find different ways of making it work.

But what is important is to have a plan in place today.

This post was originally published by the Herald newspaper

Frictionless Public Utility Data: A Pilot Study

- March 18, 2020 in Open Knowledge

This blog post describes a Frictionless Data Pilot with the Public Utility Data Liberation project. Pilot projects are part of the Frictionless Data for Reproducible Research project. Written by Zane Selvans, Christina Gosnell, and Lilly Winfree.

The Public Utility Data Liberation project, PUDL, aims to make US energy data easier to access and use. Much of this data, including information about the cost of electricity, how much fuel is being burned, powerplant usage, and emissions, is not well documented or is in difficult-to-use formats. Last year, PUDL joined forces with the Frictionless Data for Reproducible Research team as a Pilot project to release this public utility data. PUDL takes the original spreadsheets, CSV files, and databases and turns them into unified Frictionless tabular data packages that can be used to populate a database, or read in directly with Python, R, Microsoft Access, and many other tools.
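As a rough illustration of the "read in directly with Python" route, here is a minimal sketch using the datapackage-py library. The descriptor path and resource name are placeholders for whichever data package you have downloaded, not the actual PUDL file layout.

```python
from datapackage import Package

# Load a downloaded Tabular Data Package by pointing at its descriptor file.
# "datapackage.json" and the resource name below are placeholders.
pkg = Package("datapackage.json")
print([resource.name for resource in pkg.resources])  # list the tables in the package

# Read one table as a list of dicts, with values cast to the types declared in its schema.
rows = pkg.get_resource("example_table").read(keyed=True)
print(rows[0])
```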

What is PUDL?

The PUDL project, which is coordinated by Catalyst Cooperative, is focused on creating an energy utility data product that can serve a wide range of users. PUDL was inspired to make this data more accessible because the current US utility data ecosystem is fragmented, and commercial products are expensive. There are hundreds of gigabytes of information available from government agencies, but they are often difficult to work with, and different sources can be hard to combine. PUDL users include researchers, activists, journalists, and policy makers. They have a wide range of technical backgrounds, from grassroots organizers who might only feel comfortable with spreadsheets to PhDs with cloud computing resources, so it was important to provide data that would work for all users.

Before PUDL, much of this data was freely available to download from various sources, but it was typically messy and not well documented. This led to a lack of uniformity and reproducibility amongst projects that were using this data. Users were scraping the data together in their own ways, making it hard to compare analyses or understand outcomes. Therefore, one of the goals for PUDL was to minimize these duplicated efforts and enable the creation of lasting, cumulative outputs.

What were the main Pilot goals?

The main focus of this Pilot was to create a way to openly share the utility data in a reproducible way that would be understandable to PUDL’s many potential users.

The first change Catalyst identified during the Pilot was to the data storage medium. PUDL was previously creating a PostgreSQL database as the main data output. However, many users, even those with technical experience, found setting up the separate database software a major hurdle that prevented them from accessing and using the processed data. They also desired a static, archivable, platform-independent format. Therefore, Catalyst decided to transition PUDL away from PostgreSQL and instead try Frictionless Tabular Data Packages. They also wanted a way to share the processed data without needing to commit to long-term maintenance and curation, meaning they needed the outputs to continue being useful to users even if they only had minimal resources to dedicate to maintenance and updates. The team decided to package their data into Tabular Data Packages and identified Zenodo as a good option for openly hosting that packaged data.

Catalyst also recognized that most users only want to download the outputs and use them directly, and do not care about reproducing the data processing pipeline themselves, but it was still important to provide the processing pipeline code publicly to support transparency and reproducibility. Therefore, in this Pilot, they focused on transitioning their existing ETL pipeline from outputting a PostgreSQL database, defined using SQLAlchemy, to outputting data packages which could then be archived publicly on Zenodo. Importantly, they needed this pipeline to maintain the metadata, data type information, and database structural information that had already been accumulated. This rich metadata needed to be stored alongside the data itself, so future users could understand where the data came from and understand its meaning. The Catalyst team used Tabular Data Packages to record and store this metadata (see the code here: https://github.com/catalyst-cooperative/pudl/blob/master/src/pudl/load/metadata.py).

Another complicating factor is that many of the PUDL datasets are fairly entangled with each other. The PUDL team ideally wanted users to be able to pick and choose which datasets they actually wanted to download and use, without requiring them to download it all (currently about 100GB of data when uncompressed). However, they were worried that if single datasets were downloaded, users might miss that some of the datasets were meant to be used together. So, the PUDL team created information, which they call “glue”, that shows which datasets are linked together and should ideally be used in tandem.

The culmination of this Pilot was a release of the PUDL data (access it here – https://zenodo.org/record/3672068 – and read the corresponding documentation here – https://catalystcoop-pudl.readthedocs.io/en/v0.3.2/), which includes integrated data from the EIA Form 860, EIA Form 923, the EPA Continuous Emissions Monitoring System (CEMS), the EPA Integrated Planning Model (IPM), and FERC Form 1.
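To give a flavour of what this packaging step looks like in code, here is a much-simplified sketch using the datapackage-py library (the Python reference implementation of the Frictionless specs). It is not PUDL’s actual pipeline code; the file paths and metadata values are purely illustrative.

```python
from datapackage import Package

# Infer resources and column types from a directory of processed CSV outputs.
# "outputs/*.csv" and the metadata values below are illustrative, not PUDL's real layout.
package = Package()
package.infer("outputs/*.csv")

# Attach package-level metadata alongside the inferred table schemas.
package.descriptor["name"] = "example-utility-data"
package.descriptor["title"] = "Example packaged utility data"
package.descriptor["licenses"] = [{
    "name": "CC-BY-4.0",
    "path": "https://creativecommons.org/licenses/by/4.0/",
    "title": "Creative Commons Attribution 4.0",
}]
package.commit()  # apply the descriptor changes

# Bundle the data and datapackage.json together, ready for archiving (e.g. on Zenodo).
package.save("example-utility-data.zip")
```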

What problems were encountered during this Pilot?

One issue that the group encountered during the Pilot was that the data types available in Postgres are substantially richer than those natively in the Tabular Data Package standard. However, this issue is an endemic problem of wanting to work with several different platforms, and so the team compromised and worked with the least common denominator.  In the future, PUDL might store several different sets of data types for use in different contexts, for example, one for freezing the data out into data packages, one for SQLite, and one for Pandas.  Another problem encountered during the Pilot resulted from testing the limits of the draft Tabular Data Package specifications. There were aspects of the specifications that the Catalyst team assumed were fully implemented in the reference (Python) implementation of the Frictionless toolset, but were in fact still works in progress. This work led the Frictionless team to start a documentation improvement project, including a revision of the specifications website to incorporate this feedback.  Through the pilot, the teams worked to implement new Frictionless features, including the specification of composite primary keys and foreign key references that point to external data packages. Other new Frictionless functionality that was created with this Pilot included partitioning of large resources into resource groups in which all resources use identical table schemas, and adding gzip compression of resources. The Pilot also focused on implementing more complete validation through goodtables, including bytes/hash checks, foreign keys checks, and primary keys checks, though there is still more work to be done here.
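For readers who have not used goodtables, the kind of validation described above can be run in a few lines of Python. This is a generic sketch rather than PUDL’s actual test suite, and it assumes a datapackage.json in the working directory:

```python
from goodtables import validate

# Validate every tabular resource in the package against its declared schema.
# "datapackage.json" is assumed to be in the current directory.
report = validate("datapackage.json", preset="datapackage", row_limit=10000)

print("valid:", report["valid"])
for table in report["tables"]:
    for error in table.get("errors", []):
        print(table.get("source"), error["code"], error["message"])
```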

Future Directions

A common problem with using publicly available energy data is that the federal agencies creating the data do not use version control or maintain change logs for the data they publish, but they do frequently go back years after the fact to revise or alter previously published data — with no notification. To combat this problem, Catalyst is using data packages to encapsulate the raw inputs to the ETL process. They are setting up a process which will periodically check to see if the federal agencies’ posted data has been updated or changed, create an archive, and upload it to Zenodo. They will also store metadata in non-tabular data packages, indicating which information is stored in each file (year, state, month, etc.) so that there can be a uniform process of querying those raw input data packages. This will mean the raw inputs won’t have to be archived alongside every data release. Instead, one can simply refer to these other versioned archives of the inputs. Catalyst hopes these version-controlled raw archives will also be useful to other researchers.

Another next step for Catalyst will be to make the ETL and new dataset integration more modular, to hopefully make it easier for others to integrate new datasets. For instance, they are planning on integrating the EIA 861 and the ISO/RTO LMP data next. Other future plans include simplifying metadata storage, using Docker to containerize the ETL process for better reproducibility, and setting up a Pangeo instance for live interactive data access without requiring anyone to download any data at all. The team would also like to build visualizations that sit on top of the database, making an interactive, regularly updated map of US coal plants and their operating costs, compared to new renewable energy in the same area. They would also like to visualize power plant operational attributes from EPA CEMS (e.g. ramp rates, min/max operating loads, relationship between load factor and heat rate, marginal additional fuel required for a startup event…).

Have you used PUDL? The team would love to hear feedback from users of the published data so that they can understand how to improve it, based on real user experiences. If you are integrating other US energy/electricity data of interest, please talk to the PUDL team about whether they might want to integrate it into PUDL to help ensure that it’s all more standardized and can be maintained long term. Also let them know what other datasets you would find useful (e.g. FERC EQR, FERC 714, PHMSA Pipelines, MSHA mines…). If you have questions, please ask them on GitHub (https://github.com/catalyst-cooperative/pudl) so that the answers will be public for others to find as well.

Tracking the Trade of Octopus (and Packaging the Data)

- March 13, 2020 in Frictionless Data, Open Knowledge

This blog is the second in a series done by the Frictionless Data Fellows, discussing how they created Frictionless Data Packages with their research data. Learn more about the Fellows Programme here http://fellows.frictionlessdata.io/. By Lily Zhao

Introduction

When I started graduate school, I was shocked to learn that seafood is actually the most internationally traded food commodity in the world. In fact, the global trade in fish is worth more than the trades of tea, coffee and sugar combined (Fisheries FAO, 2006). However, for many developing countries being connected to the global seafood market can be a double-edged sword. It is true that global trade has the potential to redistribute some wealth and improve the livelihoods of fishers and traders in these countries. But it can also promote illegal trade and overfishing, which can harm the future sustainability of a local food source. Over the course of my master’s degree, I developed a passion for studying these issues, which is why I am excited to share with you my experience turning some of the data my collaborators and I collected into a packaged dataset using the Open Knowledge Foundation’s Datapackage tool.

These data provide a snapshot into the global market for octopus and how it is traded throughout and between Kenya, Tanzania and Mozambique before heading to European markets. This research project was an international collaboration between the Stockholm Resilience Centre in Sweden, the National Institute for Medical Research of Tanzania, Pwani University in Kilifi, Kenya, and the School of Marine and Environmental Affairs at the University of Washington. These data eventually became my master’s thesis, and this data package will complement a forthcoming publication of our findings. Specifically, these data are the prices and quantities at which middlemen in Tanzania and Kenya reported buying and selling octopus.

These data are exciting because they not only inform our understanding of who is benefiting from the trade of octopus but could also assist in improving the market price of octopus in Tanzania. This is because value chain information can help Tanzania’s octopus fishery along its path to Marine Stewardship Council seafood certification. Seafood that gets the Marine Stewardship Council label gains a certain amount of credibility, which in turn can increase profit. For developing countries, this seafood label can provide a monetary incentive for improving fisheries management. But before Tanzania’s octopus fishery can get certified, they will need to prove they can trace the flow of their octopus supply chain and manage it sustainably. We hope that this packaged dataset will ultimately inform this effort.

Getting the data

To gather the data, my field partner Chris and I went to 10 different fishing communities like this one.

Middlemen buy and sell seafood in Mtwara, Tanzania.

We went on to interview all the major exporters of octopus in both Tanzania and Kenya and spoke with company agents and octopus traders who bought their octopus from 570 different fishermen. With these interviews, we were able to account for about 95% of East Africa’s international octopus market share.

My research partner- Chris Cheupe, and I at an octopus collection point.

Creating the Data Package

The datapackage tool was created by the Open Knowledge Foundation to compile our data and metadata in a compact unit, making it easier and more efficient for others to access. You can create the data package using the online platform or using the Python or R programming libraries. I had some issues using the R package instead of the online tool initially, which may have been related to the fact that the original data file was not utf-8 encoded. But stay tuned! For now, I made my data package using the Data Package Creator online tool. The tool helped me create a schema that outlines the data’s structure, including a description of each column. The tool also helps you outline the metadata for the dataset as a whole, including information like the license and author. Our dataset has a lot of complicated columns and the tool gave me a streamlined way to describe each column via the schema. Afterwards, I added the metadata using the left-hand side of the browser tool and checked to make sure that the data package was valid!

The green bar at the top of the screenshot indicates validity

If the information you provide for each column does not match the data within the columns, the package will not validate and instead you will get an error like this:

The red bar at the top of the screenshot indicates invalidity
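For anyone who prefers the programmatic route over the browser tool, the same schema-plus-metadata structure can be built with the datapackage Python library. This is only an illustrative sketch: the field names, descriptions and file name below are hypothetical stand-ins, not the columns of our actual dataset.

```python
from datapackage import Package

# Hypothetical field names and descriptions, standing in for the real survey columns.
schema = {
    "fields": [
        {"name": "country", "type": "string", "description": "Country where the trader was interviewed"},
        {"name": "buy_price_per_kg", "type": "number", "description": "Reported buying price of octopus"},
        {"name": "sell_price_per_kg", "type": "number", "description": "Reported selling price of octopus"},
        {"name": "quantity_kg", "type": "number", "description": "Quantity of octopus traded"},
    ]
}

# Package-level metadata (name, title, license) plus one tabular resource using the schema above.
package = Package({
    "name": "octopus-trade-example",
    "title": "Example octopus trade data package",
    "licenses": [{"name": "CC-BY-4.0",
                  "path": "https://creativecommons.org/licenses/by/4.0/",
                  "title": "Creative Commons Attribution 4.0"}],
})
package.add_resource({
    "name": "octopus-prices",
    "path": "octopus_prices.csv",  # hypothetical data file
    "schema": schema,
})
package.commit()

print(package.valid)              # True when the descriptor matches the Data Package spec
package.save("datapackage.json")  # write the descriptor next to the data file
```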

Check out my final datapackage by visiting my GitHub repository!

Reference:

Fisheries, F. A. O. (2006). The state of world fisheries and aquaculture 2006.

Celebrating the tenth Open Data Day on Saturday 7th March 2020

- March 6, 2020 in Open Data Day, Open Data Day 2020, Open Knowledge

Open Data Day 2020

In Ghana, satellite and drone imagery is being used to track deforestation and water pollution in West Africa. In South Africa, the first map of minibus taxi routes in a township in Pretoria is being created. In the Philippines, a map is being designed to highlight HIV facilities and LGBT-friendly spaces, while a similar project is underway in Granada to assess the housing situation of migrant women. And in Mexico, construction projects are being analysed to check their impact on the local environment.

All these community-led projects, and many more like them, are improving lives for people in some of the world’s most deprived areas. They are all linked by one thing: open data. This Saturday is the tenth annual Open Data Day, which celebrates its transformational impact around the globe.

Open data is data that can be freely accessed, used, modified and shared by anyone. It is the opposite of personal data, which must be kept private, and there have rightly been concerns raised about how that is used by giant technology firms. Open data is altogether different – this is non-personal information, and it can and should be used for the public good. It is the building block of what is called ‘open knowledge’, which is what data can become if it is useful, usable and used. The key features of openness are availability and access, reuse and redistribution, and universal participation.

Open Data Day is an opportunity to show its benefits and encourage the adoption of open data policies in government, business and civil society. The Open Knowledge Foundation operates a mini-grants scheme for community projects every year, and in 2020 we are supporting 65 events taking place all over the world including in Argentina, Bolivia, Brazil, Cameroon, Colombia, Costa Rica, Germany, Ghana, Guatemala, Indonesia, Kenya, Malawi, Mexico, Nigeria, Somalia, South Africa, Tanzania, Togo and Venezuela.

With the climate crisis now an emergency, open data can help tackle deforestation and monitor air pollution levels on our streets. It is being used in places such as the Democratic Republic of the Congo to increase young people’s knowledge of free local HIV-related services. In Nepal, streetlights data for Kathmandu has been collected by digital volunteers to influence policy for the maintenance of streetlights. The possibilities are endless.

Open data can track the flow of public money, expanding budget transparency, examining tax data and raising issues around public finance management. And it can be used by communities to highlight pressing issues on a local, national or global level, such as progress towards the United Nations’ Sustainable Development Goals.

I know that phrases like ‘open data’ and ‘open knowledge’ are not widely understood. With partners across the world, we are working to change that. This decade and the decades beyond are not to be feared. We live in a time when technological advances offer incredible opportunities for us all. This is a time to be hopeful about the future, and to inspire those who want to build a better society.

Open knowledge will lead to enlightened societies around the world, where everyone has access to key information and the ability to use it to understand and shape their lives; where powerful institutions are comprehensible and accountable; and where vital research information that can help us tackle challenges such as poverty and climate change is available to all: a fair, free and open future.
• The tenth Open Data Day will take place on Saturday 7th March 2020 with celebrations happening all over the world. Find out more at opendataday.org, discover events taking place near you and follow the conversation online via the hashtags #OpenDataDay and #ODD2020.

Breaking up big tech isn’t enough. We need to break them open

- February 27, 2020 in Open Knowledge, Open Knowledge Foundation, personal-data

From advocates, politicians and technologists, calls for doing something about big tech grow louder by the day. Yet concrete ideas are few or failing to reach the mainstream. This post covers what breaking up big tech would mean and why it’s not enough. I propose an open intervention that will give people a real choice and a way out of controlled walled gardens. Google, Facebook, Amazon and Apple are not natural monopolies and we need to regulate them to support competition and alternative business models.

What’s the problem?

As a social species, our social and digital infrastructure is of vital importance. Just think of the postal service which, even in the most extreme circumstances, would deliver letters to soldiers fighting on the front lines. There’s a complicated and not-risk-free system that makes this work, and we make it work, because it matters. It is so important for us to keep in touch with our loved ones, stay connected with news and what’s happening in our communities, countries and the planet. Our ability to easily and instantly collaborate and work with people halfway across the world is one of the wonders of the Information Age. The data we collect can help us make better decisions about our environment, transport, healthcare, education, governance and planning. It should be used to support the flourishing of all people and our planet.

But right now, so much of this data, so much of our social digital infrastructure, is owned, designed and controlled by a tiny elite of companies, driven by profit. We’re witnessing the unaccountable corporate capture of essential services, reliance on exploitative business models and the increasing dominance of big tech monopolies. Amazon, Facebook, Google, Apple and Microsoft use their amassed power to subvert the markets which they operate within, stifling competition and denying us real choice.

Amazon has put thousands of companies out of business, leaving them the option to sell on their controlled platform or not sell at all. Once just a digital bookstore, Amazon now controls over 49% of the US digital commerce market (and growing fast) — selling everything from sex toys to cupcakes. Facebook (who, remember, also own Instagram and WhatsApp) dominates social, isolating people who don’t want to use their services. About a fifth of the population of the entire planet (1.6 billion) log in daily. They control a vast honeypot of personal data, vulnerable to data breaches, influencing elections and enabling the spread of misinformation. It’s tough to imagine a digital industry Google doesn’t operate in.

These companies are too big, too powerful and too unaccountable. We can’t get them to change their behaviour. We can’t even get them to pay their taxes. And it’s way past time to do something about this.

Plans to break up monopolies

Several politicians are calling for breaking up big tech. In the USA, presidential candidate Elizabeth Warren wants two key interventions. One is to reverse some of the bigger controversial mergers and acquisitions of the last few years, such as Facebook’s purchase of WhatsApp and Instagram, while going for a stricter interpretation and enforcement of anti-trust law. The other intervention is even more interesting, and an acknowledgement of how much harm comes from monopolies that are themselves intermediaries between producers and consumers. Elizabeth Warren wants to pass “legislation that requires large tech platforms to be designated as ‘Platform Utilities’ and broken apart from any participant on that platform”. This would mean that Amazon, Facebook or Google could not both be the platform provider and sell their own services and products through the platform.

The EU has also taken aim at such platform power abuse. Google was fined €2.4 billion by the European Commission for denying “consumers a genuine choice by using its search engine to unfairly steer them to its own shopping platform”. Likewise, Amazon is currently under formal investigation for using its privileged access to platform data to put out competing products and outcompete other companies’ products. Meanwhile, in India, a foreign-owned company like Amazon is already prohibited from being a vendor on its own electronic marketplace.

Breaking up big tech is not enough

While break-up plans will go some way towards addressing the unhealthy centralisation of data and power, the two biggest problems with big tech monopolies will remain:
  1. It won’t give us better privacy or change the surveillance business models used by tech platforms; and
  2. It won’t provide genuine choice or accountability, leaving essential digital services under the control of big tech.
The first point relates to the toxic and anti-competitive business models increasingly known as ‘surveillance capitalism’. Smarter people than me have written about the dangers and dark patterns that emerge from this practice. When the commodity these companies profit from is your time and attention, these multi-billion-dollar companies are incentivised to hook you, manipulate you and keep dialling up the rampant consumerism which is destroying our planet. Our privacy and time are constantly exploited for profit. The break-ups Warren proposes won’t change this.

The second point means it still wouldn’t be any easier for other companies to compete or to experiment with alternative business models. Right now, it’s near impossible to compete with Facebook and Amazon since their dominance is built on ‘network effects’. Both companies strictly police their user network and data. People aren’t choosing these platforms because they are better; they default to them because that’s where everyone else is. Connectivity and reach are vital for people to communicate, share, organise and sell — there’s no option but to go where most people already are. So we’re increasingly locked in. We need to make it possible for other providers and services to thrive.

Breaking big tech open

Facebook’s numerous would-be competitors don’t fail because they aren’t good enough, or because they fail to get traction or even funding. Path was beautiful and had many advantages over Facebook. Privacy-preserving Diaspora got a huge amount of initial attention. Scuttlebutt has fantastic communities. Alternatives do exist. None of them have reduced the dominance of Facebook. The problem is not a lack of alternatives; the problem is closed design, closed business models and network effects.

What Facebook has, that no rival has, is all your friends. And where it keeps them is in a walled-off garden which Facebook controls. No one can interact with Facebook users without having a Facebook account and agreeing to Facebook’s terms and conditions (aka surveillance and advertising). Essentially, Facebook owns my social graph and decides on what terms I can interact with my friends. The same goes for other big social platforms: to talk to people on LinkedIn, I have to have a LinkedIn account; to follow people on Twitter, I must first sign up to Twitter; and so on. As users we take on the burden of maintaining numerous accounts, numerous passwords, sharing our data and content with all of these companies, on their terms.

It doesn’t have to be this way. These monopolies are not natural; they are monopolies by design — choosing to run on closed protocols and walling off their users in silos. We need to regulate Facebook and others to force them to open up their application programming interfaces (APIs) to make it possible for users to have access to each other across platforms and services.

Technically, interoperability is possible

There are already examples of digital social systems which don’t operate as walled gardens: email, for example. We don’t expect Google to refuse to deliver an email simply because we use an alternative email provider. If I send an email to a Gmail account from my Protonmail, FastMail or even Hotmail account, it goes through. It just works. No message about how I first have to get a Gmail account. This, fundamentally, is the reason email has been so successful for so long. Email uses an open protocol, supported by Google, Microsoft and others (probably due to being early enough, coming about in the heady open days of the web, before data mining and advertising became the dominant forces they are today … although email is increasingly centralised and dominated by Google).

While email just works, a very similar technology, instant messaging, doesn’t. We have no interoperability, which means many of us have upwards of four different chat apps on our phones and have to remember which of our friends are on Twitter, Facebook, WhatsApp (owned by Facebook), Signal, Wire, Telegram, etc. We don’t carry around five phones, so why do we maintain accounts with so many providers, each storing our personal details, each with a different account and password to remember? This is because these messaging apps use their own closed, proprietary protocols, and harm usability and accessibility in the process. This is not in the interests of most people.

Interoperability and the use of open protocols would transform this, offering us a better experience and control over our data while reducing our reliance on any one platform. Open protocols can form the basis of a shared digital infrastructure that’s more resilient, and would help us keep the companies that provide digital services accountable. It would make it possible to leave a platform and to choose whose services we use.
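As a small, hedged illustration of why email ‘just works’ across providers, here is a minimal Python sketch using the standard library’s SMTP support; the addresses, server host and credentials are placeholder assumptions, not anything from the original post. The point is that the client only talks to the sender’s own provider over an open protocol, and nothing in that protocol cares which provider the recipient happens to use.

```python
import smtplib
from email.message import EmailMessage

# Compose a message from an account at one provider to a recipient at another.
msg = EmailMessage()
msg["From"] = "me@fastmail.com"      # placeholder sender address
msg["To"] = "friend@gmail.com"       # recipient on a different provider
msg["Subject"] = "Open protocols just work"
msg.set_content("This message crosses providers because SMTP is an open standard.")

# Connect to the *sender's* SMTP server; the recipient's provider is irrelevant here.
with smtplib.SMTP("smtp.fastmail.com", 587) as server:   # placeholder host/port
    server.starttls()                                    # upgrade to an encrypted connection
    server.login("me@fastmail.com", "app-password")      # placeholder credentials
    server.send_message(msg)                             # relayed onwards via open protocols
```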

What would this look like in practice?

Say I choose to use a privacy-preserving service for instant messaging, photo sharing and events — possibly one of the many currently available today, or even something I’ve built or host myself. I create an event and I want to invite all my friends, wherever they are. This is where the open protocol and interoperability come in. I have some friends using the same service as me, many more scattered across Twitter, Facebook and other social services, maybe a few just on email. If these services allow interconnections with other services, then every person, wherever they are, will get my event invite and be able to RSVP, receive updates and possibly even comment (depending on what functionality the platforms support). No more getting left out as the cost of caring about privacy. Interoperability would be transformational. It would mean that:
  1. I can choose to keep my photos and data where I have better access, security and portability. This gives us greater control over our data and means that…
  2. Surveillance is harder and more expensive to do. My data will not all be conveniently centralised for corporations or governments to use in unaccountable ways I haven’t agreed to. Privacy ❤
  3. I won’t lose contact with, leave out, or forget friends who aren’t on the same platform as me. I can choose services which serve my needs better, not based on the fear of social exclusion or missing out. Hooray for inclusion and staying friends!
  4. I’ll be less stressed trying to remember and contact people across different platforms with different passwords and accounts (e.g. this currently requires a Facebook event, email, tweets, WhatsApp group reminders and Mastodon, Diaspora and Scuttlebutt posts for siloed communities…)
  5. Alternative services, and their alternative business models and privacy policies become much more viable! Suddenly, a whole ecosystem of innovation and experimentation is possible which is out of reach for us today. (I’m not saying it will be easy. Finding sustainable funding and non-advertising-based business models will still be hard and will require more effort and systemic interventions, but this is a key ingredient).
Especially this last point, the viability of creating alternatives, would start shifting the power imbalance between Facebook and its users (and regulators), making Facebook more accountable and incentivising it to be responsive to user wants and needs. Right now Facebook acts as it pleases because it can — it knows its users are trapped. As soon as people have meaningful choice, exploitation and abuse become much harder and more expensive to maintain.
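To make the event-invite scenario above a little more concrete: the post doesn’t prescribe a particular protocol, but ActivityStreams 2.0 (the vocabulary behind the ActivityPub standard) is one existing open format that fits. Below is a hedged sketch of what such an invitation could look like as a structured object, with hypothetical actor and recipient identifiers; any service speaking the same open vocabulary could deliver it to its own users.

```python
import json

# A minimal ActivityStreams 2.0-style "Invite" activity wrapping an Event.
# The URLs are hypothetical; the point is that the object is provider-neutral.
invite = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Invite",
    "actor": "https://my-photo-service.example/users/alice",   # hypothetical sender
    "object": {
        "type": "Event",
        "name": "Picnic in the park",
        "startTime": "2020-03-14T14:00:00Z",
        "location": {"type": "Place", "name": "Victoria Park"},
    },
    "to": [
        "https://other-network.example/users/bob",   # a friend on another service
        "mailto:carol@example.org",                  # or even a friend only on email
    ],
}

print(json.dumps(invite, indent=2))
```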

So, how do we get there?

The first milestone is regulating Facebook, Twitter and others to make them open up their APIs so that other services can read from and write to their events, groups, messages and so on. Yes, this isn’t trivial and there are questions to work out, but it can be done. Looking ahead, investing now in developing open standards for our social digital infrastructure is a must. Funders and governments should be supporting the work and adoption of open protocols and standards — working with open software and services to refine, test and use these standards and see how they work in practice over time.

We’ll need governance mechanisms for evolving and investing in our open digital infrastructure that include diverse stakeholders and account for power imbalances between them. We use platforms which have not been co-designed by us, on terms and conditions we have little say over. Investment in alternatives has largely failed outside of more authoritarian countries that have banned or blocked the likes of Google and Facebook. We need to do more to ensure our data and essential services are not in the hands of one or two companies too big to keep accountable. And after many years of work and discussions on this, I believe openness and decentralisation must play a central role.

Redecentralize.org and friends are working on a campaign to figure out how to make this a reality. Is this something you’re working on already, or do you want to contribute and get invited to future workshops and calls? Then ping me on hello@redecentralize.org. The opportunity is huge. By breaking big tech open, we can build a fairer digital future for all, so come get involved!

• This blogpost is a reposted version of a post originally published on the Redecentralize blog

Combating other people’s data

- February 18, 2020 in Frictionless Data, Open Knowledge

Frictionless Data Pipelines for Ocean Science

- February 10, 2020 in Frictionless Data, Open Knowledge

This blog post describes a Frictionless Data Pilot with the Biological and Chemical Oceanography Data Management Office (BCO-DMO). Pilot projects are part of the Frictionless Data for Reproducible Research project. Written by BCO-DMO team members Adam Shepherd, Amber York and Danie Kinkade, with development by Conrad Schloer.

Scientific research is implicitly reliant upon the creation, management, analysis, synthesis, and interpretation of data. When properly stewarded, data hold great potential to demonstrate the reproducibility of scientific results and accelerate scientific discovery. The Biological and Chemical Oceanography Data Management Office (BCO-DMO) is a publicly accessible earth science data repository established by the National Science Foundation (NSF) for the curation of biological, chemical, and biogeochemical oceanographic data from research in coastal, marine, and laboratory environments. With the groundswell surrounding the FAIR data principles, BCO-DMO recognized an opportunity to improve its curation services to better support reproducibility of results, while increasing process efficiencies for incoming data submissions. In 2019, BCO-DMO worked with the Frictionless Data team at Open Knowledge Foundation to develop a web application called Laminar for creating Data Package Pipelines (part of the Frictionless Data toolset) that help data managers process data efficiently while recording the provenance of their activities to support reproducibility of results.
The mission of BCO-DMO is to provide investigators with data management services that span the full data lifecycle from data management planning, to data publication, and archiving.

BCO-DMO provides free access to oceanographic data through a web-based catalog with tools and features facilitating assessment of fitness for purpose. The result of this effort is a database containing over 9,000 datasets from a variety of oceanographic and limnological measurements including those from: in situ sampling, moorings, floats and gliders, sediment traps; laboratory and mesocosm experiments; satellite images; derived parameters and model output; and synthesis products from data integration efforts. The project has worked with over 2,600 data contributors representing over 1,000 funded projects.  As the catalog of data holdings continued to grow in both size and the variety of data types it curates, BCO-DMO needed to retool its data infrastructure with three goals. First, to improve the transportation of data to, from, and within BCO-DMO’s ecosystem. Second, to support reproducibility of research by making all curation activities of the office completely transparent and traceable. Finally, to improve the efficiency and consistency across data management staff. Until recently, data curation activities in the office were largely dependent on the individual capabilities of each data manager. While some of the staff were fluent in Python and other scripting languages, others were dependent on in-house custom developed tools. These in-house tools were extremely useful and flexible, but they were developed for an aging computing paradigm grounded in physical hardware accessing local data resources on disk. While locally stored data is still the convention at BCO-DMO, the distributed nature of the web coupled with the challenges of big data stretched this toolset beyond its original intention. 
In 2015, we were introduced to the idea of data containerization and the Frictionless Data project in a Data Packages BoF at the Research Data Alliance conference in Paris, France. After evaluating the Frictionless Data specifications and tools, BCO-DMO developed a strategy to underpin its new data infrastructure on the ideas behind this project.
While the concept of data packaging is not new, the simplicity and extendibility of the Frictionless Data implementation made it easy to adopt within an existing infrastructure. BCO-DMO identified the Data Package Pipelines (DPP) project in the Frictionless Data toolset as key to achieving its data curation goals. DPP implements the philosophy of declarative workflows, which trade imperative code in a specific programming language that tells a computer how a task should be completed for declarative, structured statements that detail what should be done. These structured statements abstract the user writing the statements from the actual code executing them, and are useful for reproducibility over long periods of time where programming languages age, change or algorithms improve. This flexibility was appealing because it meant the intent of the data manager could be translated into many varying programming (and data) languages over time without having to refactor older workflows. In data management, that means that one of the things a DPP workflow captures is provenance – a common need across oceanographic datasets for reproducibility. DPP workflows translated into records of provenance explicitly communicate to data submitters and future data users what BCO-DMO did during the curation phase. Secondly, because workflow steps need to be interpreted by computers into code that carries out the instructions, it helped data management staff converge on a declarative language they could all share. This convergence meant cohesiveness, consistency, and efficiency across the team if we could implement DPP in a way they could all use.

In 2018, BCO-DMO formed a partnership with Open Knowledge Foundation (OKF) to develop a web application that would help any BCO-DMO data manager use the declarative language they had developed in a consistent way. Why develop a web application for DPP? As the data management staff evaluated DPP and Frictionless Data, they found that there was a learning curve to setting up the DPP environment, and a deep understanding of the Frictionless Data ‘Data Package’ specification was required. The web application abstracted away this required knowledge to achieve two main goals: 1) consistently structured Data Packages (datapackage.json) with all the required metadata employed at BCO-DMO, and 2) efficiencies of time by eliminating typos and syntax errors made by data managers. Thus, the partnership with OKF focused on making the needs of scientific research data a possibility within the Frictionless Data ecosystem of specs and tools.
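For readers who haven’t seen DPP, the sketch below gives a feel for what such a declarative workflow looks like. It is illustrative only: the step and parameter names are assumptions for the sake of example (apart from the processor names discussed later in this post), not BCO-DMO’s actual pipeline. The workflow is just data – an ordered list of what should be done – which DPP then turns into executing code, and which can be written out as a pipeline-spec.yaml file.

```python
import yaml  # PyYAML

# An illustrative declarative pipeline: an ordered list of processor steps with
# parameters, rather than imperative code. Step and parameter names here are
# assumptions for the sake of example.
pipeline = {
    "arctic-nfix": {
        "title": "Arctic Nitrogen Fixation Rates (illustrative)",
        "pipeline": [
            {"run": "load",
             "parameters": {"from": "data/nfix.csv", "name": "nfix"}},
            {"run": "convert_to_decimal_degrees",
             "parameters": {"resource": "nfix", "field": "Latitude"}},
            {"run": "convert_date",
             "parameters": {"resource": "nfix", "fields": ["Date", "Local Time"]}},
            {"run": "dump_to_path",
             "parameters": {"out-path": "output"}},
        ],
    }
}

# Writing the same statements out as YAML produces a pipeline-spec.yaml-style
# file: a human-readable record of exactly what was done to the data.
with open("pipeline-spec.yaml", "w") as f:
    yaml.safe_dump(pipeline, f, sort_keys=False)
```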
Data Package Pipelines is implemented in Python and comes with some built-in processors that can be used in a workflow. BCO-DMO took its own declarative language and identified gaps in the built-in processors. For these gaps, BCO-DMO and OKF developed Python implementations for the missing declarations to support the curation of oceanographic data, and the result was a new set of processors made available on GitHub. Some notable BCO-DMO processors are:
  • boolean_add_computed_field – computes a new field to add to the data, indicating whether a particular row satisfies a certain set of criteria. Example: where Cruise_ID = ‘AT39-05’ and Station = 6, set Latitude to 22.1645.
  • convert_date – converts any number of fields containing date information into a single date field with display format and timezone options. Date information is often reported in multiple columns such as `year`, `month`, `day`, `hours_local_time`, `minutes_local_time`, `seconds_local_time`. For spatio-temporal datasets, it’s important to know the UTC date and time of the recorded data to ensure that searches for data within a time range are accurate. Here, these columns are combined to form an ISO 8601-compliant UTC datetime value.
  • convert_to_decimal_degrees – converts a single field containing coordinate information from degrees-minutes-seconds or degrees-decimal_minutes to decimal degrees. The standard representation at BCO-DMO for spatial data conforms to the decimal degrees specification.
  • reorder_fields – changes the order of columns within the data. It is a convention within the oceanographic data community to put certain columns at the beginning of tabular data to help contextualize the following columns; examples of columns typically moved to the beginning are dates, locations, instrument or vessel identifiers, and depth at collection.
The remaining processors used by BCO-DMO can be found at https://github.com/BCODMO/bcodmo_processors
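As a rough illustration of the kind of transformation convert_to_decimal_degrees performs, here is a standalone Python sketch. It is not the actual BCO-DMO processor: the regex pattern and the field handling are simplified assumptions, shown only to make the degrees-minutes-seconds conversion concrete.

```python
import re

# Illustrative pattern: degrees, minutes and optional seconds, e.g. "71 35.2 N" or "71 35 12.6 N"
DMS_PATTERN = re.compile(
    r"(?P<degrees>-?\d+)\s+(?P<minutes>\d+(?:\.\d+)?)"
    r"(?:\s+(?P<seconds>\d+(?:\.\d+)?))?\s*(?P<hemisphere>[NSEW])?"
)

def to_decimal_degrees(value: str) -> float:
    """Convert a degrees-minutes(-seconds) coordinate string to decimal degrees."""
    match = DMS_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"Unrecognised coordinate: {value!r}")
    degrees = float(match.group("degrees"))
    minutes = float(match.group("minutes"))
    seconds = float(match.group("seconds") or 0.0)
    decimal = abs(degrees) + minutes / 60 + seconds / 3600
    # Southern and western hemispheres (or a leading minus sign) are negative.
    if degrees < 0 or match.group("hemisphere") in ("S", "W"):
        decimal = -decimal
    return decimal

print(to_decimal_degrees("71 35.2 N"))   # -> 71.5866...
```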

How can I use Laminar?

In our collaboration with OKF, BCO-DMO developed use cases based on real-world data submissions. One such example is a recent Arctic Nitrogen Fixation Rates dataset. [Screenshot: the original Arctic dataset] The original dataset needed the following curation steps to make the data more interoperable and reusable (a rough plain-Python equivalent is sketched after the list):
  • Convert lat/lon to decimal degrees
  • Add timestamp (UTC) in ISO format
  • ‘Collection Depth’ with value “surface” should be changed to 0
  • Remove parentheses and units from column names (field descriptions and units are captured in metadata)
  • Remove spaces from column names
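The sketch below shows roughly what these steps amount to in plain Python with pandas, outside of Laminar. The file name, column names and local-time offset are hypothetical stand-ins, and the lat/lon conversion is the same idea as the function sketched earlier; in practice these operations run through the DPP processors described above.

```python
import re
import pandas as pd

df = pd.read_csv("nfix_submission.csv")  # hypothetical file name

# Remove parenthesised units and spaces from column names; units are captured
# in the metadata (datapackage.json) instead of in the headers.
df.columns = [re.sub(r"\s*\(.*?\)", "", col).strip().replace(" ", "_")
              for col in df.columns]

# 'Collection Depth' values of "surface" become 0 (column name is a stand-in).
df["Collection_Depth"] = df["Collection_Depth"].replace("surface", 0)

# Combine date and local time into a single ISO 8601 UTC timestamp.
local = pd.to_datetime(df["Date"] + " " + df["Local_Time"])
df["ISO_DateTime_UTC"] = (
    local.dt.tz_localize("Etc/GMT+4")   # assumed local offset (UTC-4); depends on the cruise
         .dt.tz_convert("UTC")
         .dt.strftime("%Y-%m-%dT%H:%M:%SZ")
)

df.to_csv("nfix_clean.csv", index=False)
```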
The web application, named Laminar, built on top of DPP, helps data managers at BCO-DMO perform these operations in a consistent way. First, Laminar prompts us to name and describe the current pipeline being developed; it assumes that the data manager wants to load some data to start the pipeline and prompts for a source location. [Screenshot: the Laminar pipeline setup view]

After providing a name and description of our DPP workflow, we provide a data source to load and give it the name ‘nfix’. In subsequent pipeline steps, we refer to ‘nfix’ as the resource we want to transform. For example, to convert the latitude and longitude into decimal degrees, we add a new step to the pipeline, select the ‘Convert to decimal degrees’ processor (a proxy for our custom processor convert_to_decimal_degrees), select the ‘nfix’ resource, select a field from that ‘nfix’ data source, and specify the Python regex pattern identifying where the values for the degrees, minutes and seconds can be found in each value of the latitude column. [Screenshot: configuring a processor step]

Similarly, in step 7 of this pipeline, we want to generate an ISO 8601-compliant UTC datetime value by combining the pre-existing ‘Date’ and ‘Local Time’ columns. [Screenshot: the date processing step]

After the pipeline is completed, the interface displays all steps and lets the data manager execute the pipeline by clicking the green ‘play’ button at the bottom. This button generates the pipeline-spec.yaml file, executes the pipeline, and can display the resulting dataset. [Screenshots: the full list of steps and the resulting data]

The resulting DPP workflow contained 223 lines across this 12-step operation; for a data manager, the web application greatly reduces the chance of error compared with writing such a pipeline by hand. Ultimately, our work with OKF helped us develop processors that follow the DPP conventions.
Our goal for the pilot project with OKF was to have BCO-DMO data managers using Laminar for processing 80% of the data submissions we receive. The pilot was so successful that data managers have processed 95% of new data submissions to the repository using the application.
This is exciting from a data management perspective because the use of Laminar is more sustainable, and building it brought the team together to determine the best strategies for processing, documentation and more. This increase in consistency and efficiency is welcome from an administrative perspective and helps with the training of any new data managers joining the team. The OKF team were excellent partners and the catalysts of a successful project. The next steps for BCO-DMO are to build on the success of Data Package Pipelines by implementing Frictionless Data’s Goodtables data validation to help us develop submission guidelines for common data types. Special thanks to the OKF team – Lilly Winfree, Evgeny Karev, and Jo Barratt.