
Clarifying the semantics of data matrices and results tables: a Frictionless Data Pilot

- July 21, 2020 in Frictionless Data, Genomics, pilot

As part of the Frictionless Data for Reproducible Research project, funded by the Sloan Foundation, we have started a Pilot collaboration with the Data Readiness Group at the Department of Engineering Science of the University of Oxford; the group will be represented by Dr. Philippe Rocca-Serra, an Associate Member of Faculty. This Pilot will focus on removing the friction in reported scientific experimental results by applying the Data Package specifications. Written with Dr. Philippe Rocca-Serra.

Publishing of scientific experimental results is frequently done in ad-hoc ways that are seldom consistent. For example, results are often deposited as idiosyncratic sets of Excel files or tabular files that contain very little structure or description, making them difficult to use, understand and integrate. Interpreting such tables requires human expertise, which is both costly and slow, and leads to low reuse. Ambiguous tables of results can lead researchers to rerun analysis or computation over the raw data before they understand the published tables. This current approach is broken, does not fit users’ data mining workflows, and limits meta-analysis. A better procedure for organizing and structuring information would reduce unnecessary use of computational resources, which is where the Frictionless Data project comes into play.

This Pilot collaboration aims to help researchers publish their results in a more structured, reusable way. In this Pilot, we will use (and possibly extend) Frictionless tabular data packages to devise both generic and specialized templates that can be used to unambiguously report experimental results. Our short-term goal is to develop a set of Frictionless Data Packages for targeted use cases where impact is high. We will focus first on creating templates for statistical comparison results in genomics research, such as differential analysis, enrichment analysis, high-throughput screens, and univariate comparisons, by using the STATO ontology within tabular data packages.

Our longer-term goal is for these templates to be incorporated into publishing systems, allowing for clearer reporting of results, more knowledge extraction, and more reproducible science. For instance, we anticipate that this work will allow for increased consistency of table structure in publications, as well as increased data reuse owing to predictable syntax and layout. We also hope this work will ease the creation of linked data graphs from tables of results thanks to clarified semantics. An additional goal is to create code that is compatible with R’s ggplot2 library, which would allow for easy generation of data analysis plots. To this end, we plan on working with R developers in the future to create a package that will generate Frictionless Data compliant data packages.

This work has recently begun, and will continue throughout the year. We have already met with some challenges, such as working on ways to transform, or normalize, data and ways to incorporate RDF linked data (you can read our related conversations on GitHub). We are also working on how to define a ‘generic’ table layout definition that is broad enough to be reused in as wide a range of situations as possible.
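To give a flavour of the kind of template this Pilot is working towards, here is a minimal sketch (in Python, with the descriptor written as a plain dictionary) of a tabular data package for a differential-analysis results table. The field names, file name and STATO IRI are illustrative placeholders we have invented for this example, not the Pilot's actual template.

```python
# A minimal, illustrative sketch of a tabular data package descriptor for a
# differential-analysis results table. Field names, the STATO IRI and the
# CSV file name are hypothetical placeholders, not the Pilot's real template.
import json

descriptor = {
    "name": "differential-analysis-results",
    "title": "Example differential analysis results table",
    "licenses": [{"name": "CC-BY-4.0"}],
    "resources": [
        {
            "name": "results",
            "path": "results.csv",  # placeholder file name
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "gene_id", "type": "string",
                     "description": "Identifier of the measured gene"},
                    {"name": "log2_fold_change", "type": "number",
                     "description": "Effect size of the comparison"},
                    {"name": "p_value", "type": "number",
                     # rdfType could point at the relevant STATO term;
                     # the IRI below is a placeholder, not a real term ID.
                     "rdfType": "http://purl.obolibrary.org/obo/STATO_placeholder",
                     "description": "Unadjusted p-value"},
                    {"name": "adjusted_p_value", "type": "number",
                     "description": "Multiple-testing adjusted p-value"},
                ],
                "primaryKey": ["gene_id"],
            },
        }
    ],
}

with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```

The idea is that a reader (or a script) can recover the meaning of each column from the schema alone, without having to consult the original paper.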
If you are interested in staying up to date on this work, we encourage you to check out these repositories: https://gitlab.com/datascriptor/datascriptor-fldatapackages and https://github.com/ISA-tools/frictionless-collab. Additionally, we will (virtually) be at the eLife Sprint in September to work on a closely related project, which you can read about here: https://sprint.elifesciences.org/data-paper-skeleton-tools-for-life-sciences/. Throughout this Pilot, we are planning on reaching out to the community to test these ideas and get feedback. Please contact us on GitHub or on Discord if you are interested in contributing.

Reflecting on the first cohort of Frictionless Data Reproducible Research fellows

- June 9, 2020 in Frictionless Data

It is truly bittersweet to say that we are at the end of the first cohort of the Frictionless Data Reproducible Research fellows. Over the past nine months, I have had the pleasure of working with Monica Granados, Selene Yang, Daniel Ouso and Lily Zhao during the fellows programme. Combining their diverse backgrounds (from government data to mapping data, from post-PhD to graduate student), they have spent many hours together learning how to advocate for open science and how to use the Frictionless Data code and tools to make their data more reproducible. Together, they have also written several blogposts, presented a talk and given a workshop. And they did all of this during a global pandemic! I feel lucky to have worked with them, and will be eagerly watching their contributions to the open science space. Each fellow wrote a final blogpost reflecting on their time with the programme. You can read the originals here, and I have also republished them below:

Lily Zhao: Reflecting on my time as a fellow

As one of the inaugural Reproducible Research Fellows of Frictionless Data, I am eager to share my experience of the program, working with Sele, Ouso and Monica under the leadership of Lilly Winfree this year. I could not have asked for a better group of individuals to work remotely with. Sele, Ouso, Monica and I spent the last nine months discussing common issues in research reproducibility and delving into the philosophy behind open data science. Together we learned to apply Frictionless Data tools to our own data and mastered techniques for streamlining the reproducibility of our own research process. Lilly was an excellent mentor throughout the program and was always there to help with any issues we ran into.

This was also one of my first experiences working entirely remotely on a team across many time zones. Through the use of Google Hangouts, Zoom and Slack, the entire process was easier than I ever thought it could be. It is wonderful that through technology we are able to collaborate across the world more easily than ever before. We were also able to give multiple presentations together. Monica and I were joint speakers at csv,conf, where we talked about our experience as fellows and our experience using Frictionless Data tools. With so many people on the Zoom call, it really felt like we were part of a large community. The four of us also led a hands-on workshop introducing the Data Package Creator and GoodTables web interface tools. This was especially fun for me because we used a subset of my French Polynesia interview data as practice data for all workshop participants. Many of the questions asked by participants mirrored questions the four of us had already worked through together, so it was great to be able to share what we had learned with others.

I look forward to sharing these tools and the philosophy of open data science throughout my career and am very grateful to the Open Knowledge Foundation for this amazing learning opportunity. If you would like to learn more about my experience in the Frictionless Data Fellows program, please feel free to reach out to me personally!

Monica, Sele, Lilly, Ouso and I on our most recent conference call :)

Monica Granados: Gimme Dat Data (in a validated Data Package)

As a scientist I collect a lot of data, especially about animals that live in the water – fish, mussels, crayfish. This data is not only useful to me, but it can be used by others to improve the power of their studies, or to increase geographic range or phylogenetic diversity, for example. Prior to the Frictionless Data for Reproducible Research Fellowship, I had my data on GitHub along with a script that would use RCurl to pull the data from the repository. While the repository was accompanied by a README, the file didn’t have much information other than the manuscript which included the data. This structure facilitated reproducibility but not reusability. Conceivably, if you wanted to use my data for your own experiments, you could have contextualised the data by using the relevant manuscript, but it still would have been a challenge without any metadata, not to mention any potential structural errors you could have encountered that I didn’t catch when I uploaded the data.

It was through the introduction to the Frictionless tools, however, that I realised there was more I could do to make my science even more transparent, reproducible and reusable. The fellowship syllabus was structured in such a way that by learning about the tools we learned what the tools were facilitating – better data sharing. The fellows would learn how to use a tool through a self-guided lesson and then answer questions on Slack which asked us to interrogate why the tool was built the way it was. These lessons were also supported by calls with the full cohort of fellows, where we discussed what we had learned, talked through problems we were encountering as we used the tools with our own data, and reviewed papers on open science. The fellowship culminated with a workshop delivered by all four fellows, attended by over 40 participants, and a presentation at csv,conf.

Now when I share data as a data package, I know I have validated my tabular data for structural errors and that the file contains metadata that contextualises the data. Having the opportunity to be a part of the inaugural cohort has been a wonderful experience. I learned new tools and information that I will take and share for the rest of my career, but I also gained new colleagues and open science friends in my fellow fellows.

Daniel Ouso: Better Data, one resource at a time – my fellowship experience

Getting into the Frictionless Data fellowship

My background is largely in molecular biology, particularly infection diagnostics targeting arthropod viruses, bacteria and protozoa. I have relatively less bioinformatics experience, but this is the direction I am passionate about building my research occupation in. I first heard about Frictionless Data from the African Carpentries instructors’ mailing list, where Anelda had shared the inaugural fellowship call. I caught it in the nick of time, right at the submission deadline! By the way, you can watch for annual calls and other interesting stuff by following @frictionlessd8a. The call for the second cohort was open from late April, just closed in June, and the fellowship starts in September.

On-boarding

After a few email exchanges, Lilly arranged a first meeting to usher me into the fellowship. I was introduced to Jo Barrat, who patiently took me through my paces as I completed the logistical preliminaries. I was really looking forward to getting started. The onboarding also allowed me to get acquainted with the rest of the fellows, awesome people. I was excited!

Context

Overall, the world is searching for and promoting better ways to work with data, whether that is collecting data, making it accessible, finding novel ways to analyse high-throughput data, building dedicated workflows to publish data alongside conventional scientific publishing, moving and working with data across frameworks, or simply handling storage and security. All of these, plus other factors, provide avenues to exhaustively interrogate data in multiple ways, thus promoting improved data usefulness – something that has arguably been under-appreciated in times past. Frictionless Data, through its Progressive Data Toolkit and with the help of organisations like OKF and funding from the Sloan Foundation, is dedicated to alleviating hindrances to some of the aforementioned efforts. Empowering people is a core resource for the #BetterData dream.

The fellowship

An aspect of any research is the collection of data, which is used to test the hypotheses under study. The importance of data, good data for that matter, in research is therefore unquestionable. Approaches to data analysis may differ from field to field, yet there are conventional principles that do not discriminate between fields; such principles are the targets of Frictionless Data. I jumped at the opportunity to learn ways to ramp up the efficiency of my data workflows, with a touch of research openness and reproducibility. The journey took off with drawing up a meticulous roadmap, which I found very helpful, and it seems to end with this – sharing my experience. In between, exciting things happened.

In case one came in a little rusty with their basic Python or R, they were catered for early on, though you didn’t strictly need either to use the tools – to say, literally, zero programming skills are required as a prerequisite. There was a plethora of resources, and help from the fellows, not to mention from the ever-welcoming Lilly. The core sections of the Fellowship were prefaced by grasping basic components like the JSON data interchange format. Following these were the core tools and their specifications. The Data Package Creator tool is impressively emphatic about capturing metadata, a backbone theme for reproducibility. I initially found the Table Schema and Schema specifications confusing. Other fellows and I have previously written about the Data Package Creator and GoodTables, the tools for creating and validating data packages respectively. These tools are very progressive, continually incorporating feedback from the community, including fellows, to improve the user experience, so don’t be surprised at a few changes since the fellows’ blogs. In fact, a new entrant, which I only learned of recently, is the DataHub tool – “a useful solution for sharing datasets, and discovering high-quality datasets that others have produced”. I am yet to check it out.

Besides the main focus of the fellowship, I got to learn a lot about organisational skills and tools, such as GitHub projects, Toggl for time-monitoring, and general remote working, among others. I was introduced to new communities and initiatives such as PREreview, my first time participating in open research reviewing. The fellows were awesome to work with, and Lilly Winfree provided the best mentorship. Sometimes problems are foreseen and contingencies planned; other times unforeseen surprises rear their heads into our otherwise “perfect” plan. Guess what? You nailed it! COVID-19. Such surprises require adaptability akin to that of the fictional El Profesor in Money Heist. Since we could not organise the in-person seminar and/or workshops planned as part of the fellowship, we collectively opted for a virtual workshop instead. It went amazingly well.

What next

Acquired knowledge and skills become more useful when implemented. My goal is to apply them at every opportune opening and to keep learning other integrative tools. Yet there is also this about knowledge: it is meant to be spread. I hope to make up for the suspended in-person sessions and to keep engaging with @frictionlessd8a to continue advocating for open and reproducible research.

Conclusion

Tools that require minimal to no coding experience go a long way towards supporting the adoption of good data hygiene practices, even more so in places where coding expertise is scarce. The Frictionless Data tools will surely give your workflows some greasing regardless of your coding proficiency, especially for tabular data. This is especially needed given the deluge of data persistently churned out from various sources. Frictionless Data is for everyone working with data: researchers, data scientists and data engineers alike. The ultimate goal is to work with data in an open and reproducible way, which is consistent with modern scientific research practice. A concerted approach is also key, and I am glad to have represented Africa in the fellowship. Do not hesitate to reach out if you think I can be a resource to your cause.

Sele Yang: Let's keep reproducing knowledge!

A great process has come to an end for the first cohort of the Frictionless Data for Reproducible Research Fellowship. It has been a process of enormous and valuable learning that could only have happened thanks to the collaborative work of everyone who took part. At the beginning, I remember the great fear (which somehow still persists, though more faintly) of not having the technical skills required to carry out my project, but little by little I got to know my colleagues and felt supported by them, and with great patience they took me by the hand so I would not get lost along the way. Thanks to this team, I travelled the beaches of Mo'orea through Lily's data, and I learned about ways of doing research outside my own field of expertise with Ouso and Monica. I came to recognise the great work that researchers do to defend more open, equitable and accessible knowledge.

Although our shared journey ends here, I can highlight that despite the COVID-19 crisis, which forced us to change many of our plans during the programme, we managed to meet, even if only virtually, not only to share among ourselves but also with a large audience at our workshop on using the programme's tools and methodologies. It was a great activity for reinforcing the importance of sharing knowledge and making it more accessible, all the more so in times of crisis. I thank the Open Knowledge Foundation for running this programme, and I invite everyone to explore the material we produced during these months of work. I finish this learning process with an even stronger conviction of how necessary collaborative processes are for opening up and democratising science and knowledge, especially in these times when collaboration and the sharing of what we learn will make us stronger as a society.

Enough is enough: solidarity with the Black community and Black Lives Matter

- June 4, 2020 in Open Knowledge Foundation

The mission of the Open Knowledge Foundation is to work for a fair, free and open future for all, not just some. We stand against the racism, injustices, and inequalities plaguing our world today. It is our responsibility to use our platform to elevate marginalised voices and take action against racism. We ask that the Open Knowledge Foundation community comes together to support the Black community. Black lives matter. It is important to me that we acknowledge the ongoing violence against the Black community, including the recent murder of George Floyd, and work towards ways to dismantle systemic racism. This also means taking a look inward at our work at the Open Knowledge Foundation – how can we better encourage diversity and support marginalised communities? As I sit in Austin, Texas in the USA, which is the traditional territory of the Tonkawa and Comanche Peoples, I won’t pretend to have a good answer right now. I am trying to learn and listen, and since I feel we cannot be silent, I am hoping to encourage dialogue and amplify others’ voices. I’m also looking towards the work that the Open Knowledge Foundation can do to make the future more fair, free and open for all.

Within OKF, I work with open data – teaching people how to manage their data, working on solutions to clean messy data, discussing standards and best practices. Data can be thought of as facts, but data is not neutral. At OKF, we push for all non-personal data to be open, meaning it can be freely (and easily) accessed by anyone for any purpose. What would the world look like if more data was open? For one thing, government and city policies would be more transparent. We could more easily and dynamically show which communities are being negatively impacted by local policies and then use that information to inform new policies and drive change. Going forward, I’m committing to ask: how can we better use data to empower marginalised communities? Currently at OKF, we are working to understand bias in machine learning algorithms via our Justice Programme, which investigates topics such as how AI amplifies systemic bias in the court system. We are not the only group working on these projects. I recently learned about Data for Black Lives, which is a group working to use data to enact real change for Black people, such as showing how structural racism is impacting the COVID-19 crisis. Locally in Austin, Measure is a not-for-profit organisation that works to use data to advocate for underserved communities. Here is their timely research and proposal on community policing.

I write this in the hope that we can start a conversation with the OKF community and provide a space to amplify minority voices. It is also imperative that we look inward and identify where we are currently failing. For example, historically our Advisory Board has not been diverse, but we are actively working to change this. What does diversity look like for the Open Knowledge Foundation? How can we make practical changes and what framework will underpin this? We are having an all-team meeting next week to discuss these questions and create a plan of action. Here are some actions I am taking that I encourage others to participate in:
  • Donate to anti-white-supremacy organisations
  • Support local Black businesses (here is a list of Black-owned bookstores in the USA: https://aalbc.com/bookstores/list.php)
  • Promote work by Black creators
  • Call your local legislators and ask them to promote and pass police reform policies
  • Proactively educate yourself and learn from your mistakes
Here are some books and resources that others have shared with me: Do you have other resources you would like to share with our community? Please post a comment. Do you have other suggestions for meaningful actions we can take at the Open Knowledge Foundation to support the Black community? Please let us know. We are trying to listen, learn, and create a space for the community to have their voices heard as we aim to create a more fair world for everyone.

Join the Frictionless Data workshop – 20 May

- April 28, 2020 in Frictionless Data

Join us on 20 May at 4pm UK/10am CDT for a Frictionless Data workshop led by the Reproducible Research Fellows! This 1.5-hour workshop will cover an introduction to the open source Frictionless Data tools. Participants will learn about data wrangling, including how to document metadata, package data into a data package, write a schema to describe data, and validate data. The workshop is suitable for beginners and those looking to learn more about using Frictionless Data.
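For a flavour of the steps the workshop covers, here is a rough sketch of the same workflow done programmatically with the Python datapackage and goodtables libraries (the libraries current at the time of writing); the file name data.csv and the metadata values are placeholders for your own data, and the workshop itself uses the browser-based tools rather than this exact code.

```python
# Sketch: infer a data package from a CSV, add metadata, save the descriptor,
# then validate it. "data.csv" is a placeholder for your own tabular file.
from datapackage import Package
from goodtables import validate

# Infer a descriptor (resource list plus table schema) from the CSV file
package = Package()
package.infer("data.csv")

# Add descriptive metadata by editing the descriptor directly
package.descriptor["title"] = "My example dataset"
package.descriptor["licenses"] = [{"name": "CC-BY-4.0"}]
package.commit()

# Save the descriptor alongside the data
package.save("datapackage.json")

# Validate the whole data package (structural and schema checks)
report = validate("datapackage.json", preset="datapackage")
print(report["valid"])
```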

Everyone is welcome to join, but you must register to attend using this link

The Fellows Programme is part of the Frictionless Data for Reproducible Research project overseen by the Open Knowledge Foundation. This project, funded by the Sloan Foundation, applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate data workflows in research contexts. At its core, Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. The core specification, the Data Package, is a simple and practical “container” for data and metadata. This workshop will be led by the members of the First Cohort of the Fellows Programme: Lily Zhao, Daniel Ouso, Monica Granados, and Selene Yang. You can read more about their work during this programme here: http://fellows.frictionlessdata.io/blog/. Additionally, applications are now open for the Second Cohort of Fellows. Read more about applying here: https://blog.okfn.org/2020/04/27/apply-now-to-become-a-frictionless-data-reproducible-research-fellow/

Apply now to become a Frictionless Data Reproducible Research Fellow

- April 27, 2020 in Frictionless Data

The Frictionless Data Reproducible Research Fellows Program, supported by the Sloan Foundation, aims to train graduate students, postdoctoral scholars, and early career researchers to become champions for open, reproducible research using Frictionless Data tools and approaches in their fields. Apply today to join the Second Cohort of Frictionless Data Fellows! Fellows will learn about Frictionless Data, including how to use Frictionless tools in their domains to improve reproducible research workflows, and how to advocate for open science. Working closely with the Frictionless Data team, Fellows will lead training workshops at conferences, host events at universities and in labs, and write blogs and other communications content. In addition to mentorship, we are providing Fellows with stipends of $5,000 to support their work and time during the nine-month-long Fellowship. We welcome applications using this form from 27th April until 1st June 2020, with the Fellowship starting in the late summer. We value diversity and encourage applicants from communities that are under-represented in science and technology, people of colour, women, people with disabilities, and LGBTI+ individuals.

Frictionless Data for Reproducible Research

The Fellowship is part of the Frictionless Data for Reproducible Research project at Open Knowledge Foundation, and is the second iteration. Frictionless Data aims to reduce the friction often found when working with data, such as when data is poorly structured, incomplete, hard to find, or archived in difficult-to-use formats. This project, funded by the Sloan Foundation, applies our work to data-driven research disciplines, in order to help researchers and the research community resolve data workflow issues. At its core, Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. The core specification, the Data Package, is a simple and practical “container” for data and metadata. The Frictionless Data approach aims to address identified needs for improving data-driven research, such as generalized, standard metadata formats, interoperable data, and open-source tooling for data validation.

Frictionless Data for Reproducible Research Fellows Programme

Fellowship program

The First Cohort of Fellows ran from Fall 2019 until Summer 2020, and you can read more about their work on the Fellows blog: http://fellows.frictionlessdata.io/blog/. During the Fellowship, our team will be on hand to work closely with you as you complete the work. We will help you learn Frictionless Data tooling and software, and provide you with resources to help you create workshops and presentations. We will also announce Fellows on the project website and publish your blogs and workshop slides within our network channels. We will provide mentorship on how to work on an Open project, and will work with you to achieve your Fellowship goals.

How to apply

The Fellowship is open to early career researchers, such as graduate students and postdoctoral scholars, anywhere in the world and in any scientific discipline. Successful applicants will be enthusiastic about reproducible research and open science, have some experience with communications, writing, or giving presentations, and have some technical skills (basic experience with Python, R, or Matlab, for example), but do not need to be technically proficient. If you are interested but do not have all of the qualifications, we still encourage you to apply. If you have any questions, please email the team at frictionlessdata@okfn.org, ask a question on the project’s Gitter channel, or check out the Fellows FAQ section. Apply soon, and share with your networks!

Announcing Frictionless Data Community Virtual Hangout – 20 April

- April 16, 2020 in Frictionless Data

Photo by William White on Unsplash

We are thrilled to announce we’ll be co-hosting a virtual community hangout with Datopian to share recent developments in the Frictionless Data community. This will be a 1-hour meeting where community members come together to discuss key topics in the data community. Here are some key discussions we hope to cover:
  • Introductions & share the purpose of this hangout.
  • Share the update on the new website release and general Frictionless Data related updates.
  • Have community members share their thoughts and general feedback on Frictionless Data.
  • Share information about CSV Conf.
The hangout is scheduled to happen on 20th April 2020 at 5 pm CET. If you would like to attend, you can sign up for the event in advance here. Everyone is welcome. Looking forward to seeing you there!

Frictionless Public Utility Data: A Pilot Study

- March 18, 2020 in Open Knowledge

This blog post describes a Frictionless Data Pilot with the Public Utility Data Liberation project. Pilot projects are part of the Frictionless Data for Reproducible Research project. Written by Zane Selvans, Christina Gosnell, and Lilly Winfree. The Public Utility Data Liberation project, PUDL, aims to make US energy data easier to access and use. Much of this data, including information about the cost of electricity, how much fuel is being burned, power plant usage, and emissions, is not well documented or is in difficult-to-use formats. Last year, PUDL joined forces with the Frictionless Data for Reproducible Research team as a Pilot project to release this public utility data. PUDL takes the original spreadsheets, CSV files, and databases and turns them into unified Frictionless tabular data packages that can be used to populate a database, or read in directly with Python, R, Microsoft Access, and many other tools.
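As a sketch of what "read in directly with Python" can look like, the snippet below loads one resource from a published tabular data package into a pandas DataFrame using the datapackage library; the descriptor URL and resource name are placeholders rather than a specific PUDL release.

```python
# Sketch: load one resource from a published tabular data package into pandas.
# The descriptor URL and resource name below are placeholders, not a specific
# PUDL release.
import pandas as pd
from datapackage import Package

package = Package("https://example.org/pudl/datapackage.json")  # placeholder URL
print(package.resource_names)  # list the tables included in the package

resource = package.get_resource("plants")  # placeholder resource name
rows = resource.read(keyed=True)           # typed rows as dictionaries
df = pd.DataFrame(rows)
print(df.head())
```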

What is PUDL?

The PUDL project, which is coordinated by Catalyst Cooperative, is focused on creating an energy utility data product that can serve a wide range of users. PUDL was inspired to make this data more accessible because the current US utility data ecosystem is fragmented, and commercial products are expensive. There are hundreds of gigabytes of information available from government agencies, but they are often difficult to work with, and different sources can be hard to combine. PUDL users include researchers, activists, journalists, and policy makers. They have a wide range of technical backgrounds, from grassroots organizers who might only feel comfortable with spreadsheets to PhDs with cloud computing resources, so it was important to provide data that would work for all users. Before PUDL, much of this data was freely available to download from various sources, but it was typically messy and not well documented. This led to a lack of uniformity and reproducibility amongst projects that were using this data. Users were scraping the data together in their own ways, making it hard to compare analyses or understand outcomes. Therefore, one of the goals for PUDL was to minimize these duplicated efforts and enable the creation of lasting, cumulative outputs.

What were the main Pilot goals?

The main focus of this Pilot was to create a way to openly share the utility data in a reproducible way that would be understandable to PUDL’s many potential users. The first change Catalyst identified they wanted to make during the Pilot was to their data storage medium. PUDL was previously creating a PostgreSQL database as the main data output. However, many users, even those with technical experience, found setting up the separate database software a major hurdle that prevented them from accessing and using the processed data. They also desired a static, archivable, platform-independent format. Therefore, Catalyst decided to transition PUDL away from PostgreSQL, and instead try Frictionless Tabular Data Packages. They also wanted a way to share the processed data without needing to commit to long-term maintenance and curation, meaning they needed the outputs to continue being useful to users even if they only had minimal resources to dedicate to maintenance and updates. The team decided to package their data into Tabular Data Packages and identified Zenodo as a good option for openly hosting that packaged data.

Catalyst also recognized that most users only want to download the outputs and use them directly, and do not care about reproducing the data processing pipeline themselves, but it was still important to provide the processing pipeline code publicly to support transparency and reproducibility. Therefore, in this Pilot, they focused on transitioning their existing ETL pipeline from outputting a PostgreSQL database, which was defined using SQLAlchemy, to outputting data packages which could then be archived publicly on Zenodo. Importantly, they needed this pipeline to maintain the metadata, data type information, and database structural information that had already been accumulated. This rich metadata needed to be stored alongside the data itself, so future users could understand where the data came from and understand its meaning. The Catalyst team used Tabular Data Packages to record and store this metadata (see the code here: https://github.com/catalyst-cooperative/pudl/blob/master/src/pudl/load/metadata.py).

Another complicating factor is that many of the PUDL datasets are fairly entangled with each other. The PUDL team ideally wanted users to be able to pick and choose which datasets they actually wanted to download and use, without requiring them to download it all (currently about 100GB of data when uncompressed). However, they were worried that if single datasets were downloaded, users might miss that some of the datasets were meant to be used together. So, the PUDL team created information, which they call “glue”, that shows which datasets are linked together and should ideally be used in tandem.

The culmination of this Pilot was a release of the PUDL data (access it here – https://zenodo.org/record/3672068 – and read the corresponding documentation here – https://catalystcoop-pudl.readthedocs.io/en/v0.3.2/), which includes integrated data from EIA Form 860, EIA Form 923, the EPA Continuous Emissions Monitoring System (CEMS), the EPA Integrated Planning Model (IPM), and FERC Form 1.

What problems were encountered during this Pilot?

One issue that the group encountered during the Pilot was that the data types available in Postgres are substantially richer than those natively supported by the Tabular Data Package standard. However, this issue is an endemic problem of wanting to work with several different platforms, so the team compromised and worked with the least common denominator. In the future, PUDL might store several different sets of data types for use in different contexts: for example, one for freezing the data out into data packages, one for SQLite, and one for Pandas. Another problem encountered during the Pilot resulted from testing the limits of the draft Tabular Data Package specifications. There were aspects of the specifications that the Catalyst team assumed were fully implemented in the reference (Python) implementation of the Frictionless toolset, but were in fact still works in progress. This led the Frictionless team to start a documentation improvement project, including a revision of the specifications website to incorporate this feedback. Through the Pilot, the teams worked to implement new Frictionless features, including the specification of composite primary keys and foreign key references that point to external data packages. Other new Frictionless functionality created during this Pilot included partitioning of large resources into resource groups, in which all resources use identical table schemas, and adding gzip compression of resources. The Pilot also focused on implementing more complete validation through goodtables, including bytes/hash checks, foreign key checks, and primary key checks, though there is still more work to be done here.
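To make a couple of those features more concrete, here is a rough sketch of what a resource descriptor using a composite primary key and a foreign key might look like, written as a Python dictionary. The table and field names are invented for illustration, and the "group" and "compression" properties follow draft Frictionless patterns of the kind discussed during the Pilot rather than finalised parts of the core specification.

```python
# Illustrative resource descriptor only: table and field names are invented,
# and "group"/"compression" reflect draft Frictionless patterns rather than
# the finalised core Tabular Data Package specification.
generation_resource = {
    "name": "generation-2019",
    "path": "generation-2019.csv.gz",
    "compression": "gz",    # gzip-compressed resource (draft pattern)
    "group": "generation",  # resource group sharing one table schema (draft pattern)
    "profile": "tabular-data-resource",
    "schema": {
        "fields": [
            {"name": "plant_id", "type": "integer"},
            {"name": "report_year", "type": "year"},
            {"name": "net_generation_mwh", "type": "number"},
        ],
        # Composite primary key: a row is identified by plant and year together
        "primaryKey": ["plant_id", "report_year"],
        # Foreign key pointing at a field in another resource of the package
        "foreignKeys": [
            {
                "fields": "plant_id",
                "reference": {"resource": "plants", "fields": "plant_id"},
            }
        ],
    },
}
```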

Future Directions

A common problem with using publicly available energy data is that the federal agencies creating the data do not use version control or maintain change logs for the data they publish, but they do frequently go back years after the fact to revise or alter previously published data – with no notification. To combat this problem, Catalyst is using data packages to encapsulate the raw inputs to the ETL process. They are setting up a process which will periodically check whether the federal agencies’ posted data has been updated or changed, create an archive, and upload it to Zenodo. They will also store metadata in non-tabular data packages, indicating which information is stored in each file (year, state, month, etc.) so that there can be a uniform process of querying those raw input data packages. This means the raw inputs won’t have to be archived alongside every data release; instead, one can simply refer to these other versioned archives of the inputs. Catalyst hopes these version-controlled raw archives will also be useful to other researchers.

Another next step for Catalyst will be to make the ETL and new dataset integration more modular, to hopefully make it easier for others to integrate new datasets. For instance, they are planning on integrating the EIA 861 and the ISO/RTO LMP data next. Other future plans include simplifying metadata storage, using Docker to containerize the ETL process for better reproducibility, and setting up a Pangeo instance for live interactive data access without requiring anyone to download any data at all. The team would also like to build visualizations that sit on top of the database, making an interactive, regularly updated map of US coal plants and their operating costs, compared to new renewable energy in the same area. They would also like to visualize power plant operational attributes from EPA CEMS (e.g., ramp rates, min/max operating loads, the relationship between load factor and heat rate, marginal additional fuel required for a startup event…).

Have you used PUDL? The team would love to hear feedback from users of the published data so that they can understand how to improve it, based on real user experiences. If you are integrating other US energy/electricity data of interest, please talk to the PUDL team about whether they might want to integrate it into PUDL, to help ensure that it’s all more standardized and can be maintained long term. Also let them know what other datasets you would find useful (e.g. FERC EQR, FERC 714, PHMSA Pipelines, MSHA mines…). If you have questions, please ask them on GitHub (https://github.com/catalyst-cooperative/pudl) so that the answers will be public for others to find as well.

Tracking the Trade of Octopus (and Packaging the Data)

- March 13, 2020 in Frictionless Data, Open Knowledge

This blog is the second in a series done by the Frictionless Data Fellows, discussing how they created Frictionless Data Packages with their research data. Learn more about the Fellows Programme here http://fellows.frictionlessdata.io/. By Lily Zhao

Introduction

When I started graduate school, I was shocked to learn that seafood is actually the most internationally traded food commodity in the world. In fact, the global trade in fish is worth more than the trades of tea, coffee and sugar combined (Fisheries FAO, 2006). However, for many developing countries, being connected to the global seafood market can be a double-edged sword. It is true that global trade has the potential to redistribute some wealth and improve the livelihoods of fishers and traders in these countries. But it can also promote illegal trade and overfishing, which can harm the future sustainability of a local food source. Over the course of my master’s degree, I developed a passion for studying these issues, which is why I am excited to share with you my experience turning some of the data my collaborators and I collected into a packaged dataset using the Open Knowledge Foundation’s Data Package tool.

These data provide a snapshot into the global market for octopus and how it is traded throughout and between Kenya, Tanzania and Mozambique before heading to European markets. This research project was an international collaboration between the Stockholm Resilience Centre in Sweden, the National Institute for Medical Research of Tanzania, Pwani University in Kilifi, Kenya, and the School of Marine and Environmental Affairs at the University of Washington. These data eventually became my master’s thesis, and this data package will complement a forthcoming publication of our findings. Specifically, these data are the prices and quantities at which middlemen in Tanzania and Kenya reported buying and selling octopus. These data are exciting because they not only inform our understanding of who is benefiting from the trade of octopus but could also assist in improving the market price of octopus in Tanzania. This is because value chain information can help Tanzania’s octopus fishery along its path to Marine Stewardship Council seafood certification. Seafood that gets the Marine Stewardship Council label gains a certain amount of credibility, which in turn can increase profits. For developing countries, this seafood label can provide a monetary incentive for improving fisheries management. But before Tanzania’s octopus fishery can get certified, it will need to prove it can trace the flow of its octopus supply chain, and manage it sustainably. We hope that this packaged dataset will ultimately inform this effort.

Getting the data

To gather the data, my field partner Chris and I went to 10 different fishing communities like this one.

Middlemen buy and sell seafood in Mtwara, Tanzania.

We went on to interview all the major exporters of octopus in both Tanzania and Kenya and spoke with company agents and octopus traders who bought their octopus from 570 different fishermen. With these interviews we were able to account for about 95% of East Africa’s international octopus market share.

My research partner, Chris Cheupe, and I at an octopus collection point.

Creating the Data Package

The Data Package tool was created by the Open Knowledge Foundation to compile data and metadata into a compact unit, making it easier and more efficient for others to access. You can create a data package using the online platform or using the Python or R programming libraries. I initially had some issues using the R package instead of the online tool, which may have been related to the fact that the original data file was not UTF-8 encoded. But stay tuned! For now, I made my data package using the Data Package Creator online tool. The tool helped me create a schema that outlines the data’s structure, including a description of each column. The tool also helps you outline the metadata for the dataset as a whole, including information like the license and author. Our dataset has a lot of complicated columns, and the tool gave me a streamlined way to describe each column via the schema. Afterwards, I added the metadata using the left-hand side of the browser tool and checked to make sure that the data package was valid!

The green bar at the top of the screenshot indicates validity

If the information you provide for each column does not match the data within the columns, the package will not validate and instead you will get an error like this:

The red bar at the top of the screenshot indicates invalidity

Check out my final data package by visiting my GitHub repository!
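If you prefer working in code rather than the browser tool, a rough equivalent of the validation step using the Python goodtables library might look like the sketch below. The file name and field definitions are placeholders I have invented for illustration, not the actual columns of my dataset.

```python
# Sketch: validate a CSV against a hand-written Table Schema and print any
# structural errors. The file name and field definitions are placeholders.
from goodtables import validate

schema = {
    "fields": [
        {"name": "trader_id", "type": "string"},
        {"name": "buy_price_per_kg", "type": "number"},
        {"name": "sell_price_per_kg", "type": "number"},
    ]
}

report = validate("octopus_prices.csv", schema=schema)  # placeholder file name
if report["valid"]:
    print("Data is valid against the schema")
else:
    for table in report["tables"]:
        for error in table["errors"]:
            print(error["code"], error["message"])
```

This is the same check the GoodTables web interface performs: the green or red bar in the screenshots above corresponds to the report's valid flag here.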

Reference:

Fisheries, F. A. O. (2006). The state of world fisheries and aquaculture 2006.

Announcing the 2020 Frictionless Data Tool Fund

- March 2, 2020 in Frictionless Data

Apply for a mini-grant to build an open source tool for reproducible research using Frictionless Data tooling, specs, and code base.

Today, Open Knowledge Foundation is launching the second round of the Frictionless Data Tool Fund, a mini-grant scheme offering grants of $5,000 to support individuals or organisations in developing an open tool for reproducible science or research built using the Frictionless Data specifications and software. We welcome submissions of interest until 17th May 2020.

The Tool Fund is part of the Frictionless Data for Reproducible Research project at Open Knowledge Foundation. This project, funded by the Sloan Foundation, applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate reproducible data workflows in research contexts. At its core, Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. The core specification, the Data Package, is a simple and practical “container” for data and metadata.

With this announcement we are looking for individuals or organisations of scientists, researchers, developers, or data wranglers to build upon our existing open source tools and code base to create novel tooling for reproducible research. We will prioritise tools focusing on the following fields/domains of science: biology, genetics, neuroscience, ecology, geology, and bioinformatics. The fund will be accepting submissions from now until 1st May, with projects starting mid-June and to be completed by the end of the year. This builds on the success of the 2019 Tool Fund, which funded the creation of four tools: a tool to convert biodiversity DarwinCore Archives into Frictionless data packages; a tool that bundles Open Referral data as data packages; a tool to export Neuroscience Experiments System data as data packages; and a tool to import and export data packages in Google Sheets. For this year’s Tool Fund, we would like the community to work on tools that can make a difference to researchers and scientists in these domains.

Applications can be submitted by filling out this form by 1st May. The Frictionless Data team will notify all applicants whether they have been successful or not by mid-June at the very latest. Successful candidates will then be invited for interviews before the final decision is given. We will base our choice on evidence of technical capabilities, and we will also favour applicants who demonstrate an interest in practical use of the Frictionless Data specifications. Preference will also be given to applicants who show an interest in working with and maintaining these tools going forward. For more questions about the fund, speak directly to us on our forum, on our Gitter chat, or email us at frictionlessdata@okfn.org.