
Frictionless Public Utility Data: A Pilot Study

- March 18, 2020 in Open Knowledge

This blog post describes a Frictionless Data Pilot with the Public Utility Data Liberation project. Pilot projects are part of the Frictionless Data for Reproducible Research project. Written by Zane Selvans, Christina Gosnell, and Lilly Winfree. The Public Utility Data Liberation project, PUDL, aims to make US energy data easier to access and use. Much of this data, including information about the cost of electricity, how much fuel is being burned, power plant usage, and emissions, is poorly documented or published in difficult-to-use formats. Last year, PUDL joined forces with the Frictionless Data for Reproducible Research team as a Pilot project to release this public utility data. PUDL takes the original spreadsheets, CSV files, and databases and turns them into unified Frictionless tabular data packages that can be used to populate a database, or read in directly with Python, R, Microsoft Access, and many other tools.

What is PUDL?

The PUDL project, which is coordinated by Catalyst Cooperative, is focused on creating an energy utility data product that can serve a wide range of users. PUDL was created to make this data more accessible because the current US utility data ecosystem is fragmented, and commercial products are expensive. There are hundreds of gigabytes of information available from government agencies, but they are often difficult to work with, and different sources can be hard to combine. PUDL users include researchers, activists, journalists, and policy makers. They have a wide range of technical backgrounds, from grassroots organizers who might only feel comfortable with spreadsheets, to PhDs with cloud computing resources, so it was important to provide data that would work for all users. Before PUDL, much of this data was freely available to download from various sources, but it was typically messy and not well documented. This led to a lack of uniformity and reproducibility amongst projects that were using this data. Users were scraping the data together in their own ways, making it hard to compare analyses or understand outcomes. Therefore, one of the goals for PUDL was to minimize these duplicated efforts, and enable the creation of lasting, cumulative outputs.

What were the main Pilot goals?

The main focus of this Pilot was to create a way to openly share the utility data in a reproducible way that would be understandable to PUDL’s many potential users. The first change Catalyst identified they wanted to make during the Pilot was to their data storage medium. PUDL was previously creating a PostgreSQL database as the main data output. However, many users, even those with technical experience, found setting up the separate database software a major hurdle that prevented them from accessing and using the processed data. They also desired a static, archivable, platform-independent format. Therefore, Catalyst decided to transition PUDL away from PostgreSQL, and instead try Frictionless Tabular Data Packages. They also wanted a way to share the processed data without needing to commit to long-term maintenance and curation, meaning they needed the outputs to continue being useful to users even if they only had minimal resources to dedicate to maintenance and updates. The team decided to package their data into Tabular Data Packages and identified Zenodo as a good option for openly hosting that packaged data.

Catalyst also recognized that most users only want to download the outputs and use them directly, and do not care about reproducing the data processing pipeline themselves, but it was still important to provide the processing pipeline code publicly to support transparency and reproducibility. Therefore, in this Pilot, they focused on transitioning their existing ETL pipeline from outputting a PostgreSQL database, which was defined using SQLAlchemy, to outputting data packages that could then be archived publicly on Zenodo. Importantly, they needed this pipeline to maintain the metadata, data type information, and database structural information that had already been accumulated. This rich metadata needed to be stored alongside the data itself, so future users could understand where the data came from and understand its meaning. The Catalyst team used Tabular Data Packages to record and store this metadata (see the code here: https://github.com/catalyst-cooperative/pudl/blob/master/src/pudl/load/metadata.py).

Another complicating factor is that many of the PUDL datasets are fairly entangled with each other. The PUDL team ideally wanted users to be able to pick and choose which datasets they actually wanted to download and use, without requiring them to download everything (currently about 100GB of data when uncompressed). However, they were worried that if single datasets were downloaded, users might miss that some of the datasets were meant to be used together. So, the PUDL team created information, which they call “glue”, that shows which datasets are linked together and should ideally be used in tandem. The culmination of this Pilot was a release of the PUDL data (access it here – https://zenodo.org/record/3672068 – and read the corresponding documentation here – https://catalystcoop-pudl.readthedocs.io/en/v0.3.2/), which includes integrated data from EIA Form 860, EIA Form 923, the EPA Continuous Emissions Monitoring System (CEMS), the EPA Integrated Planning Model (IPM), and FERC Form 1.
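For readers who simply want to use a released data package, loading it with the Python datapackage library is straightforward. The sketch below is illustrative only: the local path and the choice of resource are placeholders, not the actual PUDL file or table names.

```python
# Minimal sketch: read one table from a downloaded Tabular Data Package.
# The path below is a placeholder for wherever the Zenodo archive was unzipped.
from datapackage import Package

package = Package("pudl-data-release/datapackage.json")

# The descriptor carries the metadata the pipeline preserves: field types,
# descriptions, primary/foreign keys, sources, and licensing.
print(package.descriptor["name"])
print(package.resource_names)

# Read a resource into memory as keyed rows (a list of dicts).
resource = package.get_resource(package.resource_names[0])
rows = resource.read(keyed=True)
print(rows[:3])
```

The same descriptor can be handed to R, SQL loaders, or other Frictionless integrations, which is what makes the packaged outputs platform independent.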

What problems were encountered during this Pilot?

One issue that the group encountered during the Pilot was that the data types available in Postgres are substantially richer than those available natively in the Tabular Data Package standard. However, this issue is an endemic problem of wanting to work with several different platforms, so the team compromised and worked with the least common denominator. In the future, PUDL might store several different sets of data types for use in different contexts: for example, one for freezing the data out into data packages, one for SQLite, and one for Pandas.

Another problem encountered during the Pilot resulted from testing the limits of the draft Tabular Data Package specifications. There were aspects of the specifications that the Catalyst team assumed were fully implemented in the reference (Python) implementation of the Frictionless toolset, but were in fact still works in progress. This work led the Frictionless team to start a documentation improvement project, including a revision of the specifications website to incorporate this feedback.

Through the Pilot, the teams worked to implement new Frictionless features, including the specification of composite primary keys and foreign key references that point to external data packages. Other new Frictionless functionality created during this Pilot included partitioning of large resources into resource groups, in which all resources use identical table schemas, and gzip compression of resources. The Pilot also focused on implementing more complete validation through goodtables, including bytes/hash checks, foreign key checks, and primary key checks, though there is still more work to be done here.
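To make the schema work concrete, the fragment below shows what a composite primary key and a foreign key pointing at a resource in another data package can look like in a Table Schema, followed by a goodtables validation call. Everything here is invented for illustration; the real PUDL definitions live in the metadata module linked above, and the shape of the cross-package reference reflects the Pilot-era work rather than a settled convention.

```python
# Illustrative Table Schema (not actual PUDL metadata) with a composite
# primary key and a foreign key referencing an external data package.
schema = {
    "fields": [
        {"name": "plant_id", "type": "integer"},
        {"name": "report_year", "type": "year"},
        {"name": "net_generation_mwh", "type": "number"},
    ],
    # Composite primary key: rows are identified by plant and year together.
    "primaryKey": ["plant_id", "report_year"],
    "foreignKeys": [
        {
            "fields": "plant_id",
            "reference": {
                "package": "pudl-eia860",  # hypothetical external package name
                "resource": "plants",
                "fields": "plant_id",
            },
        }
    ],
}

# Validation of a packaged release with goodtables (structure, schema,
# and integrity checks such as primary/foreign keys and file hashes).
from goodtables import validate

report = validate("datapackage.json")
print(report["valid"], report["error-count"])
```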

Future Directions

A common problem with using publicly available energy data is that the federal agencies creating the data do not use version control or maintain change logs for the data they publish, but they do frequently go back years after the fact to revise or alter previously published data — with no notification. To combat this problem, Catalyst is using data packages to encapsulate the raw inputs to the ETL process. They are setting up a process which will periodically check whether the federal agencies’ posted data has been updated or changed, create an archive, and upload it to Zenodo. They will also store metadata in non-tabular data packages, indicating which information is stored in each file (year, state, month, etc.) so that there can be a uniform process of querying those raw input data packages. This will mean the raw inputs won’t have to be archived alongside every data release. Instead, one can simply refer to these other versioned archives of the inputs. Catalyst hopes these version-controlled raw archives will also be useful to other researchers.

Another next step for Catalyst will be to make the ETL and new dataset integration more modular, to hopefully make it easier for others to integrate new datasets. For instance, they are planning on integrating the EIA 861 and the ISO/RTO LMP data next. Other future plans include simplifying metadata storage, using Docker to containerize the ETL process for better reproducibility, and setting up a Pangeo instance for live interactive data access without requiring anyone to download any data at all. The team would also like to build visualizations that sit on top of the database, making an interactive, regularly updated map of US coal plants and their operating costs, compared to new renewable energy in the same area. They would also like to visualize power plant operational attributes from EPA CEMS (e.g., ramp rates, min/max operating loads, relationship between load factor and heat rate, marginal additional fuel required for a startup event…).

Have you used PUDL? The team would love to hear feedback from users of the published data so that they can understand how to improve it, based on real user experiences. If you are integrating other US energy/electricity data of interest, please talk to the PUDL team about whether they might want to integrate it into PUDL to help ensure that it’s all more standardized and can be maintained long term. Also let them know what other datasets you would find useful (e.g. FERC EQR, FERC 714, PHMSA Pipelines, MSHA mines…). If you have questions, please ask them on GitHub (https://github.com/catalyst-cooperative/pudl) so that the answers will be public for others to find as well.

Tracking the Trade of Octopus (and Packaging the Data)

- March 13, 2020 in Frictionless Data, Open Knowledge

This blog is the second in a series done by the Frictionless Data Fellows, discussing how they created Frictionless Data Packages with their research data. Learn more about the Fellows Programme here http://fellows.frictionlessdata.io/. By Lily Zhao

Introduction

When I started graduate school, I was shocked to learn that seafood is actually the most internationally traded food commodity in the world. In fact, the global trade in fish is worth more than the trades of tea, coffee and sugar combined (Fisheries FAO, 2006). However, for many developing countries, being connected to the global seafood market can be a double-edged sword. It is true that global trade has the potential to redistribute some wealth and improve the livelihoods of fishers and traders in these countries. But it can also promote illegal trade and overfishing, which can harm the future sustainability of a local food source. Over the course of my master’s degree, I developed a passion for studying these issues, which is why I am excited to share with you my experience turning some of the data my collaborators and I collected into a packaged dataset using the Open Knowledge Foundation’s Datapackage tool. These data provide a snapshot into the global market for octopus and how it is traded throughout and between Kenya, Tanzania and Mozambique before heading to European markets. This research project was an international collaboration between the Stockholm Resilience Centre in Sweden, the National Institute for Medical Research of Tanzania, Pwani University in Kilifi, Kenya, and the School of Marine and Environmental Affairs at the University of Washington. These data eventually became my master’s thesis, and this data package will complement a forthcoming publication of our findings. Specifically, these data are the prices and quantities at which middlemen in Tanzania and Kenya reported buying and selling octopus. These data are exciting because they not only inform our understanding of who is benefiting from the trade of octopus but could also assist in improving the market price of octopus in Tanzania. This is because value chain information can help Tanzania’s octopus fishery along its path to Marine Stewardship Council seafood certification. Seafood that gets the Marine Stewardship Council label gains a certain amount of credibility, which in turn can increase profit. For developing countries, this seafood label can provide a monetary incentive for improving fisheries management. But before Tanzania’s octopus fishery can get certified, they will need to prove they can trace the flow of their octopus supply chain, and manage it sustainably. We hope that this packaged dataset will ultimately inform this effort.

Getting the data

To gather the data, my field partner Chris and I went to 10 different fishing communities like the one pictured below.

Middlemen buy and sell seafood in Mtwara, Tanzania.

We went on to interview all the major exporters of octopus in both Tanzania and Kenya and spoke with company agents and octopus traders who bought their octopus from 570 different fishermen. With these interviews, we were able to account for about 95% of East Africa’s international octopus market share.

My research partner, Chris Cheupe, and I at an octopus collection point.

Creating the Data Package

The datapackage tool was created by the Open Knowledge Foundation to compile data and metadata into a compact unit, making it easier and more efficient for others to access. You can create the data package using the online platform or using the Python or R software libraries. I had some issues using the R package instead of the online tool initially, which may have been related to the fact that the original data file was not utf-8 encoded. But stay tuned! For now, I made my data package using the Data Package Creator online tool. The tool helped me create a schema that outlines the data’s structure, including a description of each column. The tool also helps you outline the metadata for the dataset as a whole, including information like the license and author. Our dataset has a lot of complicated columns, and the tool gave me a streamlined way to describe each column via the schema. Afterwards, I added the metadata using the left-hand side of the browser tool and checked to make sure that the data package was valid!
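If you would rather script this step than use the browser tool, the Python datapackage library can do roughly the same thing. This is a minimal sketch under assumed names: the CSV file, package name, and license below are placeholders, not the ones from this dataset.

```python
# Sketch: infer a schema from a CSV, add package-level metadata, validate, save.
from datapackage import Package

package = Package()
package.infer("octopus_trade_data.csv")  # placeholder file; infers a Table Schema

# Package-level metadata, equivalent to the web tool's left-hand panel.
package.descriptor["name"] = "octopus-trade-east-africa"   # illustrative
package.descriptor["licenses"] = [{"name": "CC-BY-4.0"}]   # illustrative
package.commit()  # re-validate the descriptor after editing it

print(package.valid)             # True once descriptor and data agree
package.save("datapackage.json")
```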

The green bar at the top of the screenshot indicates validity

If the information you provide for each column does not match the data within the columns, the package will not validate and, instead, you will get an error like this:

The red bar at the top of the screenshot indicates invalidity

Check out my final data package by visiting my GitHub repository!

Reference:

Fisheries, F. A. O. (2006). The state of world fisheries and aquaculture 2006.

Announcing the 2020 Frictionless Data Tool Fund

- March 2, 2020 in Frictionless Data

Apply for a mini-grant to build an open source tool for reproducible research using Frictionless Data tooling, specs, and code base.

Today, Open Knowledge Foundation is launching the second round of the Frictionless Data Tool Fund, a mini-grant scheme offering grants of $5,000 to support individuals or organisations in developing an open tool for reproducible science or research built using the Frictionless Data specifications and software. We welcome submissions of interest until 17th May 2020. The Tool Fund is part of the Frictionless Data for Reproducible Research project at Open Knowledge Foundation. This project, funded by the Sloan Foundation, applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate reproducible data workflows in research contexts. At its core, Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. The core specification, the Data Package, is a simple and practical “container” for data and metadata.

With this announcement we are looking for individuals or organisations of scientists, researchers, developers, or data wranglers to build upon our existing open source tools and code base to create novel tooling for reproducible research. We will prioritize tools focusing on the following fields/domains of science: biology, genetics, neuroscience, ecology, geology, and bioinformatics. The fund will be accepting submissions from now until 1st May, with projects starting mid-June and to be completed by the end of the year. This builds on the success of the 2019 Tool Fund, which funded the creation of four tools: a tool to convert the biodiversity DarwinCore Archive into Frictionless data packages; a tool that bundles Open Referral data as data packages; a tool to export Neuroscience Experiments System data as data packages; and a tool to import and export data packages in Google Sheets.

Applications can be submitted by filling out this form by 1st May. The Frictionless Data team will notify all applicants whether they have been successful or not at the very latest by mid-June. Successful candidates will then be invited for interviews before the final decision is given. We will base our choice on evidence of technical capabilities and also favour applicants who demonstrate an interest in practical use of the Frictionless Data specifications. Preference will also be given to applicants who show an interest in working with and maintaining these tools going forward. For more questions on the fund, speak directly to us on our forum, on our Gitter chat or email us at frictionlessdata@okfn.org.

Frictionless Data Pipelines for Ocean Science

- February 10, 2020 in Frictionless Data, Open Knowledge

This blog post describes a Frictionless Data Pilot with the Biological and Chemical Oceanography Data Management Office (BCO-DMO). Pilot projects are part of the Frictionless Data for Reproducible Research project. Written by the BCO-DMO team members Adam Shepherd, Amber York, and Danie Kinkade, with development by Conrad Schloer.

Scientific research is implicitly reliant upon the creation, management, analysis, synthesis, and interpretation of data. When properly stewarded, data hold great potential to demonstrate the reproducibility of scientific results and accelerate scientific discovery. The Biological and Chemical Oceanography Data Management Office (BCO-DMO) is a publicly accessible earth science data repository established by the National Science Foundation (NSF) for the curation of biological, chemical, and biogeochemical oceanographic data from research in coastal, marine, and laboratory environments. With the groundswell surrounding the FAIR data principles, BCO-DMO recognized an opportunity to improve its curation services to better support reproducibility of results, while increasing process efficiencies for incoming data submissions. In 2019, BCO-DMO worked with the Frictionless Data team at Open Knowledge Foundation to develop a web application called Laminar for creating Frictionless Data Package Pipelines that help data managers process data efficiently while recording the provenance of their activities to support reproducibility of results.
The mission of BCO-DMO is to provide investigators with data management services that span the full data lifecycle from data management planning, to data publication, and archiving.

BCO-DMO provides free access to oceanographic data through a web-based catalog with tools and features facilitating assessment of fitness for purpose. The result of this effort is a database containing over 9,000 datasets from a variety of oceanographic and limnological measurements including those from: in situ sampling, moorings, floats and gliders, sediment traps; laboratory and mesocosm experiments; satellite images; derived parameters and model output; and synthesis products from data integration efforts. The project has worked with over 2,600 data contributors representing over 1,000 funded projects.  As the catalog of data holdings continued to grow in both size and the variety of data types it curates, BCO-DMO needed to retool its data infrastructure with three goals. First, to improve the transportation of data to, from, and within BCO-DMO’s ecosystem. Second, to support reproducibility of research by making all curation activities of the office completely transparent and traceable. Finally, to improve the efficiency and consistency across data management staff. Until recently, data curation activities in the office were largely dependent on the individual capabilities of each data manager. While some of the staff were fluent in Python and other scripting languages, others were dependent on in-house custom developed tools. These in-house tools were extremely useful and flexible, but they were developed for an aging computing paradigm grounded in physical hardware accessing local data resources on disk. While locally stored data is still the convention at BCO-DMO, the distributed nature of the web coupled with the challenges of big data stretched this toolset beyond its original intention. 
In 2015, we were introduced to the idea of data containerization and the Frictionless Data project in a Data Packages BoF at the Research Data Alliance conference in Paris, France. After evaluating the Frictionless Data specifications and tools, BCO-DMO developed a strategy to underpin its new data infrastructure on the ideas behind this project.
While the concept of data packaging is not new, the simplicity and extendibility of the Frictionless Data implementation made it easy to adopt within an existing infrastructure. BCO-DMO identified the Data Package Pipelines (DPP) project in the Frictionless Data toolset as key to achieving its data curation goals. DPP implements the philosophy of declarative workflows, which trade imperative code in a specific programming language that tells a computer how a task should be completed for declarative, structured statements that detail what should be done. These structured statements abstract the user writing the statements from the actual code executing them, and are useful for reproducibility over long periods of time where programming languages age, change or algorithms improve. This flexibility was appealing because it meant the intent of the data manager could be translated into many varying programming (and data) languages over time without having to refactor older workflows. In data management, that means that one of the languages a DPP workflow captures is provenance – a common need across oceanographic datasets for reproducibility. DPP workflows translated into records of provenance explicitly communicate to data submitters and future data users what BCO-DMO had done during the curation phase. Secondly, because workflow steps need to be interpreted by computers into code that carries out the instructions, it helped data management staff converge on a declarative language they could all share. This convergence meant cohesiveness, consistency, and efficiency across the team if we could implement DPP in a way they could all use.

In 2018, BCO-DMO formed a partnership with Open Knowledge Foundation (OKF) to develop a web application that would help any BCO-DMO data manager use the declarative language they had developed in a consistent way. Why develop a web application for DPP? As the data management staff evaluated DPP and Frictionless Data, they found that there was a learning curve to setting up the DPP environment, and a deep understanding of the Frictionless Data ‘Data Package’ specification was required. The web application abstracted this required knowledge to achieve two main goals: 1) consistently structured Data Packages (datapackage.json) with all the required metadata employed at BCO-DMO, and 2) efficiencies of time by eliminating typos and syntax errors made by data managers. Thus, the partnership with OKF focused on making the needs of scientific research data a possibility within the Frictionless Data ecosystem of specs and tools.
Data Package Pipelines is implemented in Python and comes with some built-in processors that can be used in a workflow. BCO-DMO took its own declarative language and identified gaps in the built-in processors. For these gaps, BCO-DMO and OKF developed Python implementations for the missing declarations to support the curation of oceanographic data, and the result was a new set of processors made available on Github.
Some notable BCO-DMO processors are:
  • boolean_add_computed_field – Computes a new field to add to the data, indicating whether a particular row satisfies a certain set of criteria. Example: where Cruise_ID = ‘AT39-05’ and Station = 6, set Latitude to 22.1645.
  • convert_date – Converts any number of fields containing date information into a single date field with display format and timezone options. Often date information is reported in multiple columns such as `year`, `month`, `day`, `hours_local_time`, `minutes_local_time`, `seconds_local_time`. For spatio-temporal datasets, it’s important to know the UTC date and time of the recorded data to ensure that searches for data with a time range are accurate. Here, these columns are combined to form an ISO 8601-compliant UTC datetime value.
  • convert_to_decimal_degrees – Converts a single field containing coordinate information from degrees-minutes-seconds or degrees-decimal_minutes to decimal_degrees. The standard representation at BCO-DMO for spatial data conforms to the decimal degrees specification.
  • reorder_fields – Changes the order of columns within the data. This is a convention within the oceanographic data community to put certain columns at the beginning of tabular data to help contextualize the following columns. Examples of columns that are typically moved to the beginning are: dates, locations, instrument or vessel identifiers, and depth at collection.
The remaining processors used by BCO-DMO can be found at https://github.com/BCODMO/bcodmo_processors
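These custom processors follow the standard datapackage-pipelines processor shape. The sketch below is not BCO-DMO’s code (the real implementations are in the repository above); it only shows that general pattern, using a simplified degrees/decimal-minutes conversion, and the parameter names are assumptions.

```python
# General shape of a datapackage-pipelines custom processor (simplified).
from datapackage_pipelines.wrapper import process


def modify_datapackage(datapackage, parameters, stats):
    # Keep the output schema accurate: the converted column becomes a number.
    for resource in datapackage["resources"]:
        for field in resource["schema"]["fields"]:
            if field["name"] == parameters["field"]:
                field["type"] = "number"
    return datapackage


def process_row(row, row_index, spec, resource_index, parameters, stats):
    # e.g. "41 30.5" (degrees and decimal minutes) -> 41.508333 decimal degrees.
    # Sign handling for southern/western coordinates is omitted for brevity.
    degrees, minutes = str(row[parameters["field"]]).split()
    row[parameters["field"]] = float(degrees) + float(minutes) / 60
    return row


process(modify_datapackage=modify_datapackage, process_row=process_row)
```

A pipeline step then names the processor and supplies its parameters (such as which field to convert), which is exactly the information Laminar collects through its forms.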

How can I use Laminar?

In our collaboration with OKF, BCO-DMO developed use cases based on real-world data submissions. One such example is a recent Arctic Nitrogen Fixation Rates dataset. The original dataset needed the following curation steps to make the data more interoperable and reusable:
  • Convert lat/lon to decimal degrees
  • Add timestamp (UTC) in ISO format
  • ‘Collection Depth’ with value “surface” should be changed to 0
  • Remove parentheses and units from column names (field descriptions and units are captured in metadata)
  • Remove spaces from column names
The web application, named Laminar, built on top of DPP, helps data managers at BCO-DMO perform these operations in a consistent way. First, Laminar prompts us to name and describe the current pipeline being developed, assumes that the data manager wants to load some data in to start the pipeline, and prompts for a source location. After providing a name and description of our DPP workflow, we provide a data source to load, and give it the name ‘nfix’. In subsequent pipeline steps, we refer to ‘nfix’ as the resource we want to transform. For example, to convert the latitude and longitude into decimal degrees, we add a new step to the pipeline, select the ‘Convert to decimal degrees’ processor (a proxy for our custom processor convert_to_decimal_degrees), select the ‘nfix’ resource, select a field from that ‘nfix’ data source, and specify the Python regex pattern identifying where the values for the degrees, minutes and seconds can be found in each value of the latitude column. Similarly, in step 7 of this pipeline, we want to generate an ISO 8601-compliant UTC datetime value by combining the pre-existing ‘Date’ and ‘Local Time’ columns. After the pipeline is completed, the interface displays all steps, and lets the data manager execute the pipeline by clicking the green ‘play’ button at the bottom. This button generates the pipeline-spec.yaml file, executes the pipeline, and can display the resulting dataset. The resulting DPP workflow contained 223 lines across this 12-step operation, and for a data manager, the web application reduces the chance of error compared to generating this pipeline by hand. Ultimately, our work with OKF helped us develop processors that follow the DPP conventions.
Our goal for the pilot project with OKF was to have BCO-DMO data managers using Laminar to process 80% of the data submissions we receive. The pilot was so successful that data managers have processed 95% of new data submissions to the repository using the application.
This is exciting from a data management processing perspective because the use of Laminar is more sustainable, and it brought the team together to determine the best strategies for processing, documentation, and more. This increase in consistency and efficiency is welcomed from an administrative perspective and helps with the training of any new data managers coming to the team. The OKF team are excellent partners, who were the catalysts to a successful project. The next steps for BCO-DMO are to build on the success of the Frictionless Data Package Pipelines by implementing the Frictionless Data Goodtables validation tooling to help us develop submission guidelines for common data types. Special thanks to the OKF team – Lilly Winfree, Evgeny Karev, and Jo Barrett.

Frictionless Data for Reproducible Research Call for Pilot Collaborations with Scientists

- January 20, 2020 in Frictionless Data

Have you ever looked back at a graph of fluorescence change in neurons or gene expression data in C. elegans from years ago and wondered how exactly you got that result? Would you have enough findable notes at hand to repeat that experiment? Do you have a quick, repeatable method for preparing your data to be published with your manuscripts (as required by many journals and funders)? If these questions give you pause, we are interested in helping you!

For many data users, getting insight from data is not always a straightforward process. Data is often hard to find, archived in difficult-to-use formats, poorly structured or incomplete. These issues create friction and make it difficult to use, publish, and share data. The Frictionless Data initiative aims to reduce friction in working with data, with a goal to make it effortless to transport data among different tools and platforms for further analysis.
The Frictionless Data for Reproducible Research project, part of the Open Knowledge Foundation and funded by the Sloan Foundation, is focused on helping researchers and the research community resolve data workflow issues.

Over the last several years, Frictionless Data has produced specifications, software, and best practices that address identified needs for improving data-driven research, such as generalized, standard metadata formats, interoperable data, and open-source tooling for data validation. For researchers, Frictionless Data tools, specifications, and software can be used to:
    • Improve the quality of your dataset
    • Quickly find and fix errors in your data
    • Put your data collection and relevant information that provides context about your data in one container before you share it
    • Write a schema – a blueprint that tells others how your data is structured, and what type of content is to be expected in it (see the sketch after this list)
    • Facilitate data reuse by creating machine-readable metadata
    • Make your data more interoperable so you can import it into various tools like Excel, R, or Python
    • Publish your data to repositories more easily
    • See our open source repositories here
    • Read more about how to get started with our Field Guide tutorials
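As a small illustration of the schema-writing step mentioned in the list above, the sketch below uses the Python tableschema library to infer a first draft that you can then refine by hand; the CSV file name is a placeholder.

```python
# Sketch: infer a draft Table Schema from a CSV, review it, then save it.
import json

from tableschema import Table

table = Table("experiment_results.csv")  # placeholder file name
table.infer()  # guesses a name and type for every column

# Review and adjust the guesses (types, constraints, descriptions) so the
# schema documents what others should expect in each column.
print(json.dumps(table.schema.descriptor, indent=2))
table.schema.save("tableschema.json")
```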
  Importantly, these tools can be used on their own, or adapted into your own personal and organisational workflows. For instance, neuroscientists can implement Frictionless Data tooling and specs to help keep track of imaging metadata from the microscope to analysis software to publication; to optimize ephys data workflows from voltage recording, to tabular data, to analyzed graphs; or to make data more easily shareable for smoother publishing with a research article.
We want to learn about your multifaceted workflow and help make your data more interoperable between the various formats and tools you use.
We are looking for researchers and research-related groups to join Pilots, and are particularly keen to work with: scientists creating data, data managers in a research group, statisticians and data scientists, data wranglers in a database, publishers, and librarians helping researchers manage their data or teaching data best practices. The primary goal will be to work collaboratively with scientists and scientific data to enact exemplar data practice, supported by Frictionless Data specifications and software, to deliver on the promise of data-driven, reproducible research. We will work with you, integrating with your current tools and methodologies, to enhance your workflows and provide increased efficiency and accuracy of your data-driven research.

Want to know more? Through our past Pilots, we worked directly with organisations to solve real problems managing data:
  • In an ongoing Pilot with the Biological and Chemical Oceanography Data Management Office (BCO-DMO), we helped BCO-DMO develop a data management UI, called Laminar, which incorporates Frictionless Data Package Pipelines on the backend. BCO-DMO’s data managers are now able to receive data in various formats, import the data into Laminar, perform several pipeline processes, and then host the clean, transformed data for other scientists to (re)use. The next steps in the Pilot are to incorporate GoodTables into the Laminar pipeline to validate the data as it is processed. This will help ensure data quality and will also improve the processing experience for the data managers.
  • In a Pilot with the University of Cambridge, we worked with Stephen Eglen to capture complete metadata about retinal ganglion cells in a data package. This metadata included the type of ganglion cell, the species, the radius of the soma, citations, and raw images. 
  • Collaborating with the Cell Migration Standard Organization (CMSO), we investigated the standardization of cell tracking data. CMSO used the Tabular Data Package to make it easy to import their data into a Pandas dataframe (in Python) to allow for dynamic data visualization and analysis.
To find out more about Frictionless data visit frictionlessdata.io or email the team frictionlessdata@okfn.org.

Frictionless Data Tool Fund update: Shelby Switzer and Greg Bloom, Open Referral

- January 15, 2020 in Data Package, Frictionless Data, Open Knowledge

This blogpost is part of a series showcasing projects developed during the 2019 Frictionless Data Tool Fund. The 2019 Frictionless Data Tool Fund provided four mini-grants of $5,000 to support individuals or organisations in developing an open tool for reproducible research built using the Frictionless Data specifications and software. This fund is part of the Frictionless Data for Reproducible Research project, which is funded by the Sloan Foundation. This project applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate reproducible data workflows in research contexts.

Open Referral creates standards for health, human, and social services data – the data found in community resource directories used to help find resources for people in need. In many organisations, this data lives in a multitude of formats, from handwritten notes to Excel files on a laptop to Microsoft SQL databases in the cloud. For community resource directories to be maximally useful to the public, this disparate data must be converted into an interoperable format. Many organisations have decided to use Open Referral’s Human Services Data Specification (HSDS) as that format. However, to accurately represent this data, HSDS uses multiple linked tables, which can be challenging to work with. To make this process easier, Greg Bloom and Shelby Switzer from Open Referral decided to implement data package bundling of their CSV files using the Frictionless Data Tool Fund.

In order to accurately represent the relationships between organisations, the services they provide, and the locations where they are offered, HSDS makes sense of disparate data by linking multiple CSV files together with foreign keys. Open Referral used Frictionless Data’s datapackage to specify the tables’ contents and relationships in a single machine-readable file, so that this standardised format could transport HSDS-compliant data in a way that all of the teams who work with this data can use: CSVs of linked data.

In the Tool Fund, Open Referral worked on their HSDS Transformer tool, which enables a group or person to transform data into an HSDS-compliant data package, so that it can then be combined with other data or used in any number of applications. The HSDS Transformer is a Ruby library that can be used during the extract, transform, load (ETL) workflow of raw community resource data. This library extracts the community resource data, transforms that data into HSDS-compliant CSVs, and generates a datapackage.json that describes the data output. The Transformer can also output the datapackage as a zip file, called HSDS Zip, enabling systems to send and receive a single compressed file rather than multiple files. The Transformer can be spun up in a Docker container — and once it’s live, the API can deliver a payload that includes links to the source data and to the configuration file that maps the source data to HSDS fields. The Transformer then grabs the source data and uses the configuration file to transform the data and return a zip file of the HSDS-compliant data package.

Example of a demo app consuming the API generated from the HSDS Zip

The Open Referral team has also been working on projects related to the HSDS Transformer and HSDS Zip. For example, the HSDS Validator checks that a given data package of community service data is HSDS-compliant. Additionally, they have used these tools in the field with a project in Miami. For this project, the HSDS Transformer was used to transform data from a Microsoft SQL Server into an HSDS Zip. Then that zipped data package was used to populate a Human Services Data API with a generated developer portal and OpenAPI Specification. Further, as part of this work, the team also contributed to the original source code for the datapackage-rb Ruby gem. They added a new feature to infer a datapackage.json schema from a given set of CSVs, so that you can generate the json file automatically from your dataset.

Greg and Shelby are eager for the Open Referral community to use these new tools and provide feedback. To use these tools currently, users should either be a Ruby developer who can use the gem as part of another Ruby project, or be familiar enough with Docker and HTTP APIs to start a Docker container and make an HTTP request to it. You can use the HSDS Transformer as a Ruby gem in another project or as a standalone API. In the future, the project might expand to include hosting the HSDS Transformer as a cloud service that anyone can use to transform their data, eliminating many of these technical requirements. Interested in using these new tools? Open Referral wants to hear your feedback. For example, would it be useful to develop an extract-transform-load API, hosted in the cloud, that enables recurring transformation of non-standardised human service directory data sources into an HSDS-compliant data package? You can reach them via their GitHub repos.

Further reading: openreferral.org
Repository: https://github.com/openreferral/hsds-transformer
HSDS Transformer: https://openreferral.github.io/hsds-transformer/

Neuroscience Experiments System Frictionless Tool

- December 16, 2019 in Frictionless Data, Open Knowledge

This blog is part of a series showcasing projects developed during the 2019 Frictionless Data Tool Fund.  The 2019 Frictionless Data Tool Fund provided four mini-grants of $5,000 to support individuals or organisations in developing an open tool for reproducible research built using the Frictionless Data specifications and software. This fund is part of the Frictionless Data for Reproducible Research project, which is funded by the Sloan Foundation. This project applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate reproducible data workflows in research contexts.  


Neuroscience Experiments System Frictionless Data Incorporation, by the Technology Transfer team of the Research, Innovation and Dissemination Center for Neuromathematics.

The Research, Innovation and Dissemination Center for Neuromathematics (RIDC NeuroMat) is a research center established in 2013 by the São Paulo Research Foundation (FAPESP) at the University of São Paulo, in Brazil. A core mission of NeuroMat is the development of open-source computational tools to aid in scientific dissemination and advance open knowledge and open science. To this end, the team has created the Neuroscience Experiments System (NES), which is an open-source tool to assist neuroscience research laboratories in routine procedures for data collection. To more effectively understand the function and treatment of brain pathologies, NES aids in recording data and metadata from various experiments, including clinical data, electrophysiological data, and fundamental provenance information. NES then stores that data in a structured way, allowing researchers to seek and share data and metadata from those neuroscience experiments. For the 2019 Tool Fund, the NES team, particularly João Alexandre Peschanski, Cassiano dos Santos and Carlos Eduardo Ribas, proposed to adapt their existing export component to conform to the Frictionless Data specifications.

Public databases are seen as crucial by many members of the neuroscience community as a means of moving science forward. However, simply opening up data is not enough; it should be created in a way that can be easily shared and used. For example, data and metadata should be readable by both researchers and machines, yet they typically are not. When the NES team learned about Frictionless Data, they were interested in trying to implement the specifications to help make the data and metadata in NES machine readable. For them, the advantage of the Frictionless Data approach was being able to standardize data opening and sharing within the neuroscience community.

Before the Tool Fund, NES had an export component that set up a file with folders and documents containing information on an entire experiment (including data collected from participants, device metadata, questionnaires, etc.), but they wanted to improve this export to be more structured and open. By implementing the Frictionless Data specifications, the resulting export component includes the Data Package (datapackage.json) and the folders/files inside the archive, with a root folder called data. With this new “frictionless” export component, researchers can transport and share their exported data with other researchers in a recognized open standard format (the Data Package), facilitating the understanding of that exported data. They have also incorporated Goodtables into the unit tests to check data structure.

The RIDC NeuroMat team’s expectation is that many researchers, particularly neuroscientists and experimentalists, will have an interest in using the freely available NES tool. With the anonymization of sensitive information, the data collected using NES can be made publicly available through the NeuroMat Open Database, allowing any researcher to reproduce the experiment or simply use the data in a different study. In addition to storing collected experimental data and being a tool for guiding and documenting all the steps involved in a neuroscience experiment, NES has an integration with the Neuroscience Experiment Database, another NeuroMat project, based on a REST API, where NES users can send their experiments to become publicly available for other researchers to reproduce them or to use as inspiration for further experiments.
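The NES test code itself lives in the project repository, but a unit test that checks an export with Goodtables can look roughly like the sketch below; the export path is a placeholder.

```python
# Sketch: validate the structure of an NES-style export in a unit test.
import unittest

from goodtables import validate


class TestExportStructure(unittest.TestCase):
    def test_export_is_a_valid_data_package(self):
        report = validate("path/to/export/datapackage.json")  # placeholder path
        self.assertTrue(report["valid"], msg=report)


if __name__ == "__main__":
    unittest.main()
```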
Screenshot of the export of an experiment.
Screenshot of the export of data on participants.
Picture of a hypothetical export file tree of type Per Experiment after the Frictionless Data implementation.

Further reading:
Repository: https://github.com/neuromat/nes
User manual: https://nes.readthedocs.io/en/latest/
NeuroMat blog: https://neuromat.numec.prp.usp.br/
Post on NES at the NeuroMat blog: https://neuromat.numec.prp.usp.br/content/a-pathway-to-reproducible-science-the-neuroscience-experiments-system/

Announcing Frictionless Data Joint Stewardship

- December 12, 2019 in Frictionless Data, Open Knowledge

We are pleased to announce joint stewardship of Frictionless Data between the Open Knowledge Foundation and Datopian. While this collaboration already exists informally, we are solidifying how we are leading together on future Frictionless Data projects and goals. What does this mean for users of Frictionless Data software and specifications?

First, you will continue to see a consistent level of activity and support from Open Knowledge Foundation, with a particular focus on the application of Frictionless Data for reproducible research, as part of our three-year project funded by the Sloan Foundation. This also includes specific contributions to the development of the Frictionless Data specifications under the leadership of Rufus Pollock, Datopian President and Frictionless Data creator, and Paul Walsh, Datopian CEO and long-time contributor to the specifications and software.

Second, there will be increased activity in software development around the specifications, with a larger team across both organisations contributing to key codebases such as Good Tables, the various integrations with backend storage systems such as Elasticsearch, BigQuery, and PostgreSQL, and data science tooling such as Pandas. Additionally, based on their CKAN commercial services work and co-stewardship of the CKAN project, Datopian look forward to providing more integrations of Frictionless Data with CKAN, building on existing work done at the Open Knowledge Foundation.

Our first joint project is redesigning the Frictionless Data website. Our goal is to make the project more understandable, usable, and user-focused. At this point, we are actively seeking user input, and are requesting interviews to help inform the new design. Have you used our website and are interested in having your opinion heard? Please get in touch to give us your ideas and feedback on the site. Focusing on user needs is a top goal for this project.

Ultimately, we are focused on leading the project openly and transparently, and are excited by the opportunities that clarification of the leadership of the project will provide. We want to emphasize that the Frictionless Data project is community focused, meaning that we really value the input and participation of our community of users. We encourage you to reach out to us on Discuss, in Gitter, or open issues in GitHub with your ideas or problems.

Frictionless DarwinCore Tool by André Heughebaert

- December 9, 2019 in Frictionless Data, Open Knowledge, Open Research, Open Science, Open Software, Technical

This blog is part of a series showcasing projects developed during the 2019 Frictionless Data Tool Fund. The 2019 Frictionless Data Tool Fund provided four mini-grants of $5,000 to support individuals or organisations in developing an open tool for reproducible research built using the Frictionless Data specifications and software. This fund is part of the Frictionless Data for Reproducible Research project, which is funded by the Sloan Foundation. This project applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate reproducible data workflows in research contexts.

Frictionless DarwinCore, developed by André Heughebaert

André Heughebaert is an open biodiversity data advocate in his work and his free time. He is an IT Software Engineer at the Belgian Biodiversity Platform and is also the Belgian GBIF (Global Biodiversity Information Facility) Node manager. In these roles, he has worked with the Darwin Core Standards and open biodiversity data on a daily basis. This work inspired him to apply for the Tool Fund, where he has developed a tool to convert DarwinCore Archives into Frictionless Data Packages.

The DarwinCore Archive (DwCA) is a standardised container for biodiversity data and metadata largely used amongst the GBIF community, which consists of more than 1,500 institutions around the world. The DwCA is used to publish biodiversity data about observations, collection specimens, species checklists and sampling events. However, this domain-specific standard has some limitations, mainly the star schema (core table + extensions), rules that are sometimes too permissive, and a lack of controlled vocabularies for certain terms. These limitations encouraged André to investigate emerging open data standards. In 2016, he discovered Frictionless Data and published his first data package on historical data from the 1815 Napoleonic Campaign of Belgium. He was then encouraged to create a tool that would, in part, build a bridge between these two open data ecosystems.

As a result, the Frictionless DarwinCore tool converts DwCA into Frictionless Data Packages, and also gives access to the vast Frictionless Data software ecosystem, enabling constraints validation and support of a fully relational data schema. Technically speaking, the tool is implemented as a Python library, and is exposed as a Command Line Interface. The tool automatically converts:
  • the DwCA data schema into datapackage.json
  • EML metadata into a human-readable markdown README file
  • data files, when necessary (that is, when default values are described)
The resulting zip file complies with both the DarwinCore and Frictionless specifications.

André hopes that bridging the two standards will give an excellent opportunity for the GBIF community to provide open biodiversity data to a wider audience. He says this is also a good opportunity to discover the Frictionless Data specifications and assess their applicability to the biodiversity domain. In fact, on 9th October 2019, André presented the tool at a GBIF Global Nodes meeting. It was perceived by the nodes managers community as an exploratory and pioneering work. While the command line interface offers a simple user interface for non-programmers, others might prefer the more flexible and sophisticated Python API. André encourages anyone working with DarwinCore data, including all data publishers and data users of the GBIF network, to try out the new tool.
“I’m quite optimistic that the project will feed the necessary reflection on the evolution of our biodiversity standards and data flows.”

To get started, installation of the tool is done through a single pip install command (full directions can be found in the project README). Central to the tool is a table of DarwinCore terms linking a Data Package type, format and constraints for every DwC term. The tool can be used as a CLI directly from your terminal window or as a Python library for developers. The tool can work with either locally stored or online DwCA. Once converted to a Tabular Data Package, the DwC data can then be ingested and further processed by software such as Goodtables, OpenRefine or any other Frictionless Data software. André has aspirations to take the Frictionless DarwinCore tool further by encapsulating the tool in a web service that will directly deliver Goodtables reports from a DwCA, which will make it even more user friendly. An additional idea for further improvement would be including an import pathway for DarwinCore data into OpenRefine, which is a popular tool in the GBIF community. André’s long-term hope is that the Data Package will become an optional format for data download on GBIF.org.

Further reading:
Repository: https://github.com/frictionlessdata/FrictionlessDarwinCore
Project blog: https://andrejjh.github.io/fdwc.github.io/

Meet Lily Zhao, one of our Frictionless Data for Reproducible Research Fellows

- November 18, 2019 in Frictionless Data

The Frictionless Data for Reproducible Research Fellows Programme is training early career researchers to become champions of the Frictionless Data tools and approaches in their field. Fellows will learn about Frictionless Data, including how to use Frictionless Data tools in their domains to improve reproducible research workflows, and how to advocate for open science. Working closely with the Frictionless Data team, Fellows will lead training workshops at conferences, host events at universities and in labs, and write blogs and other communications content.

I am thrilled to be joining the Open Knowledge Foundation community as a Frictionless Data fellow. I am an interdisciplinary marine scientist getting my PhD in the Ocean Recoveries Lab at the University of California Santa Barbara. I study how coral reefs, small-scale fisheries, and coastal communities are affected by environmental change and shifting market availability. In particular, I’m interested in how responsible, solutions-oriented science can help build resilience in these systems and improve coastal livelihoods. My current fieldwork is based in Mo’orea, French Polynesia. With an intricate tapestry of social dynamics and strong linkages between its terrestrial and marine environments, the island of Mo’orea is representative of the complexity of coral reef social-ecological systems globally. The reefs around Mo’orea are also some of the most highly studied in the world by scientists. In partnership with the University of French Polynesia and the Atiti’a Center, I recently interviewed local associations, community residents and the scientific community to determine how science conducted in Mo’orea can better serve residents of Mo’orea. One of our main findings is the need for increased access to the scientific process and open communication of scientific findings — both of which are tenets of an open science philosophy.

I was introduced to open data science just a year ago as part of the Openscapes program – a Mozilla Firefox and National Center for Ecological Analysis and Synthesis initiative. Openscapes connected me to the world of open software and made me acutely aware of the pitfalls of doing things the way I had always done them. This experience made me excited to learn new skills and join the global effort towards reproducible research. With these goals in mind, I was eager to apply for the Frictionless Data Fellowship, where I could learn and share new tools for data reproducibility. So far as a Frictionless Data Fellow, I have particularly enjoyed our conversations about “open” for whom? That is: who is open data science open for? And how can we push to increase inclusivity and access within this space?

A little bit about open data in the context of coral reef science

Coral reefs provide food, income, and coastal protection to over 500 million people worldwide. Yet globally, coral reefs are experiencing major disturbances, with many already past their ecological tipping points. Total coral cover (the abundance of coral seen on a reef) is the simplest and most highly used metric of coral resistance and recovery to climate change and local environmental stressors. However, to the detriment of coral reef research, there is not an open global database of coral cover data for researchers to build off of. The effort and money taken to conduct underwater surveys make coral cover data highly coveted and thus these data are often not available publicly. In the future, I hope to collaborate with researchers around the world to build an open, global database of coral cover data. Open datasets and tools, when used by other researchers, show promise in their ability to efficiently propel research forward. In other fields, open science has accelerated the rate of problem-solving and new discoveries. In the face of climate change, the ability to not reinvent the wheel with each new analysis can allow us to conduct reef resilience research at the speed with which coral reef degradation necessitates. Ultimately, I deeply believe that maintaining coral-dominated ecosystems will require: 1) amplification of the perspectives of coastal communities; and 2) open collaboration and data accessibility among scientists worldwide.

Frictionless Data for Reproducible Research Fellows Programme

More on Frictionless Data

The Fellows programme is part of the Frictionless Data for Reproducible Research project at Open Knowledge Foundation. This project, funded by the Sloan Foundation, applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate data workflows in research contexts. Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. Frictionless Data’s other current projects include the Tool Fund, in which four grantees are developing open source tooling for reproducible research. The Fellows programme will be running until June 2020, and we will post updates to the programme as they progress. • Originally published at http://fellows.frictionlessdata.io/blog/hello-lily/