
Clarifying the semantics of data matrices and results tables: a Frictionless Data Pilot

- July 21, 2020 in Frictionless Data, Genomics, pilot

As part of the Frictionless Data for Reproducible Research project, funded by the Sloan Foundation, we have started a Pilot collaboration with the Data Readiness Group at the Department of Engineering Science, University of Oxford; the group will be represented by Dr. Philippe Rocca-Serra, an Associate Member of Faculty. This Pilot will focus on removing the friction in reported scientific experimental results by applying the Data Package specifications. Written with Dr. Philippe Rocca-Serra.

Scientific experimental results are frequently published in ad hoc ways that are seldom consistent. For example, results are often deposited as idiosyncratic sets of Excel or tabular files that contain very little structure or description, making them difficult to use, understand and integrate. Interpreting such tables requires human expertise, which is both costly and slow, and leads to low reuse. Ambiguous tables of results can lead researchers to rerun analysis or computation over the raw data before they understand the published tables. This current approach is broken, does not fit users’ data mining workflows, and limits meta-analysis. A better procedure for organizing and structuring information would reduce unnecessary use of computational resources, which is where the Frictionless Data project comes into play. This Pilot collaboration aims to help researchers publish their results in a more structured, reusable way.

In this Pilot, we will use (and possibly extend) Frictionless tabular data packages to devise both generic and specialized templates that can be used to report experimental results unambiguously. Our short-term goal is to develop a set of Frictionless Data Packages for targeted use cases where impact is high. We will focus first on creating templates for statistical comparison results in genomics research, such as differential analysis, enrichment analysis, high-throughput screens, and univariate comparisons, by using the STATO ontology within tabular data packages.

Our longer-term goal is for these templates to be incorporated into publishing systems to allow for clearer reporting of results, more knowledge extraction, and more reproducible science. For instance, we anticipate that this work will allow for increased consistency of table structure in publications, as well as increased data reuse owing to predictable syntax and layout. We also hope this work will ease the creation of linked data graphs from tables of results thanks to clarified semantics. An additional goal is to create code that is compatible with R’s ggplot2 library, which would allow for easy generation of data analysis plots. To this end, we plan on working with R developers in the future to create a package that will generate Frictionless Data compliant data packages.

This work has recently begun and will continue throughout the year. We have already met with some challenges, such as working on ways to transform, or normalize, data and ways to incorporate RDF linked data (you can read our related conversations in GitHub). We are also working on how to define a ‘generic’ table layout definition that is broad enough to be reused in as wide a range of situations as possible.
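To make the idea concrete, here is a minimal sketch of what such a results-table template could look like as a tabular data package, written with the Python datapackage library. The file name, field names and ontology annotation are hypothetical placeholders for illustration only, not the Pilot's actual templates:

```python
from datapackage import Package

# Sketch of a results-table template as a tabular data package.
# File name, field names and the ontology IRI are illustrative placeholders.
package = Package({
    "name": "differential-analysis-results",      # must match ^([-a-z0-9._/])+$
    "title": "Example differential analysis results table",
    "resources": [{
        "name": "results",
        "path": "results.csv",                    # hypothetical CSV of statistical results
        "profile": "tabular-data-resource",
        "schema": {
            "fields": [
                {"name": "feature_id", "type": "string",
                 "description": "Identifier of the tested feature (e.g. a gene)"},
                {"name": "log2_fold_change", "type": "number",
                 "description": "Estimated effect size for the comparison"},
                {"name": "p_value", "type": "number",
                 "description": "Unadjusted p-value",
                 "rdfType": "http://purl.obolibrary.org/obo/OBI_0000175"},  # assumed IRI for a 'p-value' term
            ]
        }
    }]
})

print(package.valid)               # True if the descriptor conforms to the specs
package.save("datapackage.json")   # write the descriptor for publication
```

Table Schema's rdfType field property is one place where an ontology term, such as one drawn from STATO, could attach to a column; whether the Pilot's templates end up using exactly this mechanism is still to be decided.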
If you are interested in staying up to date on this work, we encourage you to check out these repositories: https://gitlab.com/datascriptor/datascriptor-fldatapackages and https://github.com/ISA-tools/frictionless-collab. Additionally, we will (virtually) be at the eLife Sprint in September to work on closely related efforts, which you can read about here: https://sprint.elifesciences.org/data-paper-skeleton-tools-for-life-sciences/. Throughout this Pilot, we plan to reach out to the community to test these ideas and get feedback. Please contact us on GitHub or in Discord if you are interested in contributing.

Reflecting on the first cohort of Frictionless Data Reproducible Research fellows

- June 9, 2020 in Frictionless Data

It is truly bittersweet to say that we are at the end of the first cohort of the Frictionless Data Reproducible Research fellows. Over the past nine months, I have had the pleasure of working with Monica Granados, Selene Yang, Daniel Ouso and Lily Zhao during the fellows programme. Combining their diverse backgrounds (from government data to mapping data, from post-PhD to graduate student), they have spent many hours together learning how to advocate for open science and how to use the Frictionless Data code and tools to make their data more reproducible. Together, they have also written several blogposts, presented a talk and given a workshop. And they did all of this during a global pandemic! I feel lucky to have worked with them, and will be eagerly watching their contributions to the open science space. Each fellow wrote a final blogpost reflecting on their time with the programme. You can read the originals here, and I have also republished them below:

Lily Zhao: Reflecting on my time as a fellow

As one of the inaugural Reproducible Research Fellows of Frictionless Data, I am eager to share my experience of the program and of working with Sele, Ouso and Monica under the leadership of Lilly Winfree this year. I could not have asked for a better group of individuals to work remotely with. Sele, Ouso, Monica and I spent the last nine months discussing common issues in research reproducibility and delving into the philosophy behind open data science. Together we learned to apply Frictionless Data tools to our own data and mastered techniques for streamlining the reproducibility of our own research process. Lilly was an excellent mentor throughout the program and was always there to help with any issues we ran into. This was also one of my first experiences working entirely remotely on a team spread across many time zones. Through the use of Google Hangouts, Zoom and Slack the entire process was easier than I ever thought it could be. It is wonderful that through technology we are able to collaborate across the world more easily than ever before. We were also able to give multiple presentations together. Monica and I were joint speakers at csv,conf, where we talked about our experience as fellows and our experience using Frictionless Data tools. With so many people on the Zoom call it really felt like we were part of a large community. The four of us also led a hands-on workshop introducing the Data Package Creator and GoodTables web interface tools. This was especially fun for me because we used a subset of my French Polynesia interview data as practice data for all workshop participants. Many of the questions asked by participants mirrored questions the four of us had already worked through together, so it was great to be able to share what we had learned with others. I look forward to sharing these tools and the philosophy of open data science throughout my career and am very grateful to the Open Knowledge Foundation for this amazing learning opportunity. If you would like to learn more about my experience in the Frictionless Data Fellows program, please feel free to reach out to me personally!

Monica, Sele, Lilly, Ouso and I on our most recent conference call :)

Monica Granados: Gimme Dat Data (in a validated Data Package)

As a scientist I collect a lot of data, especially about animals that live in the water – fish, mussels, crayfish. This data is not only useful to me but can be used by others to improve the statistical power of their studies, or to increase geographic range or phylogenetic diversity, for example. Prior to the Frictionless Data for Reproducible Research Fellowship, I had my data on GitHub along with a script that would use RCurl to pull the data from the repository. While the repository was accompanied by a README, the file didn’t have much information other than the manuscript which included the data. This structure facilitated reproducibility but not reusability. Conceivably, if you wanted to use my data for your own experiments you could have contextualised the data by using the relevant manuscript, but it still would have been a challenge without any metadata, not to mention any potential structural errors you could have encountered that I didn’t catch when I uploaded the data. It was through the introduction of the Frictionless tools, however, that I realised there was more I could do to make my science even more transparent, reproducible and reusable. The fellowship syllabus was structured in such a way that by learning about the tools we learned what the tools were facilitating – better data sharing. The fellows would learn how to use a tool through a self-guided lesson and then answer questions on Slack which asked us to interrogate why the tool was built the way it was. These lessons were also supported by calls with the full cohort of fellows where we discussed what we had learned and the problems we were encountering as we used the tools with our own data, and reviewed papers on open science. The fellowship culminated with a workshop delivered by all four fellows and attended by over 40 participants, and a presentation at csv,conf. Now when I share data as a data package I know I have validated my tabular data for structural errors and that the file contains metadata that contextualises the data. Having the opportunity to be a part of the inaugural cohort has been a wonderful experience. I learned new tools and information that I will take and share for the rest of my career, but I also gained new colleagues and open science friends in my fellow fellows.

Daniel Ouso: Better Data, one resource at a time – my fellowship experience

Getting into the Frictionless Data fellowship

My background is largely in molecular biology, particularly infection diagnostics targeting arthropod viruses, bacteria and protozoa. My bioinformatics experience is relatively shorter, but this is the direction I am passionate about building my research career in. I first heard about Frictionless Data from the African Carpentries instructors’ mailing list, where the inaugural fellowship call had been shared by Anelda. I caught it in the nick of time, right at the submission deadline! By the way, you can watch for annual calls and other interesting stuff by following @frictionlessd8a. The call for the second cohort was open from late April and just closed in June. The fellowship starts in September.

On-boarding

Lilly arranged a first meeting to usher me into the fellowship after a few email exchanges. I was introduced to Jo Barrat, who patiently took me through my paces as I completed the logistical preliminaries. I was really looking forward to getting started. The onboarding also let me get acquainted with the rest of the fellows, awesome people. I was excited!

Context

Overall, the world is in search of, and is promoting, better ways to work with data, whether that is collecting data, making it accessible, finding novel ways to analyse high-throughput data, building dedicated workflows to publish data alongside conventional scientific publishing, moving and working with data across frameworks, or merely storage and security. All of these, plus other factors, provide avenues to interrogate data thoroughly and in multiple ways, thus improving its usefulness. This has arguably been under-appreciated in times past. Frictionless Data, through its Progressive Data Toolkit and with the help of organisations like OKF and funding from the Sloan Foundation, is dedicated to alleviating hindrances to some of the aforementioned efforts. Empowering people is a core part of the #BetterData dream.

The fellowship

One aspect of any research is the collection of data, which is used to test the hypotheses under study. The importance of data, good data for that matter, in research is therefore unquestionable. Approaches to data analysis may differ from field to field, yet there are conventional principles that do not discriminate between fields; such principles are the targets of Frictionless Data. I jumped at the opportunity to learn ways to ramp up the efficiency of my data workflows, with a touch of research openness and reproducibility. The journey took off with drawing up a meticulous roadmap, which I found very helpful, and seems to end with this – sharing my experience. In between, exciting things happened. In case one came in a little rusty with their basic Python or R, they were catered for early on, though you didn’t strictly need either to use the tools; literally zero programming skills are a prerequisite. There were a plethora of resources, and help from the fellows, not to mention from the ever-welcoming Lilly. The core sections of the Fellowship were prefaced by grasping basic components like JSON, the data interchange format on which the data package schema is based. Following these were the core tools and their specifications. The Data Package Creator tool is impressively emphatic about capturing metadata, a backbone theme for reproducibility. I found the Table Schema and Schema specifications initially confusing. Other fellows and I have previously written about the Data Package Creator and GoodTables, the tools for creating and validating data packages respectively. These tools are very progressive, continually incorporating feedback from the community, including fellows, to improve the user experience. So don’t be surprised at a few changes since the fellows’ blogs. In fact, a new entrant, which I only learned of recently, is the DataHub tool, described as “a useful solution for sharing datasets, and discovering high-quality datasets that others have produced”. I am yet to check it out. Besides the main focus of the fellowship, I got to learn a lot about organisational skills and tools such as GitHub projects, Toggl for time-monitoring, and general remote working, among others. I was introduced to new communities and initiatives such as PREreview; my first time participating in open research reviewing. The fellows were awesome to work with and Lilly Winfree provided the best mentorship. Sometimes problems are foreseen and contingencies planned; other times unforeseen surprises rear their heads into our otherwise “perfect” plan. Guess what? You nailed it! COVID-19. Such situations require adaptability akin to that of the fictional El Profesor in Money Heist. Since we could not organise the in-person seminar and/or workshops planned as part of the fellowship, we collectively opted for a virtual workshop. It went amazingly well.

What next

Acquired knowledge and skills become more useful when implemented. My goal is to apply them at every opportune opening and to keep learning other integrative tools. Yet there is also this about knowledge: it is meant to be spread. I hope to make up for the suspended social sessions and to keep engaging with @frictionlessd8a to continue advocating for open and reproducible research.

Conclusion

Tools that need minimal to no coding experience strongly support the adoption of good data hygiene practices, more so in places with scarce coding expertise. The FD tools will surely grease your workflows regardless of your coding proficiency, especially for tabular data. This is especially needed given the deluge of data persistently churned out from various sources. Frictionless Data is for everyone working with data: researchers, data scientists or data engineers. The ultimate goal is to work with data in an open and reproducible way, which is consistent with modern scientific research practice. A concerted approach is also key, and I am glad to have represented Africa in the fellowship. Do not hesitate to reach out if you think I can be a resource to your cause.

Sele Yang: Let’s keep reproducing knowledge!

A great journey has come to an end for the first cohort of the Frictionless Data for Reproducible Research Fellowship. It was a process of enormous and valuable learning that could only have happened thanks to the collaborative work of everyone who took part. At the beginning, I remember the great fear (which in some way still persists, though more faintly) of not having the technical skills required to carry out my project, but little by little I got to know my fellow fellows and felt supported by them; with great patience they took me by the hand so I would not get lost along the way. Thanks to this team, I wandered the beaches of Mo’orea through Lily’s data, and I learned about research approaches outside my own field with Ouso and Monica. I came to recognise the great work that researchers do to defend more open, equitable and accessible knowledge. Although our shared journey ends here, I can highlight that despite the COVID-19 crisis, which forced us to change many of our plans during the programme, we managed to meet, even if only virtually, not only to share among ourselves but also with a large audience at our workshop on the programme’s tools and methodologies. It was a great activity for reinforcing the importance of sharing knowledge and making it more accessible, all the more so in times of crisis. I thank the Open Knowledge Foundation for running this programme, and I invite everyone to explore the material we produced during these months of work. I finish this learning process with an even stronger conviction of how necessary collaborative processes that seek to open up and democratise science and knowledge are, especially in times like these, when collaboration and shared learning will make us stronger as a society.

Join the Frictionless Data workshop – 20 May

- April 28, 2020 in Frictionless Data

Join us on 20 May at 4pm UK / 10am CDT for a Frictionless Data workshop led by the Reproducible Research Fellows! This 1.5-hour workshop will cover an introduction to the open source Frictionless Data tools. Participants will learn about data wrangling, including how to document metadata, package data into a data package, write a schema to describe data, and validate data. The workshop is suitable for beginners and those looking to learn more about using Frictionless Data.

Everyone is welcome to join, but you must register to attend using this link.

The Fellows Programme is part of the Frictionless Data for Reproducible Research project overseen by the Open Knowledge Foundation. This project, funded by the Sloan Foundation, applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate data workflows in research contexts. At its core, Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. The core specification, the Data Package, is a simple and practical “container” for data and metadata. This workshop will be led by the members of the First Cohort of the Fellows Programme: Lily Zhao, Daniel Ouso, Monica Granados, and Selene Yang. You can read more about their work during this programme here: http://fellows.frictionlessdata.io/blog/. Additionally, applications are now open for the Second Cohort of Fellows. Read more about applying here: https://blog.okfn.org/2020/04/27/apply-now-to-become-a-frictionless-data-reproducible-research-fellow/

Apply now to become a Frictionless Data Reproducible Research Fellow

- April 27, 2020 in Frictionless Data

The Frictionless Data Reproducible Research Fellows Program, supported by the Sloan Foundation, aims to train graduate students, postdoctoral scholars, and early career researchers to become champions for open, reproducible research using Frictionless Data tools and approaches in their fields. Apply today to join the Second Cohort of Frictionless Data Fellows! Fellows will learn about Frictionless Data, including how to use Frictionless tools in their domains to improve reproducible research workflows, and how to advocate for open science. Working closely with the Frictionless Data team, Fellows will lead training workshops at conferences, host events at universities and in labs, and write blogs and other communications content. In addition to mentorship, we are providing Fellows with stipends of $5,000 to support their work and time during the nine-month Fellowship. We welcome applications using this form from 27th April until 1st June 2020, with the Fellowship starting in late Summer. We value diversity and encourage applicants from communities that are under-represented in science and technology, people of colour, women, people with disabilities, and LGBTI+ individuals.

Frictionless Data for Reproducible Research

The Fellowship is part of the Frictionless Data for Reproducible Research project at Open Knowledge Foundation, and is the second iteration. Frictionless Data aims to reduce the friction often found when working with data, such as when data is poorly structured, incomplete, hard to find, or is archived in difficult to use formats. This project, funded by the Sloan Foundation, applies our work to data-driven research disciplines, in order to help researchers and the research community resolve data workflow issues.  At its core, Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. The core specification, the Data Package, is a simple and practical “container” for data and metadata. The Frictionless Data approach aims to address identified needs for improving data-driven research such as generalized, standard metadata formats, interoperable data, and open-source tooling for data validation.

Frictionless Data for Reproducible Research Fellows Programme

Fellowship program

The First Cohort of Fellows ran from Fall 2019 until Summer 2020, and you can read more about their work on the Fellows blog: http://fellows.frictionlessdata.io/blog/. During the Fellowship, our team will be on hand to work closely with you as you complete the work. We will help you learn Frictionless Data tooling and software, and provide you with resources to help you create workshops and presentations. We will also announce Fellows on the project website and publish your blogs and workshop slides within our network channels. We will provide mentorship on how to work on an open project, and will work with you to achieve your Fellowship goals.

How to apply

The Fellowship is open to early career researchers, such as graduate students and postdoctoral scholars, anywhere in the world and in any scientific discipline. Successful applicants will be enthusiastic about reproducible research and open science, have some experience with communications, writing, or giving presentations, and have some technical skills (basic experience with Python, R, or Matlab, for example), but do not need to be technically proficient. If you are interested but do not have all of the qualifications, we still encourage you to apply. If you have any questions, please email the team at frictionlessdata@okfn.org, ask a question on the project’s Gitter channel, or check out the Fellows FAQ section. Apply soon, and share with your networks!

Announcing Frictionless Data Community Virtual Hangout – 20 April

- April 16, 2020 in Frictionless Data

Photo by William White on Unsplash

We are thrilled to announce we’ll be co-hosting a virtual community hangout with Datopian to share recent developments in the Frictionless Data community. This will be a 1-hour meeting where community members come together to discuss key topics in the data community. Here are some key discussions we hope to cover:
  • Introductions & share the purpose of this hangout.
  • Share the update on the new website release and general Frictionless Data related updates.
  • Have community members share their thoughts and general feedback on Frictionless Data.
  • Share information about CSV Conf.
The hangout is scheduled to happen on 20th April 2020 at 5 pm CET. If you would like to attend, you can sign up for the event in advance here. Everyone is welcome. Looking forward to seeing you there!

Tracking the Trade of Octopus (and Packaging the Data)

- March 13, 2020 in Frictionless Data, Open Knowledge

This blog is the second in a series done by the Frictionless Data Fellows, discussing how they created Frictionless Data Packages with their research data. Learn more about the Fellows Programme here http://fellows.frictionlessdata.io/. By Lily Zhao

Introduction

When I started graduate school, I was shocked to learn that seafood is actually the most internationally traded food commodity in the world. In fact, the global trade in fish is worth more than the trades of tea, coffee and sugar combined (Fisheries FAO, 2006). However, for many developing countries being connected to the global seafood market can be a double-edged sword. It is true that global trade has the potential to redistribute some wealth and improve the livelihoods of fishers and traders in these countries. But it can also promote illegal trade and overfishing, which can harm the future sustainability of a local food source. Over the course of my master’s degree, I developed a passion for studying these issues, which is why I am excited to share with you my experience turning some of the data my collaborators and I collected into a packaged dataset using the Open Knowledge Foundation’s Data Package tooling. These data provide a snapshot of the global market for octopus and how it is traded throughout and between Kenya, Tanzania and Mozambique before heading to European markets. This research project was an international collaboration between the Stockholm Resilience Centre in Sweden, the National Institute for Medical Research of Tanzania, Pwani University in Kilifi, Kenya, and the School of Marine and Environmental Affairs at the University of Washington. These data eventually became my master’s thesis, and this data package will complement a forthcoming publication of our findings. Specifically, these data are the prices and quantities at which middlemen in Tanzania and Kenya reported buying and selling octopus. These data are exciting because they not only inform our understanding of who is benefiting from the trade of octopus but could also assist in improving the market price of octopus in Tanzania. This is because value chain information can help Tanzania’s octopus fishery along its path to Marine Stewardship Council seafood certification. Seafood that earns the Marine Stewardship Council label gains a certain amount of credibility, which in turn can increase profit. For developing countries, this seafood label can provide a monetary incentive for improving fisheries management. But before Tanzania’s octopus fishery can get certified, they will need to prove they can trace the flow of their octopus supply chain and manage it sustainably. We hope that this packaged dataset will ultimately inform this effort.

Getting the data

To gather the data my field partner Chris and I went to 10 different fishing communities like this one.

Middlemen buy and sell seafood in Mtwara, Tanzania.

We went on to interview all the major exporters of octopus in both Tanzania and Kenya and spoke with company agents and octopus traders who bought their octopus from 570 different fishermen. With these interviews we were able to account for about 95% of East Africa’s international octopus market share.

My research partner, Chris Cheupe, and I at an octopus collection point.

Creating the Data Package

The Data Package tooling was created by the Open Knowledge Foundation to compile our data and metadata into a compact unit, making it easier and more efficient for others to access. You can create a data package using the online platform or using the Python or R programming libraries. I initially had some issues using the R package instead of the online tool, which may have been related to the fact that the original data file was not UTF-8 encoded. But stay tuned! For now, I made my data package using the Data Package Creator online tool. The tool helped me create a schema that outlines the data’s structure, including a description of each column. The tool also helps you outline the metadata for the dataset as a whole, including information like the license and author. Our dataset has a lot of complicated columns and the tool gave me a streamlined way to describe each column via the schema. Afterwards, I added the metadata using the left-hand side of the browser tool and checked to make sure that the data package was valid!
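For anyone who prefers the Python route mentioned above, a rough equivalent of what the online Creator does can be sketched with the datapackage library. The CSV file name and metadata below are hypothetical placeholders, not the actual dataset:

```python
from datapackage import Package

# Infer resources and a schema from a CSV file, much as the web Creator does.
# 'octopus_trade.csv' is a hypothetical file name used for illustration.
package = Package()
package.infer('octopus_trade.csv')

# Add package-level metadata, the equivalent of the Creator's left-hand pane.
package.descriptor['title'] = 'East African octopus trade interviews'
package.descriptor['licenses'] = [{
    'name': 'CC-BY-4.0',
    'title': 'Creative Commons Attribution 4.0',
    'path': 'https://creativecommons.org/licenses/by/4.0/',
}]
package.commit()                     # re-validate the descriptor after editing it

print(package.valid)                 # the same check as the green "valid" banner
package.save('datapackage.json')
```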

The green bar at the top of the screenshot indicates validity

If the information you provide for each column does not match the data within the columns, the package will not validate; instead, you will get an error like this:

The red bar at the top of the screenshot indicates invalidity
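The same structural checks that the web validator runs can also be done from Python with the goodtables library. This is a sketch against the descriptor produced earlier; the file name remains hypothetical:

```python
from goodtables import validate

# Validate every table in the data package against its recorded schema.
report = validate('datapackage.json')

print(report['valid'])                    # True when all tables pass
for table in report['tables']:
    for error in table['errors']:         # e.g. a value that does not match its field type
        print(error['code'], error.get('message'))
```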

Check out my final data package by visiting my GitHub repository!

Reference:

FAO Fisheries (2006). The State of World Fisheries and Aquaculture 2006. Rome: Food and Agriculture Organization of the United Nations.

Announcing the 2020 Frictionless Data Tool Fund

- March 2, 2020 in Frictionless Data

Apply for a mini-grant to build an open source tool for reproducible research using Frictionless Data tooling, specs, and code base.

Today, Open Knowledge Foundation is launching the second round of the Frictionless Data Tool Fund, a mini-grant scheme offering grants of $5,000 to support individuals or organisations in developing an open tool for reproducible science or research built using the Frictionless Data specifications and software. We welcome submissions of interest until 17th May 2020.

The Tool Fund is part of the Frictionless Data for Reproducible Research project at Open Knowledge Foundation. This project, funded by the Sloan Foundation, applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate reproducible data workflows in research contexts. At its core, Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. The core specification, the Data Package, is a simple and practical “container” for data and metadata.

With this announcement we are looking for individuals or organisations of scientists, researchers, developers, or data wranglers to build upon our existing open source tools and code base to create novel tooling for reproducible research. For this year’s Tool Fund, we would like the community to work on tools that can make a difference to researchers and scientists in the following fields/domains of science: biology, genetics, neuroscience, ecology, geology, and bioinformatics. The fund will be accepting submissions from now until 1st May, with projects starting mid-June and to be completed by the end of the year.

This builds on the success of the 2019 Tool Fund, which funded the creation of four tools: a tool to convert the biodiversity DarwinCore Archive into Frictionless data packages; a tool that bundles Open Referral data as data packages; a tool to export Neuroscience Experiments System data as data packages; and a tool to import and export data packages in Google Sheets.

Applications can be submitted by filling out this form by 1st May. The Frictionless Data team will notify all applicants whether they have been successful or not by mid-June at the very latest. Successful candidates will then be invited for interviews before the final decision is given. We will base our choice on evidence of technical capabilities and will also favour applicants who demonstrate an interest in the practical use of the Frictionless Data specifications. Preference will also be given to applicants who show an interest in working with and maintaining these tools going forward. For more questions on the fund, speak directly to us on our forum, on our Gitter chat or email us at frictionlessdata@okfn.org.

Data package is valid!

- February 28, 2020 in Frictionless Data

This blog is the second in a series done by the Frictionless Data Fellows, discussing how they created Frictionless Data Packages with their research data. Learn more about the Fellows Programme here: http://fellows.frictionlessdata.io/. By Ouso Daniel

The last few months have been exciting, to say the least. I dug deep into understanding how to minimise friction in data workflows and how to promote openness and reproducibility. Through the FD Field Guide I got to know the various FD software for improving data publishing workflows. We looked at a number of case studies where FD synergised well with reproducibility; one example is the eLife study. We also looked at contributing and coding best practices. Moreover, I found Understanding JSON Schema (by json-schema.org) a great guide to understanding the data package schema, which is JSON-based. It all culminated in the creation of a data package, an experience I now want to share. To quality-check the integrity of your data package, you must validate it before downloading it for sharing, among other things. The best you can get from that process is “Data package is valid!”. What about before then?

Data package

Simply put, I would say it is data coupled to its associated attributes in a JSON format. To marry the data to its attributes you will require an FD tool. Here is the one I created.

Data Package Creator (DPC)

The DPC gives you a data package. The good news is that it caters for both realms of users: programmers and GUI users. I will describe the latter case. It is a web app with three main components: the Metadata pane on the left; the Resources (the data) pane in the middle; and the Schema pane on the right (usually hidden, but it can be exposed by clicking the three-dots-in-curly-brackets icon).

The Data

I used data from my project evaluating the application of a molecular technique, high-resolution melting analysis, to the identification of wildlife species illegally targeted as bushmeat. I had two files containing tabular data: one with information on the samples analysed and the sequences deposited in GenBank, and the other on the blind validation of species identification across three mitochondrial markers. My data package thus had two resources. The data lived in my local repository, but I pushed it to GitHub in CSV format for easy accessibility.

Creating the Data Package

You may follow along, in detail, with the data package specifications. On the Resources pane, from left to right, I entered a name for my resource and the path. I pasted the raw GitHub link to my data into the provided path field and clicked the load button to the right. For local data, you may click the load button, which will open your local file system. The DPC automatically inferred the data structure and prompted me to load the inferred fields (columns). I double-checked that the data types for each field were correctly inferred, and added titles and descriptions. The data format for each field was left as the default. From the gear-wheel (settings) icon in the resource tab, I gave each of the two resources a title, description, format and encoding. The resource profile is also inferred automatically. All the field and resource metadata I entered is optional, except that we want to be intentionally reproducible and open. On the other hand, there is compulsory metadata for the general data package, in the Metadata pane: name and title. Be sure to get the name right: it must match the pattern ^([-a-z0-9._/])+$ for the data package to be valid, and this is the most probable error you might encounter. The data package provides for very rich metadata capture, which is one of its strengths for data reusability. There are three metadata categories, which must not be confused: data package metadata, resource metadata and field (column) metadata, respectively nested. After inputting all the necessary details in the DPC you have to validate your data package before downloading it; the two click-buttons for these purposes are at the bottom of the Metadata pane. Any errors will be captured and described at the very top of the Resources pane. Otherwise, you will see the title of this post, upon which you can download your data package and rename it accordingly, retaining the .json extension.
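The name pattern mentioned above can be checked with a couple of lines of Python before you ever open the Creator; the package names below are made up for illustration:

```python
import re

# A data package "name" must match this pattern to be valid.
NAME_PATTERN = re.compile(r'^([-a-z0-9._/])+$')

print(bool(NAME_PATTERN.match('bushmeat-hrm-validation')))   # True: lowercase, hyphens only
print(bool(NAME_PATTERN.match('Bushmeat HRM Validation')))   # False: uppercase letters and spaces
```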

Conclusion

I applied the DPC first-hand to my research, and so can you. We created a data package starting from and ending with the most widely used data organisation formats, CSV and JSON respectively (interoperability). We gave it adequate metadata to allow a stranger to comfortably make sense of the data (reusability) and provided licence information, CC-BY-SA-4.0 (accessibility). The data package is also uniquely identified and made available in a public repository on GitHub (findability). A FAIR data package. Moreover, the data package is very light (portable), making it easily shareable, open and reproducible. The package is holistic, containing metadata, data and a schema (a blueprint for the data structure and metadata). How do I use the data package, you may ask?
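As a hedged sketch of one answer, the Python datapackage library can load a published package straight from a URL and hand back the rows already typed according to the schema. The URL below is a placeholder, not the actual repository path:

```python
from datapackage import Package

# Load a published data package from wherever its datapackage.json is hosted.
# The URL is a placeholder used for illustration.
package = Package('https://raw.githubusercontent.com/<user>/<repo>/master/datapackage.json')

for resource in package.resources:
    print(resource.name, '-', resource.descriptor.get('title'))
    rows = resource.read(keyed=True)      # list of dicts, values cast to the schema types
    print(rows[:2])                       # peek at the first two records
```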

Way forward

Keep the term goodtables in mind; I will tell you how it is useful with the data package we just created. Until then, you can keep in touch by reading the periodic blogs about the Frictionless Data fellowship, where you will also find work by my colleagues Sele, Monica and Lily. Follow me and OKF on Twitter for flash updates.

Combating other people’s data

- February 18, 2020 in Frictionless Data, Open Knowledge

Frictionless Data Pipelines for Ocean Science

- February 10, 2020 in Frictionless Data, Open Knowledge

This blog post describes a Frictionless Data Pilot with the Biological and Chemical Oceanography Data Management Office (BCO-DMO). Pilot projects are part of the Frictionless Data for Reproducible Research project. Written by the BCO-DMO team members Adam Shepherd, Amber York, and Danie Kinkade, with development by Conrad Schloer. Scientific research is implicitly reliant upon the creation, management, analysis, synthesis, and interpretation of data. When properly stewarded, data hold great potential to demonstrate the reproducibility of scientific results and accelerate scientific discovery. The Biological and Chemical Oceanography Data Management Office (BCO-DMO) is a publicly accessible earth science data repository established by the National Science Foundation (NSF) for the curation of biological, chemical, and biogeochemical oceanographic data from research in coastal, marine, and laboratory environments. With the groundswell surrounding the FAIR data principles, BCO-DMO recognized an opportunity to improve its curation services to better support reproducibility of results, while increasing process efficiencies for incoming data submissions. In 2019, BCO-DMO worked with the Frictionless Data team at Open Knowledge Foundation to develop a web application called Laminar for creating Frictionless Data Package Pipelines that help data managers process data efficiently while recording the provenance of their activities to support reproducibility of results.
The mission of BCO-DMO is to provide investigators with data management services that span the full data lifecycle from data management planning, to data publication, and archiving.

BCO-DMO provides free access to oceanographic data through a web-based catalog with tools and features facilitating assessment of fitness for purpose. The result of this effort is a database containing over 9,000 datasets from a variety of oceanographic and limnological measurements including those from: in situ sampling, moorings, floats and gliders, sediment traps; laboratory and mesocosm experiments; satellite images; derived parameters and model output; and synthesis products from data integration efforts. The project has worked with over 2,600 data contributors representing over 1,000 funded projects.  As the catalog of data holdings continued to grow in both size and the variety of data types it curates, BCO-DMO needed to retool its data infrastructure with three goals. First, to improve the transportation of data to, from, and within BCO-DMO’s ecosystem. Second, to support reproducibility of research by making all curation activities of the office completely transparent and traceable. Finally, to improve the efficiency and consistency across data management staff. Until recently, data curation activities in the office were largely dependent on the individual capabilities of each data manager. While some of the staff were fluent in Python and other scripting languages, others were dependent on in-house custom developed tools. These in-house tools were extremely useful and flexible, but they were developed for an aging computing paradigm grounded in physical hardware accessing local data resources on disk. While locally stored data is still the convention at BCO-DMO, the distributed nature of the web coupled with the challenges of big data stretched this toolset beyond its original intention. 
In 2015, we were introduced to the idea of data containerization and the Frictionless Data project in a Data Packages BoF at the Research Data Alliance conference in Paris, France. After evaluating the Frictionless Data specifications and tools, BCO-DMO developed a strategy to underpin its new data infrastructure on the ideas behind this project.
While the concept of data packaging is not new, the simplicity and extensibility of the Frictionless Data implementation made it easy to adopt within an existing infrastructure. BCO-DMO identified the Data Package Pipelines (DPP) project in the Frictionless Data toolset as key to achieving its data curation goals. DPP implements the philosophy of declarative workflows, which trade imperative code in a specific programming language (code that tells a computer how a task should be completed) for declarative, structured statements that detail what should be done. These structured statements abstract the user writing the statements from the actual code executing them, and are useful for reproducibility over long periods of time, where programming languages age and change or algorithms improve. This flexibility was appealing because it meant the intent of the data manager could be translated into many varying programming (and data) languages over time without having to refactor older workflows. In data management, that means that one of the languages a DPP workflow captures is provenance – a common need across oceanographic datasets for reproducibility. DPP workflows translated into records of provenance explicitly communicate to data submitters and future data users what BCO-DMO did during the curation phase. Secondly, because workflow steps need to be interpreted by computers into code that carries out the instructions, it helped data management staff converge on a declarative language they could all share. This convergence meant cohesiveness, consistency, and efficiency across the team if we could implement DPP in a way they could all use. In 2018, BCO-DMO formed a partnership with Open Knowledge Foundation (OKF) to develop a web application that would help any BCO-DMO data manager use the declarative language they had developed in a consistent way. Why develop a web application for DPP? As the data management staff evaluated DPP and Frictionless Data, they found that there was a learning curve to setting up the DPP environment, and that a deep understanding of the Frictionless Data ‘Data Package’ specification was required. The web application abstracted away this required knowledge to achieve two main goals: 1) consistently structured Data Packages (datapackage.json) with all the required metadata employed at BCO-DMO, and 2) efficiencies of time by eliminating typos and syntax errors made by data managers. Thus, the partnership with OKF focused on making the needs of scientific research data a possibility within the Frictionless Data ecosystem of specs and tools.
Data Package Pipelines is implemented in Python and comes with some built-in processors that can be used in a workflow. BCO-DMO took its own declarative language and identified gaps in the built-in processors. For these gaps, BCO-DMO and OKF developed Python implementations of the missing declarations to support the curation of oceanographic data, and the result was a new set of processors made available on GitHub.
Some notable BCO-DMO processors are:
  • boolean_add_computed_field – computes a new field to add to the data, indicating whether a particular row satisfies a certain set of criteria. Example: where Cruise_ID = ‘AT39-05’ and Station = 6, set Latitude to 22.1645.
  • convert_date – converts any number of fields containing date information into a single date field with display format and timezone options. Often date information is reported in multiple columns such as `year`, `month`, `day`, `hours_local_time`, `minutes_local_time`, `seconds_local_time`. For spatio-temporal datasets, it’s important to know the UTC date and time of the recorded data to ensure that searches for data within a time range are accurate. Here, these columns are combined to form an ISO 8601-compliant UTC datetime value.
  • convert_to_decimal_degrees – converts a single field containing coordinate information from degrees-minutes-seconds or degrees-decimal_minutes to decimal_degrees. The standard representation at BCO-DMO for spatial data conforms to the decimal degrees specification; a schematic sketch of this kind of processor appears below.
  • reorder_fields – changes the order of columns within the data. This is a convention within the oceanographic data community to put certain columns at the beginning of tabular data to help contextualize the following columns. Examples of columns that are typically moved to the beginning are: dates, locations, instrument or vessel identifiers, and depth at collection.
The remaining processors used by BCO-DMO can be found at https://github.com/BCODMO/bcodmo_processors.
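For readers curious what one of these custom processors looks like under the hood, here is a schematic sketch written against the process() helper that datapackage-pipelines documents for simple row-level steps. The field name and regex are hypothetical, and this is not the actual BCO-DMO implementation, which lives in the repository linked above:

```python
import re

from datapackage_pipelines.wrapper import process

# Hypothetical degrees-minutes-seconds pattern, e.g. "22 9 52.2" or "22° 9' 52.2\"".
DMS = re.compile(r"(?P<deg>-?\d+)\D+(?P<min>\d+)\D+(?P<sec>\d+(?:\.\d+)?)")

def process_row(row, row_index, spec, resource_index, parameters, stats):
    # Convert a DMS latitude string into decimal degrees, leaving other values untouched.
    match = DMS.match(str(row.get('Latitude', '')))
    if match:
        deg = float(match.group('deg'))
        minutes = float(match.group('min'))
        seconds = float(match.group('sec'))
        sign = -1 if deg < 0 else 1
        row['Latitude'] = sign * (abs(deg) + minutes / 60 + seconds / 3600)
    return row

process(process_row=process_row)
```

In a real pipeline, a step like this would be referenced from the pipeline-spec.yaml file, which is exactly the file Laminar writes out for the data manager, as described below.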

How can I use Laminar?

In our collaboration with OKF, BCO-DMO developed use cases based on real-world data submissions. One such example is a recent Arctic Nitrogen Fixation Rates dataset. The original dataset needed the following curation steps to make the data more interoperable and reusable:
  • Convert lat/lon to decimal degrees
  • Add timestamp (UTC) in ISO format
  • ‘Collection Depth’ with value “surface” should be changed to 0
  • Remove parentheses and units from column names (field descriptions and units are captured in the metadata).
  • Remove spaces from column names
The web application, named Laminar, built on top of DPP, helps data managers at BCO-DMO perform these operations in a consistent way. First, Laminar prompts us to name and describe the pipeline being developed; it assumes that the data manager wants to load some data to start the pipeline, and prompts for a source location. After providing a name and description of our DPP workflow, we provide a data source to load and give it the name ‘nfix’. In subsequent pipeline steps, we refer to ‘nfix’ as the resource we want to transform. For example, to convert the latitude and longitude into decimal degrees, we add a new step to the pipeline, select the ‘Convert to decimal degrees’ processor (a proxy for our custom processor convert_to_decimal_degrees), select the ‘nfix’ resource, select a field from that ‘nfix’ data source, and specify the Python regex pattern identifying where the values for the degrees, minutes and seconds can be found in each value of the latitude column. Similarly, in step 7 of this pipeline, we generate an ISO 8601-compliant UTC datetime value by combining the pre-existing ‘Date’ and ‘Local Time’ columns. After the pipeline is completed, the interface displays all steps and lets the data manager execute the pipeline by clicking the green ‘play’ button at the bottom. This button generates the pipeline-spec.yaml file, executes the pipeline, and can display the resulting dataset. The resulting DPP workflow contained 223 lines across this 12-step operation; for a data manager, the web application reduces the chance of error compared with generating such a pipeline by hand. Ultimately, our work with OKF helped us develop processors that follow the DPP conventions.
Our goal for the pilot project with OKF was to have BCO-DMO data managers using Laminar to process 80% of the data submissions we receive. The pilot was so successful that data managers have processed 95% of new data submissions to the repository using the application.
This is exciting from a data management perspective because the use of Laminar is more sustainable, and it acted to bring the team together to determine the best strategies for processing, documentation, and so on. This increase in consistency and efficiency is welcome from an administrative perspective and helps with the training of any new data managers joining the team. The OKF team are excellent partners, who were the catalysts of a successful project. The next steps for BCO-DMO are to build on the success of the Frictionless Data Package Pipelines by implementing the Frictionless Data Goodtables specification for data validation, to help us develop submission guidelines for common data types. Special thanks to the OKF team – Lilly Winfree, Evgeny Karev, and Jo Barrett.