You are browsing the archive for Frictionless Data.

Announcing the New Frictionless Framework

- October 8, 2020 in Frictionless Data

By Evgeny Karev & Lilly Winfree

Frictionless Framework

We are excited to announce our new high-level Python framework, frictionless-py: https://github.com/frictionlessdata/frictionless-py. Frictionless-py was created to simplify the overall user experience of working with Frictionless Data in Python. It provides several high-level improvements in addition to many low-level fixes. Read more details below, or watch this intro video by Frictionless developer Evgeny: https://youtu.be/VPnC8cc6ly0

Why did we write new Python code?

Frictionless Data has been in development for almost a decade, with global users and projects spanning domains from science to government to finance. However, our main Python libraries (datapackage-py, goodtables-py, tableschema-py, tabulator-py) were originally built with some inconsistencies that have confused users over the years. We had started redoing our documentation for our existing code, and realized we had a larger issue on our hands – mainly that the disparate Python libraries had overlapping functionalities and we were not able to clearly articulate how they all fit together to form a bigger picture. We realized that overall, the existing user experience was not where we wanted it to be. Evgeny, the Frictionless Data technical lead developer, had been thinking about ways to improve the Python code for a while, and the outcome of that work is frictionless-py.

What happens to the old Python code (datapackage-py, goodtables-py, tableschema-py, tabulator-py)? How does this affect current users?

Datapackage-py (see details), tableschema-py (see details), and tabulator-py (see details) still exist, will not be altered, and will be maintained. If your project uses this code, these changes are not breaking and there is no action you need to take at this point. However, we will be focusing new development on frictionless-py, and we encourage you to start experimenting with frictionless-py during the last months of 2020 and to migrate to it starting in 2021 (here is our migration guide). The one important thing to note is that goodtables-py has been subsumed by frictionless-py (since version 3 of Goodtables). We will continue to fix bugs in goodtables@2.x in this branch, and it is also still available on PyPI as before. Please note that the frictionless@3.x API is not yet stable, as we are continuing to work on it at the moment. We will release frictionless@4.x by the end of 2020 as the first SemVer/stable version.

What does frictionless-py do?

Frictionless-py has four main functions for working with data: describe, extract, validate, and transform. These are inspired by typical data analysis and data management methods.

Describe your data: you can infer, edit, and save the metadata of your data tables. This is a first step for ensuring data quality and usability. Frictionless metadata includes general information about your data, like a textual description, as well as field types and other tabular data details.

Extract your data: you can read your data using a unified tabular interface. Data quality and consistency are guaranteed by a schema. Frictionless supports various file protocols, like HTTP, FTP, and S3, and data formats, like CSV, XLS, JSON, SQL, and others.

Validate your data: you can validate data tables, resources, and datasets. Frictionless generates a unified validation report and supports many options to customize the validation process.

Transform your data: you can clean, reshape, and transfer your data tables and datasets. Frictionless provides a pipeline capability and a lower-level interface for working with the data.

Additional features:
  • Powerful Python framework
  • Convenient command-line interface
  • Low memory consumption for data of any size
  • Reasonable performance on big data
  • Support for compressed files
  • Custom checks and formats
  • Fully pluggable architecture
  • An included API server
  • More than 1000 tests
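As a taste of how the four verbs look in code, here is a minimal, hedged sketch. The file name "table.csv" is a placeholder, and the exact signatures may vary between frictionless-py releases:

```python
# A minimal sketch of the high-level API; "table.csv" is a placeholder file.
from frictionless import describe, extract, validate

metadata = describe("table.csv")   # infer a schema and other metadata
rows = extract("table.csv")        # read the data through the unified interface
report = validate("table.csv")     # check the data and get a unified report

print(metadata)
print(rows[:3])
print(report.valid)
```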

How can users get started?

We recommend that you begin by reading the Getting Started Guide and the Introduction Guide. We also have in-depth documentation for Describing Data, Extracting Data, Validating Data, and Transforming Data.

How can you give us feedback?

What do you think? Let us know your thoughts, suggestions, or issues by joining us in our community chat on Discord or by opening an issue in the frictionless-py repo: https://github.com/frictionlessdata/frictionless-py/issues.

FAQs

Where’s the documentation?

Are you a new user? Start here: Getting Started & Introduction Guide.
Are you an existing user? Start here: Migration Guide.
The full list of documentation can be found here: https://github.com/frictionlessdata/frictionless-py#documentation

What’s the difference between datapackage and frictionless?

In general, frictionless is our new generation software, while tabulator/tableschema/datapackage/goodtables are our previous generation software. Frictionless includes many improvements over them. Please see this issue for the full answer and a code example: https://github.com/frictionlessdata/frictionless-py/issues/428
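As a rough, hedged sketch of the flavour of the change (the resource name "table" and the file names are placeholders; see the issue linked above for the canonical example):

```python
# Previous generation: separate libraries (here, datapackage-py).
from datapackage import Package as OldPackage

old_package = OldPackage("datapackage.json")         # load a descriptor
old_rows = old_package.get_resource("table").read()  # read one resource's data

# New generation: a single frictionless-py entry point for the same task.
from frictionless import Package

package = Package("datapackage.json")
rows = package.get_resource("table").read_rows()
```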

I’ve spotted a bug – where do I report it?

Let us know by opening an issue in the frictionless-py repo: https://github.com/frictionlessdata/frictionless-py/issues. For tabulator/tableschema/datapackage issues, please use the corresponding issue tracker and we will triage it for you. Thanks!

I have a question – where do I get help?

You can ask us questions in our Discord chat and someone from the main developer team or from the community will help you. Here is an invitation link: https://discord.com/invite/j9DNFNw. We also have a Twitter account (@frictionlessd8a) and community calls where you can come meet the team and ask questions: http://frictionlessdata.io/events/

I want to help – how do I contribute?

Amazing, thank you! We always welcome community contributions. Start here (https://frictionlessdata.io/contribute/) and here (https://github.com/frictionlessdata/frictionless-py/blob/master/CONTRIBUTING.md) and you can also reach out to Evgeny (@roll) or Lilly (@lwinfree) on GitHub if you need help.

Additional Links/Resources

An update from the 2020 Frictionless Tool Fund grantees

- September 30, 2020 in Frictionless Data

We are excited to share project updates from our 2020 Frictionless Data Tool Fund! Our five grantees are about half-way through their projects and have written updates below to share with the community. These grants have been awarded to projects using Frictionless Data to improve reproducible data workflows in various research contexts. Read on to find out what they have been working on and ways that you can contribute!

Carles Pina Estany: Schema Collaboration

The goal of the schema-collaboration Tool Fund is to create an online platform that enables data managers and researchers to collaborate on describing their data by writing Frictionless Data Package schemas. The basics can be seen and tested on the online instance of the platform: the data manager can create a package, assign data packages to researchers, add comments, and send a link to the researchers, who will use datapackage-ui to edit the package and save it, making it available to the data manager. The next steps are to add extra fields to datapackage-ui and to work on the integration between schema-collaboration and datapackage-ui to make maintenance easier. Carles also plans to output the data package as a PDF to help data managers and researchers spot errors. Progress can be followed through the project wiki, and feedback is welcome through GitHub issues. Read more about Carles' project here: https://frictionlessdata.io/blog/2020/07/16/tool-fund-polar-institute/

Simon Tyrrell: Frictionless Data for Wheat

As part of the Designing Future Wheat project, Simon and his team maintain repositories containing a wide variety of heterogeneous data, and they are trying to standardise how these datasets and their associated metadata are exposed. The first of their portals stores its data in an iRODS (https://irods.org/) repository. They have recently completed additions to their web module, eirods-dav, that use the files, folders, and metadata stored within this repository to automatically generate Data Packages for the datasets. The next step is to expand the data that is added to the Data Packages and, similarly, to automatically expose tabular data as Tabular Data Packages. The eirods-dav GitHub repository is at https://github.com/billyfish/eirods-dav and any feedback or queries are very welcome. Read more about Simon's project here: https://frictionlessdata.io/blog/2020/08/17/frictionless-wheat/

Stephen Eglen: Analysis of spontaneous activity patterns in developing neural circuits using Frictionless Data tools

Stephen and Alexander have been busy over the summer integrating the Frictionless tools into a workflow for analysing electrophysiological datasets. They have written converters to read in their ASCII- and HDF5-based data and convert them to Frictionless containers. Along the way, they have given helpful feedback to the team about the core packages. They have settled on the Python interface as the most feature-rich implementation to work with. Alexander has now completed his analysis of the data, and they are currently working on a manuscript to highlight the research findings. Read more about Stephen's project here: https://frictionlessdata.io/blog/2020/08/03/tool-fund-cambridge-neuro/

Asura Enkhbayar: Metrics in Context

How much do we know about the measurement tools used to create scholarly metrics? While data models and standards are neither new nor uncommon in the scholarly space, "Metrics in Context" is all about the very apparatuses we use to capture the scholarly activity embedded in those metrics. In order to confidently use citations and altmetrics in research assessment or in hiring and promotion decisions, we need to be able to provide standardized descriptions of the digital infrastructure and acts of capturing involved. Asura is currently refining the conceptual model for scholarly events in the digital space in order to account for various types of activities (both traditional and alternative scholarly metrics). After a review of the existing digital landscape of scholarly infrastructure projects, he will dive into the implementation using Frictionless. You can find more details on the open roadmap on GitHub, and feel free to submit questions and comments as issues! Read more about Asura's project here: https://frictionlessdata.io/blog/2020/09/17/tool-fund-metrics/

Nikhil Vats: Adding Data Package Specifications to InterMine’s im-tables

Nikhil is working with InterMine to add data package specifications to im-tables (a library for querying biological data) so that users can export metadata along with query results. Right now, the metadata contains field names, their description links, types, paths, class description links, and primary key(s). Nikhil is currently figuring out ways to get links for data sources, attribute descriptions, and class descriptions from their FAIR terms (or description links). Next steps for the project include building the frontend for this feature in im-tables and adding the rest of the required information, such as the result file format (CSV, TSV, etc.), to the datapackage.json (metadata) file. You can contribute to this project by opening an issue here or reaching out at chat.intermine.org. Read more about Nikhil's project here: https://frictionlessdata.io/blog/2020/07/10/tool-fund-intermine/

Goodtables: Expediting the data submission and submitter feedback process

- September 16, 2020 in Frictionless Data

by Adam Shepherd, Amber York, Danie Kinkade, and Lilly Winfree

This post, originally published on the BCO-DMO blog, describes the second part of our Frictionless Data Pilot collaboration.

Earlier this year, the Biological and Chemical Oceanography Data Management Office (BCO-DMO) completed a pilot project with the Open Knowledge Foundation (OKF) to streamline the data curation processes for oceanographic datasets using Frictionless Data Pipelines (FDP). The goal of this pilot was to construct reproducible workflows that transformed the original data submitted to the office into archive-quality, FAIR-compliant versions. FDP lets a user define an ordered series of processing steps to perform on some data, and the project developed new processing steps specific to the needs of these oceanographic datasets (a minimal sketch of such a pipeline appears below). The ordered steps are saved into a configuration file that is then available to be used any time the archived version of the dataset must be reproduced.

The primary value of these configuration files is that they capture the curation process at BCO-DMO and make it transparent. We subsequently found additional value internally by using FDP in three other areas. First, it made the curation process across our data managers much more consistent than the ad-hoc data processing scripts they individually produced before FDP. Second, data managers saved time because they could reuse pre-existing pipelines to process newer versions submitted for existing datasets. Finally, the configuration files helped us keep track of which processes were used in case a bug or error was ever found in the processing code. This project exceeded our goal of using FDP on at least 80% of data submissions to BCO-DMO: we now use it almost 100% of the time.

As a major deliverable of BCO-DMO's recent NSF award, the office planned to refactor its entire data infrastructure using techniques that would allow BCO-DMO to respond more rapidly to technological change. Using Frictionless Data as a backbone for data transport is a large piece of that transformation. Continuing to work with OKF, both groups sought to extend our collaboration by focusing on how to improve the data submission process at BCO-DMO.
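As a hedged illustration of the pipeline idea described above (not BCO-DMO's actual configuration; the file names, field name, and steps are placeholders), here is a minimal sketch using the dataflows library that underpins Frictionless Data Pipelines:

```python
# A hedged sketch of a reproducible curation pipeline, not BCO-DMO's actual
# configuration. Each step runs in order, and the flow can be re-run whenever
# the archived version of the dataset must be reproduced.
from dataflows import Flow, load, set_type, dump_to_path

Flow(
    load("submissions/original.csv"),   # read the file as submitted
    set_type("depth", type="number"),   # one example processing step
    dump_to_path("archive"),            # write the archive-quality version
).process()
```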
Goodtables detects a duplication error

Goodtables noticed a duplicate row in an uploaded tabular data file.

Part of what makes BCO-DMO a successful data curation office is our hands-on work helping researchers achieve compliance with the NSF Ocean Sciences division's Sample and Data Policy. Yet a steady and constant queue of data submissions means that it can take some weeks before our data managers can thoroughly review submissions and provide necessary feedback to submitters. In response, BCO-DMO has been creating a lightweight web application for submitting data, while ensuring such a tool preserves the easy submission experience that presently exists.

Working with OKF, we wanted to expedite the data review process by giving data submitters as much immediate feedback as possible using Frictionless Data's Goodtables project. Through a data submission platform, researchers would be able to upload data to BCO-DMO and, if the data is tabular, get immediate feedback from Goodtables about whether it is correctly formatted or whether any other quality issues exist. With these reports at their disposal, submitters can update their submissions without having to wait for a BCO-DMO data manager to review. For small and minor changes, this saves the submitter the headache of waiting for simple feedback. The goal is to catch submitters at a time when they are focused on this data submission, so that they don't have to return weeks later and reconstitute their headspace around these data again. We catch them while their head is in the game.

Goodtables also provides us with a framework to branch out beyond simple tabular validation by developing data profiles. These profiles would let a submitter specify the type of data they are submitting. Is the data a bottle or CTD file? Does it contain latitude, longitude, time, or depth observations? These questions, optional for submitters to answer, would enable further validation steps and even better immediate feedback. For example, specifying that a file contains latitude or longitude columns makes it possible to detect whether all values fall within valid bounds, whether a depth column contains values above the surface, or whether the column recording the time of an observation is inconsistently formatted across rows. BCO-DMO can expand on this platform to continue adding new and better quality checks for submitters.
Goodtables detects incorrect longitudes

Goodtables noticed a longitude that is outside the range of -180 to 180. This happened because BCO-DMO recommends using decimal degrees between -180 and 180, and defined a Goodtables check for longitude fields.
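As a hedged sketch of how a bounds check like this can be expressed (not BCO-DMO's actual setup; the file and field names are placeholders), Table Schema constraints are enough for goodtables to flag out-of-range values:

```python
# A hedged sketch: a longitude bounds rule expressed as Table Schema
# constraints and validated with goodtables-py. File and field names are
# placeholders, not BCO-DMO's actual configuration.
from goodtables import validate

schema = {
    "fields": [
        {"name": "station", "type": "string"},
        {"name": "longitude", "type": "number",
         "constraints": {"minimum": -180, "maximum": 180}},
    ]
}

report = validate("samples.csv", schema=schema)
print(report["valid"])   # False if any longitude falls outside [-180, 180]
```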

Frictionless Data Monthly Virtual Hangout – 27 August

- August 24, 2020 in Frictionless Data

Join the Frictionless Data group for a virtual hangout on 27 August! These monthly hangouts are a casual opportunity to meet other Frictionless Data users and the main contributor team, ask questions, and learn about recent developments. We will spend extra time during this call discussing the newly released Python framework frictionless-py (https://github.com/frictionlessdata/frictionless-py) and would love to hear any feedback! The hangout is scheduled for 27 August 2020 at 5pm BST / 4pm UTC. This will be a 1-hour meeting where community members come together to discuss key topics in the data community. If you would like to attend the hangout, you can sign up for the event using this form. We hope to see you there! PS – you can follow the Frictionless Data project on Twitter at https://twitter.com/frictionlessd8a and we also have an events calendar at https://frictionlessdata.io/events/.

Clarifying the semantics of data matrices and results tables: a Frictionless Data Pilot

- July 21, 2020 in Frictionless Data, Genomics, pilot

As part of the Frictionless Data for Reproducible Research project, funded by the Sloan Foundation, we have started a Pilot collaboration with the Data Readiness Group at the Department of Engineering Science of the University of Oxford; the group will be represented by Dr. Philippe Rocca-Serra, an Associate Member of Faculty. This Pilot will focus on removing the friction in reported scientific experimental results by applying the Data Package specifications. Written with Dr. Philippe Rocca-Serra.

Publishing of scientific experimental results is frequently done in ad-hoc ways that are seldom consistent. For example, results are often deposited as idiosyncratic sets of Excel files or tabular files that contain very little structure or description, making them difficult to use, understand and integrate. Interpreting such tables requires human expertise, which is both costly and slow, and leads to low reuse. Ambiguous tables of results can lead researchers to rerun analysis or computation over the raw data before they understand the published tables. This current approach is broken, does not fit users' data mining workflows, and limits meta-analysis. A better procedure for organizing and structuring information would reduce unnecessary use of computational resources, which is where the Frictionless Data project comes into play. This Pilot collaboration aims to help researchers publish their results in a more structured, reusable way.

In this Pilot, we will use (and possibly extend) Frictionless tabular data packages to devise both generic and specialized templates that can be used to unambiguously report experimental results (a minimal illustration appears below). Our short-term goal is to develop a set of Frictionless Data Packages for targeted use cases where impact is high. We will focus first on creating templates for statistical comparison results, such as differential analysis, enrichment analysis, high-throughput screens, and univariate comparisons, in genomics research by using the STATO ontology within tabular data packages.

Our longer-term goals are that these templates will be incorporated into publishing systems to allow for clearer reporting of results, more knowledge extraction, and more reproducible science. For instance, we anticipate that this work will allow for increased consistency of table structure in publications, as well as increased data reuse owing to predictable syntax and layout. We also hope this work will ease the creation of linked data graphs from tables of results thanks to clarified semantics. An additional goal is to create code that is compatible with R's ggplot2 library, which would allow for easy generation of data analysis plots. To this end, we plan on working with R developers in the future to create a package that will generate Frictionless Data compliant data packages.

This work has recently begun and will continue throughout the year. We have already met with some challenges, such as working out ways to transform, or normalize, data and ways to incorporate RDF linked data (you can read our related conversations on GitHub). We are also working on how to define a 'generic' table layout definition that is broad enough to be reused in as wide a range of situations as possible.
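To make the template idea concrete, here is a purely illustrative sketch of a minimal tabular data package descriptor for a differential-analysis results table, built with the datapackage-py library. The field names and constraints are assumptions for illustration, not the Pilot's actual template:

```python
# An illustrative sketch only: assumed field names, not the Pilot's template.
from datapackage import Package

descriptor = {
    "name": "differential-analysis-results",
    "resources": [{
        "name": "results",
        "path": "results.csv",
        "schema": {
            "fields": [
                {"name": "feature_id", "type": "string"},
                {"name": "log2_fold_change", "type": "number"},
                {"name": "p_value", "type": "number",
                 "constraints": {"minimum": 0, "maximum": 1}},
            ]
        },
    }],
}

package = Package(descriptor)
print(package.valid)   # True when the descriptor conforms to the specs
```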
If you are interested in staying up to date on this work, we encourage you to check out these repositories: https://gitlab.com/datascriptor/datascriptor-fldatapackages and https://github.com/ISA-tools/frictionless-collab. Additionally, we will (virtually) be at the eLife Sprint in September to work on closely related ideas, which you can read about here: https://sprint.elifesciences.org/data-paper-skeleton-tools-for-life-sciences/. Throughout this Pilot, we plan to reach out to the community to test these ideas and get feedback. Please contact us on GitHub or in Discord if you are interested in contributing.

Reflecting on the first cohort of Frictionless Data Reproducible Research fellows

- June 9, 2020 in Frictionless Data

It is truly bittersweet to say that we are at the end of the first cohort of the Frictionless Data Reproducible Research fellows. Over the past nine months, I have had the pleasure of working with Monica Granados, Selene Yang, Daniel Ouso and Lily Zhao during the fellows programme. Combining their diverse backgrounds (from government data to mapping data, from post-PhD to graduate student), they have spent many hours together learning how to advocate for open science and how to use the Frictionless Data code and tools to make their data more reproducible. Together, they have also written several blogposts, presented a talk and given a workshop. And they did all of this during a global pandemic! I feel lucky to have worked with them, and will be eagerly watching their contributions to the open science space. Each fellow wrote a final blogpost reflecting on their time with the programme. You can read the originals here, and I have also republished them below:

Lily Zhao: Reflecting on my time as a fellow

As one of the inaugural Reproducible Research Fellows of Frictionless Data, I am eager to share my experience of working in the program with Sele, Ouso and Monica under the leadership of Lilly Winfree this year. I could not have asked for a better group of individuals to work remotely with. Sele, Ouso, Monica and I spent the last nine months discussing common issues in research reproducibility and delving into the philosophy behind open data science. Together we learned to apply Frictionless Data tools to our own data and mastered techniques for streamlining the reproducibility of our own research process. Lilly was an excellent mentor throughout the program and was always there to help with any issues we ran into.

This was also one of my first experiences working entirely remotely on a team spread across many time zones. Through the use of Google Hangouts, Zoom and Slack, the entire process was easier than I ever thought it could be. It is wonderful that through technology we are able to collaborate across the world more easily than ever before. We were also able to give multiple presentations together. Monica and I were joint speakers at csv,conf, where we talked about our experience as fellows and our experience using Frictionless Data tools. With so many people on the Zoom call, it really felt like we were part of a large community. The four of us also led a hands-on workshop introducing the Data Package Creator and Goodtables web interface tools. This was especially fun for me because we used a subset of my French Polynesia interview data as practice data for all workshop participants. Many of the questions asked by participants mirrored questions the four of us had already worked through together, so it was great to be able to share what we had learned with others.

I look forward to sharing these tools and the philosophy of open data science throughout my career and am very grateful to the Open Knowledge Foundation for this amazing learning opportunity. If you would like to learn more about my experience in the Frictionless Data Fellows program, please feel free to reach out to me personally!

Monica, Sele, Lilly, Ouso and I on our most recent conference call :)

Monica Granados: Gimme Dat Data (in a validated Data Package)

As a scientist I collect a lot of data, especially about animals that live in the water: fish, mussels, crayfish. These data are not only useful to me; they can be used by others to improve the power of their studies, or to increase geographic range or phylogenetic diversity, for example. Prior to the Frictionless Data for Reproducible Research Fellowship, I had my data on GitHub along with a script that would use RCurl to pull the data from the repository. While the repository was accompanied by a README, the file didn't have much information other than the manuscript which included the data. This structure facilitated reproducibility but not reusability. Conceivably, if you wanted to use my data for your own experiments, you could have contextualised the data by using the relevant manuscript, but it still would have been a challenge without any metadata, not to mention any potential structural errors you could have encountered that I didn't catch when I uploaded the data.

It was through the introduction of the Frictionless tools, however, that I realised there was more I could do to make my science even more transparent, reproducible and reusable. The fellowship syllabus was structured in such a way that by learning about the tools, we learned what the tools were facilitating: better data sharing. The fellows would learn how to use a tool through a self-guided lesson and then answer questions on Slack which asked us to interrogate why the tool was built the way it was. These lessons were also supported by calls with the full cohort of fellows, where we discussed what we had learned and the problems we were encountering as we used the tools with our own data, and reviewed papers on open science. The fellowship culminated with a workshop delivered by all four fellows, attended by over 40 participants, and a presentation at csv,conf.

Now when I share data as a data package, I know I have validated my tabular data for structural errors and that the file contains metadata that contextualises the data. Having the opportunity to be a part of the inaugural cohort has been a wonderful experience. I learned new tools and information that I will take and share for the rest of my career, but I also gained new colleagues and open science friends in my fellow fellows.

Daniel Ouso: Better Data, one resource at a time – my fellowship experience

Getting into the Frictionless Data fellowship

My background is largely in molecular biology, particularly infection diagnostics targeting arthropod viruses, bacteria and protozoa. I have relatively less bioinformatics experience, but this is the direction in which I am passionate about building my research career. I first heard about Frictionless Data from the African Carpentries instructors' mailing list, where the inaugural fellowship call had been shared by Anelda. I caught it in the nick of time – just before the submission deadline! By the way, you can watch for annual calls and other interesting news by following @frictionlessd8a. The call for the second cohort was open from late April and closed in June; the fellowship starts in September.

On-boarding

Lilly arranged a first-time meeting to usher me into the fellowship after a few email exchanges. I was introduced to Jo Barrat, who patiently took me through my paces in completing the logistical preliminaries. I was really looking forward to getting started. The on-boarding also allowed me to get acquainted with the rest of the fellows – awesome people. I was excited!

Context

Overall, the world is in search of, and is promoting, better ways to work with data: whether that is collecting data, making it accessible, finding novel ways to analyse high-throughput data, building dedicated workflows to publish data alongside accustomed scientific publishing, moving and working with data across frameworks, or merely storage and security. All of these, plus other factors, provide avenues to exhaustively interrogate data in multiple ways, thus promoting improved data usefulness – something that has arguably been under-appreciated in times past. Frictionless Data, through its Progressive Data Toolkit, with the help of organisations like OKF and funding from the Sloan Foundation, is dedicated to alleviating hindrances to some of the aforementioned efforts. Empowering people is a core resource for the #BetterData dream.

The fellowship

An aspect of any research is the collection of data, which is used to test the hypotheses under study. The underlying importance of data – good data, for that matter – in research is therefore unquestionable. Approaches to data analysis may differ from field to field, yet there are conventional principles that do not discriminate between fields; such are the targets of Frictionless Data. I jumped at the opportunity to learn ways to ramp up my data workflow efficiency, with a touch of research openness and reproducibility. The journey took off with drawing up a meticulous roadmap, which I found very helpful, and seems to end with this: sharing my experience. In between, exciting things happened. In case one was coming in a little rusty on basic Python/R, that was catered for early on, though you didn't exactly need programming to use the tools – literally ZERO programming skills are a prerequisite. There was a plethora of resources, and help from the fellows, not to mention from the ever-welcoming Lilly.

The core sections of the fellowship were prefaced by grasping basic components like the JSON data interchange format. Then followed the core tools and their specifications. The Data Package Creator tool is impressively emphatic about capturing metadata, a backbone theme for reproducibility. I found the Table Schema and Schema specifications initially confusing. The other fellows and I have previously written about the Data Package Creator and Goodtables, the tools for creating and validating data packages respectively. These tools are very progressive, continually incorporating feedback from the community, including the fellows, to improve the user experience – so don't be surprised by a few changes since the fellows' blogs. In fact, a new entrant, which I only learned of recently, is the DataHub tool, "a useful solution for sharing datasets, and discovering high-quality datasets that others have produced". I am yet to check it out. Besides the main focus of the fellowship, I got to learn a lot about organisational skills and tools, such as GitHub projects, Toggl for time-monitoring, and general remote working, among others. I was also introduced to new communities and initiatives such as PREreview – my first time participating in open research reviewing. The fellows were awesome to work with, and Lilly Winfree provided the best mentorship. Sometimes problems are foreseen and contingencies planned; other times, unforeseen surprises rear their heads into our otherwise "perfect" plan. Guess what? You nailed it: COVID-19. Such situations require adaptability akin to that of the fictional El Profesor in Money Heist. Since we could not organise the in-person seminar and/or workshops that were part of the fellowship plan, we collectively adopted a virtual workshop. It went amazingly well.

What next

Acquired knowledge and skills become more useful when implemented. My goal is to apply them at every opportune opening and to keep learning other integrative tools. Yet there is also this about knowledge: it is meant to be spread. I hope to make up for the suspended social sessions and to keep engaging with @frictionlessd8a to continue advocating for open and reproducible research.

Conclusion

Tools that need minimal to no coding experience strongly support the adoption of good data hygiene practices, more so in places with scanty coding expertise. The FD tools will surely give your workflows some greasing regardless of your coding proficiency, especially for tabular data. This is especially needful given the deluge of data persistently churned out from various sources. Frictionless Data is for everyone working with data: researchers, data scientists or data engineers. The ultimate goal is to work with data in an open and reproducible way, consistent with modern scientific research practice. A concerted approach is also key, and I am glad to have represented Africa in the fellowship. Do not hesitate to reach out if you think I can be resourceful to your cause.

Sele Yang: Keep on reproducing knowledge!

A great process has come to an end for the first cohort of the Frictionless Data for Reproducible Research Fellowship: a process of great and valuable learning that was only possible thanks to the collaborative work of everyone who took part. At the start, I remember the great fear (which in some form still persists, though more faintly) of not having the technical skills required to carry out my project, but little by little I got to know my fellow fellows and felt supported by them, as they took me by the hand with enormous patience so that I would not get lost along the way. Thanks to this team, I walked the beaches of Mo'orea through Lily's data, and I learned about ways of doing research outside my own field of expertise with Ouso and Monica. I came to appreciate the great work that researchers do to defend more open, equitable and accessible knowledge. Although our shared journey ends here, I can say that despite the COVID-19 crisis, which forced us to change many plans during our programme, we managed to come together, even if only virtually, not only among ourselves but also with a large audience for our workshop on the programme's tools and methodologies. It was a great activity for reinforcing the importance of sharing knowledge and making it more accessible, all the more so in times of crisis. I thank the Open Knowledge Foundation for running this programme, and I invite everyone to explore the material we produced during these months of work. I finish this learning process with an even stronger conviction of how necessary collaborative processes are that seek to open up and democratise science and knowledge, especially in these times, when collaboration and the pooling of what we learn will make us stronger as a society.

Join the Frictionless Data workshop – 20 May

- April 28, 2020 in Frictionless Data

Join us on 20 May at 4pm UK / 10am CDT for a Frictionless Data workshop led by the Reproducible Research Fellows! This 1.5-hour workshop will cover an introduction to the open source Frictionless Data tools. Participants will learn about data wrangling, including how to document metadata, package data into a data package, write a schema to describe data, and validate data. The workshop is suitable for beginners and those looking to learn more about using Frictionless Data.

Everyone is welcome to join, but you must register to attend using this link

The Fellows Programme is part of the Frictionless Data for Reproducible Research project overseen by the Open Knowledge Foundation. This project, funded by the Sloan Foundation, applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate data workflows in research contexts. At its core, Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. The core specification, the Data Package, is a simple and practical “container” for data and metadata.

This workshop will be led by the members of the First Cohort of the Fellows Programme: Lily Zhao, Daniel Ouso, Monica Granados, and Selene Yang. You can read more about their work during this programme here: http://fellows.frictionlessdata.io/blog/. Additionally, applications are now open for the Second Cohort of Fellows. Read more about applying here: https://blog.okfn.org/2020/04/27/apply-now-to-become-a-frictionless-data-reproducible-research-fellow/

Apply now to become a Frictionless Data Reproducible Research Fellow

- April 27, 2020 in Frictionless Data

The Frictionless Data Reproducible Research Fellows Program, supported by the Sloan Foundation, aims to train graduate students, postdoctoral scholars, and early career researchers to become champions for open, reproducible research using Frictionless Data tools and approaches in their fields. Apply today to join the Second Cohort of Frictionless Data Fellows! Fellows will learn about Frictionless Data, including how to use Frictionless tools in their domains to improve reproducible research workflows, and how to advocate for open science. Working closely with the Frictionless Data team, Fellows will lead training workshops at conferences, host events at universities and in labs, and write blogs and other communications content. In addition to mentorship, we are providing Fellows with stipends of $5,000 to support their work and time during the nine-month Fellowship. We welcome applications using this form from 27 April until 1 June 2020, with the Fellowship starting in late Summer. We value diversity and encourage applications from communities that are under-represented in science and technology, people of colour, women, people with disabilities, and LGBTI+ individuals.

Frictionless Data for Reproducible Research

The Fellowship is part of the Frictionless Data for Reproducible Research project at Open Knowledge Foundation, and is the second iteration. Frictionless Data aims to reduce the friction often found when working with data, such as when data is poorly structured, incomplete, hard to find, or is archived in difficult to use formats. This project, funded by the Sloan Foundation, applies our work to data-driven research disciplines, in order to help researchers and the research community resolve data workflow issues.  At its core, Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. The core specification, the Data Package, is a simple and practical “container” for data and metadata. The Frictionless Data approach aims to address identified needs for improving data-driven research such as generalized, standard metadata formats, interoperable data, and open-source tooling for data validation.

Fellowship programme

The First Cohort of Fellows ran from Fall 2019 until Summer 2020, and you can read more about their work on the Fellows blog: http://fellows.frictionlessdata.io/blog/. During the Fellowship, our team will be on hand to work closely with you as you complete the work. We will help you learn Frictionless Data tooling and software, and provide you with resources to help you create workshops and presentations. We will also announce Fellows on the project website and publish your blogs and workshop slides within our network channels. We will provide mentorship on how to work on an open project, and will work with you to achieve your Fellowship goals.

How to apply

The Fellowship is open to early career researchers, such as graduate students and postdoctoral scholars, anywhere in the world and in any scientific discipline. Successful applicants will be enthusiastic about reproducible research and open science, have some experience with communications, writing, or giving presentations, and have some technical skills (basic experience with Python, R, or MATLAB, for example), but do not need to be technically proficient. If you are interested but do not have all of the qualifications, we still encourage you to apply. If you have any questions, please email the team at frictionlessdata@okfn.org, ask a question on the project's gitter channel, or check out the Fellows FAQ section. Apply soon, and share with your networks!

Announcing Frictionless Data Community Virtual Hangout – 20 April

- April 16, 2020 in Frictionless Data

Photo by William White on Unsplash

We are thrilled to announce we’ll be co-hosting a virtual community hangout with Datopian to share recent developments in the Frictionless Data community. This will be a 1-hour meeting where community members come together to discuss key topics in the data community. Here are some of the key discussions we hope to cover:
  • Introductions & share the purpose of this hangout.
  • Share the update on the new website release and general Frictionless Data related updates.
  • Have community members share their thoughts and general feedback on Frictionless Data.
  • Share information about CSV Conf.
The hangout is scheduled to happen on 20th April 2020 at 5 pm CET. If you would like to attend, you can sign up for the event in advance here. Everyone is welcome. Looking forward to seeing you there!

Tracking the Trade of Octopus (and Packaging the Data)

- March 13, 2020 in Frictionless Data, Open Knowledge

This blog is the second in a series done by the Frictionless Data Fellows, discussing how they created Frictionless Data Packages with their research data. Learn more about the Fellows Programme here http://fellows.frictionlessdata.io/. By Lily Zhao

Introduction

When I started graduate school, I was shocked to learn that seafood is actually the most internationally traded food commodity in the world. In fact, the global trade in fish is worth more than the trades of tea, coffee, and sugar combined (Fisheries FAO, 2006). However, for many developing countries, being connected to the global seafood market can be a double-edged sword. It is true that global trade has the potential to redistribute some wealth and improve the livelihoods of fishers and traders in these countries. But it can also promote illegal trade and overfishing, which can harm the future sustainability of a local food source. Over the course of my master's degree, I developed a passion for studying these issues, which is why I am excited to share with you my experience turning some of the data my collaborators and I collected into a packaged dataset using the Open Knowledge Foundation's Data Package tooling.

These data provide a snapshot into the global market for octopus and how it is traded throughout and between Kenya, Tanzania, and Mozambique before heading to European markets. This research project was an international collaboration between the Stockholm Resilience Centre in Sweden, the National Institute for Medical Research of Tanzania, Pwani University in Kilifi, Kenya, and the School of Marine and Environmental Affairs at the University of Washington. These data eventually became my master's thesis, and this data package will complement a forthcoming publication of our findings. Specifically, these data are the prices and quantities at which middlemen in Tanzania and Kenya reported buying and selling octopus.

These data are exciting because they not only inform our understanding of who benefits from the trade of octopus but could also assist in improving the market price of octopus in Tanzania. This is because value chain information can help Tanzania's octopus fishery along its path to Marine Stewardship Council seafood certification. Seafood that earns the Marine Stewardship Council label gains credibility, which in turn can increase profit. For developing countries, this seafood label can provide a monetary incentive for improving fisheries management. But before Tanzania's octopus fishery can get certified, it will need to prove it can trace the flow of its octopus supply chain and manage it sustainably. We hope that this packaged dataset will ultimately inform this effort.

Getting the data

To gather the data, my field partner Chris and I went to 10 different fishing communities like this one.

Middlemen buy and sell seafood in Mtwara, Tanzania.

We went on to interview all the major exporters of octopus in both Tanzania and Kenya, and spoke with company agents and octopus traders who bought their octopus from 570 different fishermen. With these interviews, we were able to account for about 95% of East Africa's international octopus market share.

My research partner- Chris Cheupe, and I at an octopus collection point.

Creating the Data Package

The datapackage tool was created by the Open Knowledge Foundation to compile our data and metadata into a compact unit, making it easier and more efficient for others to access. You can create a data package using the online platform or using the Python or R libraries. I initially had some issues using the R package instead of the online tool, which may have been related to the fact that the original data file was not utf-8 encoded. But stay tuned! For now, I made my data package using the Data Package Creator online tool. The tool helped me create a schema that outlines the data's structure, including a description of each column. The tool also helps you outline the metadata for the dataset as a whole, including information like the license and author. Our dataset has a lot of complicated columns, and the tool gave me a streamlined way to describe each column via the schema. Afterwards, I added the metadata using the left-hand side of the browser tool and checked to make sure that the data package was valid!

The green bar at the top of the screenshot indicates validity

If the information you provide for each column does not match the data within the columns, the package will not validate and instead you will get an error like this:

The red bar at the top of the screenshot indicates invalidity
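For anyone curious about the Python route mentioned above, here is a minimal, hedged sketch using the datapackage-py library; the file name is a placeholder for my octopus trade data:

```python
# A minimal sketch of the programmatic route; "octopus_trade.csv" is a
# placeholder file name.
from datapackage import Package

package = Package()
package.infer("octopus_trade.csv")   # infer a schema and resource metadata
print(package.valid)                 # the same check as the green bar above
package.save("datapackage.json")     # write the descriptor for sharing
```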

Check out my final datapackage by visiting my GitHub repository!

Reference:

Fisheries, F. A. O. (2006). The state of world fisheries and aquaculture 2006.