
Always Already Computational Reflections

- June 21, 2017 in Frictionless Data

Always Already Computational is a project bringing together a variety of different perspectives to develop “a strategic approach to developing, describing, providing access to, and encouraging reuse of collections that support computationally-driven research and teaching” in subject areas relating to library and museum collections. This post is adapted from my Position Statement for the initial workshop. You can find out more about the project at https://collectionsasdata.github.io.

Earlier this year, I spent two and a half days at a workshop at the beautiful University of California, Santa Barbara, speaking with librarians, developers, and museum and library collection managers about data. Attendees at this workshop represented a variety of respected cultural institutions including the New York Public Library, the British Library, the Internet Archive, and others. Our task was to build a collective sense of what it means to treat library and museum “collections”—the (increasingly) digitized catalogs of their holdings—as data for analysis, art, research, and other forms of re-use. We gathered use cases and user stories in order to start the conversation on how best to publish collections for these purposes. Look for further outputs on the project website: https://collectionsasdata.github.io. For the moment, here are my thoughts on the experience and how it relates to work at Open Knowledge International, specifically, Frictionless Data.

Always Already Computational

Open Access to (meta)Data

The event organizers—Thomas Padilla (University of California Santa Barbara), Laurie Allen (University of Pennsylvania), Stewart Varner (University of Pennsylvania), Sarah Potvin (Texas A&M University), Elizabeth Russey Roke (Emory University), and Hannah Frost (Stanford University)—took an expansive view of who should attend. I was honored and excited to join, but decidedly new to Digital Humanities (DH) and related fields. The event served as an excellent introduction, and I now understand DH to be a set of approaches toward interrogating recorded history and culture with the power of our current tools for data analysis, visualization, and machine learning. As part of the Frictionless Data project at Open Knowledge International, we are building apps, libraries, and specifications that support the basic transport and description of datasets to aid in this kind of data-driven discovery. We are trialling this approach across a variety of fields, and are interested to determine the extent to which it can improve research using library and museum collection data.

What is library and museum collection data? Libraries and museums hold physical objects which are often (although not always) shared for public view on the stacks or during exhibits. Access to information (metadata) about these objects—and the sort of cultural and historical research dependent on such access—has naturally been somewhat technologically, geographically, and temporally restricted. Digitizing the detailed catalogues of the objects libraries and museums hold surely lowered the overhead of day-to-day administration of these objects, but it also provided a secondary public benefit: sharing this same metadata on the web with a permissive license allows a greater variety of users in the public—researchers, students of history, and others—to freely interrogate our cultural heritage in a manner of their choosing.

There are many different ways to share data on the web, of course, but they are not all equal. A low-impact, open, standards-based set of approaches to sharing collections data that incorporates a diversity of potential use cases is necessary. To answer this need, many museums are currently publishing their collection data online, with permissive licensing, through GitHub: The Tate Galleries in the UK, Cooper Hewitt, Smithsonian Design Museum, and The Metropolitan Museum of Art in New York have all released their collection data in CSV (and JSON) format on this popular platform normally used for sharing code. See A Nerd’s Guide To The 2,229 Paintings At MoMA and An Excavation Of One Of The World’s Greatest Art Collections, both published by FiveThirtyEight, for examples of the kind of exploratory research enabled by sharing museum collection data in bulk in a straightforward, user-friendly way. What exactly did they do, and what else may be needed?

Packages of Museum Data

Our current funding from the Sloan Foundation enables us to focus on this researcher use case for consuming data. Across fields, the research process is often messy, and researchers, even when they are asking the right questions, possess varying levels of skill in working with datasets to answer them. As I wrote in my position statement:
Such data, released on the Internet under open licenses, can provide an opportunity for researchers to create a new lens onto our cultural and artistic history by sparking imaginative re-use and analysis.  For organizations like museums and libraries that serve the public interest, it is important that data are provided in ways that enable the maximum number of users to easily process it.  Unfortunately, there are not always clear standards for publishing such data, and the diversity of publishing options can cause unnecessary overhead when researchers are not trained in data access/cleaning techniques.

My experience at this event, and some research beforehand, suggested that there is a spectrum of data release approaches, ranging from a basic data “dump” as conducted by the museums referenced above to more advanced, though higher-investment, approaches such as publishing data as an online service with a public “API” (Application Programming Interface). A public API can provide a consistent interface to collection metadata, as well as the ability to request only the needed records, but comes at the cost of having the nature of the analysis somewhat preordained by its design. In contrast, with the data dump approach, an entire dataset, or a coherent chunk of it, can be easier for some users to access and load directly into a tool like R (see this UK Government Digital Service post on the challenges of each approach) without needing advanced programming. As a format for this bulk download, CSV is the best choice, as MoMA reflected when releasing their collection data online:
CSV is not just the easiest way to start but probably the most accessible format for a broad audience of researchers, artists, and designers.  
This, of course, comes at the cost of a less consistent interface for the data, especially in the case of the notoriously underspecified CSV format. The README file will typically go into some narrative detail about how best to use the dataset and some expected “gotchas” (e.g. “this UTF-8 file may not work well with Excel on a Mac”). It might also list the columns in a tabular data file stored in the dataset, along with the expected types and formats for values in each column (e.g. the date_acquired column should, hopefully, contain dates in one or another international format). This information is critical for actually using the data, and the automated export process that generates the public collection dataset from the museum’s internal database may try to ensure that the data matches expectations, but bugs exist, and small errors may go unnoticed in the process.

The Data Package descriptor (described in detail on our specifications site), used in conjunction with Data Package-aware tooling, is meant to partially restore the consistent interface provided by an API by embedding this “schema” information with the data (a minimal sketch of such a descriptor appears at the end of this post). This allows the user or the publisher to check that the data conforms to expectations without requiring modification of the data itself: a “packaged” CSV can still be loaded into Excel as-is (though without the benefit of type checking enabled by the Data Package descriptor). The Carnegie Museum of Art, in its release of its collection data, followed the examples set by the Tate, the Met, MoMA, and Cooper Hewitt as described above, but opted to also include a Data Package descriptor file to help facilitate online validation of the dataset through tools such as Good Tables. As tools come online for editing, validating, and transforming Data Packages, users of this dataset should be able to benefit from those, too: http://frictionlessdata.io/tools/.

We are a partner in the Always Already Computational: Collections as Data project, and as part of this work, we are collaborating with the Carnegie Museum of Art to provide a more detailed look at the process that went into the creation of the CMOA dataset, as well as sketching potential ways in which the Data Package might help enable re-use of this data. In the meantime, check out our other case studies on the use of Data Packages in fields as diverse as ecology, cell migration, and energy data: http://frictionlessdata.io/case-studies/. Also, pay your local museum or library a visit.
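To make the descriptor idea concrete, here is a minimal sketch of what one might look like for a hypothetical collection export. The file name, fields, and license below are invented for illustration and are not taken from the CMOA release; a real descriptor would mirror the columns of the published CSV.

```python
import json

# A minimal, hypothetical Data Package descriptor for a museum collection
# export. The resource path and field names are invented for illustration.
descriptor = {
    "name": "example-collection",
    "licenses": [{"name": "CC0-1.0"}],
    "resources": [
        {
            "name": "artworks",
            "path": "artworks.csv",
            "schema": {
                "fields": [
                    {"name": "id", "type": "integer"},
                    {"name": "title", "type": "string"},
                    {"name": "date_acquired", "type": "date"},
                ]
            },
        }
    ],
}

# The descriptor is saved alongside the CSV; the CSV itself is untouched and
# still opens in Excel or R exactly as before.
with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```

With a file like this in place, a tool such as Good Tables can check the published CSV against the declared column types without the publisher having to modify the data itself.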

csv,conf,v3

- May 30, 2017 in Events, Frictionless Data, OD4D, Open Spending

The third manifestation of everyone’s favorite community conference about data—csv,conf,v3—happened earlier this May in Portland, Oregon. The conference brought together data makers/doers/hackers from various backgrounds to share knowledge and stories about data in a relaxed, convivial, alpaca-friendly (see below) environment. Several Open Knowledge International staff working across our Frictionless Data, OpenSpending, and Open Data for Development projects made the journey to Portland to help organize, give talks, and exchange stories about our lives with data. Thanks to Portland and the Eliot Center for hosting us. And, of course, thanks to the excellent keynote speakers Laurie Allen, Heather Joseph, Mike Bostock, and Angela Bassa who provided a great framing for the conference through their insightful talks. Here’s what we saw.

Talks We Gave

The first priority for the team was to present on the current state of our work and Open Knowledge International’s mission more generally.

In his talk, Continuous Data Validation for Everybody, developer Adrià Mercader updated the crowd on the launch and motivation of goodtables.io:

It was a privilege to be able to present our work at one of my favourite conferences. One of the main things attendees highlight about csv,conf is how diverse it is: many different backgrounds were represented, from librarians to developers, from government workers to activists. Across many talks and discussions, the need to make published data more useful to people came up repeatedly. Specifically, how could we as a community help people publish better quality data? Our talk introducing goodtables.io presented what we think will be a dominant approach to this question: automated validation. Building on successful practices in software development like automated testing, goodtables.io integrates with the data publication process to allow publishers to identify issues early and ensure data quality is maintained over time. The talk was very well received, and many people reached out to learn more about the platform. Hopefully, we can continue the conversation to ensure that automated (frictionless) data validation becomes the standard in all data publication workflows.

David Selassie Opoku presented When Data Collection Meets Non-technical CSOs in Low-Income Areas:

csv,conf was a great opportunity to share highlights of the OD4D (and School of Data) team’s data collection work. The diverse audience seemed to really appreciate insights on working with non-technical CSOs in low-income areas to carry out data collection. In addition to highlighting the lessons from the work and its potential benefit to other regions of the world, I got to connect with data literacy organisations such as Data Carpentry, who are currently expanding their work in Africa and could help foster potential data literacy training partnerships. As a team working with CSOs in low-income areas in Africa, School of Data stands to benefit from continuing conversations with data “makers” in order to present potential use cases. A clear example I cited in my talk was Kobo Toolbox, which continues to mitigate several daunting challenges of data collection through abstraction and simple user interface design. Staying in touch with the csv,conf community may highlight more such scenarios, which could lead to the development of new tools for data collection.

Paul Walsh, in his talk titled Open Data and the Question of Quality (slides), talked about lessons learned from working on a range of government data publishing projects and what we can do as citizens to demand better quality data from our governments.

Talks We Saw

Of course, we weren’t there only to present; we were there to learn from others as well. Before the conference, through our Frictionless Data project, we had been lucky to be in contact with various developers and thinkers around the world who also presented talks. Eric Busboom presented Metatab, an approach to packaging metadata in spreadsheets. Jasper Heefer of Gapminder talked about DDF, a data description format and associated data pipeline tool to help us live a more fact-based existence. Bob Gradeck of the Western Pennsylvania Regional Data Center talked about data intermediaries in civic tech, a topic near and dear to our hearts here at Open Knowledge International.

Favorite Talks

Paul’s:
  • “Data in the Humanities Classroom” by Miriam Posner
  • “Our Cities, Our Data” by Kate Rabinowitz
  • “When Data Collection Meets Non-technical CSOs in Low Income Areas” by David Selassie Opoku
David’s:
  • “Empowering People By Democratizing Data Skills” by Erin Becker
  • “Teaching Quantitative and Computational Skills to Undergraduates using Jupyter Notebooks” by Brian Avery
  • “Applying Software Engineering Practices to Data Analysis” by Emil Bay
  • “Open Data Networks with Fieldkit” by Eric Buth
Jo’s:
  • “Smelly London: visualising historical smells through text-mining, geo-referencing and mapping” by Deborah Leem
  • “Open Data Networks with Fieldkit” by Eric Buth
  • “The Art and Science of Generative Nonsense” by Mouse Reeve
  • “Data Lovers in a Dangerous Time” by Brendan O’Brien

Data Tables

This csv,conf was the first to have a dedicated space for working with data hands-on. In past events, attendees left with their heads buzzing full of new ideas, tools, and domains to explore but had to wait until returning home to try them out. This time we thought: why wait? During the talks, we had a series of hands-on workshops where facilitators could walk through a given product and chat about the motivations, challenges, and other interesting details you might not normally get to in a talk. We also prepared several data “themes” before the conference, meant to bring people together on a specific topic around data. In the end, these themes proved a useful starting point for several of the facilitators and provided a basis for a discussion on cultural heritage data, following on from a previous workshop on the topic.

The facilitated sessions went well. Our own Adam Kariv walked through Data Package Pipelines, his ETL tool for data based on the Data Package framework. Jason Crawford demonstrated Fieldbook, a tool for easily managing a database in-browser as you would a spreadsheet. Bruno Vieira presented Bionode, going into fascinating detail on the mechanics of Node.js Streams. Nokome Bentley walked through a hands-on introduction to accessible, reproducible data analysis using Stencila, a way to create interactive, data-driven documents in the language of your choice. Representatives from data.world, an Austin startup we worked with on an integration for Frictionless Data, also demonstrated uploading datasets to data.world. The final workshop was conducted by several members of the Dat team, including co-organizer Max Ogden, with a super enthusiastic crowd.

Competition from the day’s talks was always going to be fierce, but it seems that many attendees found some value in the more intimate setting provided by Data Tables.

Thanks

If you were at csv,conf in Portland, we hope you had a great time. Of course, our thanks go to the Gordon and Betty Moore Foundation and to the Sloan Foundation for enabling me and my fellow organizers John Chodacki, Max Ogden, Martin Fenner, Karthik, Elaine Wong, Danielle Robinson, Simon Vansintjan, Nate Goldman and Jo Barratt, who all put so much personal time and effort into bringing this all together. Oh, and did I mention the Comma Llama Alpaca? You, um, had to be there.

Data Tables at csv,conf,v3

- May 1, 2017 in Open Knowledge

You may have heard that csv,conf,v3 is happening again this year, May 2nd and 3rd, in Portland, Oregon! We have a really great line-up of speakers from across the world on a set of topics ranging from the democratization of data analysis and election mapping to the very timely subject of emoji data science. But what’s a conference without a workshop or two? To answer that rhetorical question, we are excited to announce a stream at csv,conf,v3 called Data Tables (data-tables.csv), and everyone is invited!

Data Tables

csv,conf,v3 is the place to hear stories about data sharing and data analysis from science, journalism, government, and open source. In past events, attendees left with their heads buzzing full of new ideas, tools, and data domains to explore but had to wait until returning home to try them out. This time we thought: why wait? Data Tables (data-tables.csv) is an open hacking space at csv,conf,v3 for hands-on, collaborative data work, with a mix of facilitator-led sessions and sessions loosely organized around data themes, for example, agricultural or cultural heritage data. The idea is to learn about a tool, hack on a problem, or explore a new dataset in a group setting. We are running sessions (each about 1.5 hours long) concurrently with the talks; csv,conf,v3 attendees are free to drop by a Data Tables session of their choosing and take off in time for their next talk.

Data Tables is a set of tables: one facilitator-led, and the rest organized around data themes. There will be a series of sessions at the facilitator-led table over the course of the conference, at least one each morning and one each afternoon, each of which will introduce interested attendees to a given technology, tool, or platform. At the other tables, we hope to enable people to come and do some hacking on a real data problem and discover new or novel tools for those problems, without the hackathon-like commitment of producing a working thing at the end of the two days. We expect people to come and go, but to have fun and interact over a real dataset along the way. We are coordinating on the data-tables.csv repo on GitHub.

Participants in last year’s csv,conf pre-event workshop

If you are passionate about data and the applications it has in our society, follow @csvconference this week on Twitter for updates.  For questions, please email csv-conf-coord@googlegroups.com or join the public slack channel https://csvconf-slackin.herokuapp.com/.

Excel is threatening the quality of research data — Data Packages are here to help

- February 20, 2017 in Frictionless Data

This week the Frictionless Data team at Open Knowledge International will be speaking at the International Digital Curation Conference #idcc17 on making research data quality visible. Here, Dan Fowler looks at why the popular file format Excel is problematic for research and what steps can be taken to ensure data quality is maintained throughout the research process.

Our Frictionless Data project aims to make sharing and using data as easy and frictionless as possible by improving how data is packaged. The project is designed to support the tools and file formats researchers use in their everyday work, including basic CSV files and popular data analysis programming languages and frameworks like R and Python Pandas. However, Microsoft Excel, both the application and the file format, remains very popular for data analysis in scientific research.

It is easy to see why Excel retains its stranglehold: over the years, an array of convenience features for visualizing, validating, and modeling data have been developed and adopted across a variety of uses. Simple features, like the ability to group related tables together, are a major advantage of the Excel format over, for example, single-table formats like CSV. However, Excel has a well-documented history of silently corrupting data in unexpected ways, which leads some, like data scientist Jenny Bryan, to compile lists of “Scary Excel Stories” advising researchers to choose alternative formats or, at least, treat data stored in Excel warily.

“Excel has a well-documented history of silently corrupting data in unexpected ways…”

With data validation and long-term preservation in mind, we’ve created Data Packages which provide researchers an alternative format to Excel by building on simpler, well understood text-based file formats like CSV and JSON and adding advanced features.  Added features include providing a framework for linking multiple tables together; setting column types, constraints, and relations between columns; and adding high-level metadata like licensing information.  Transporting research data with open, granular metadata in this format, paired with tools like Good Tables for validation, can be a safer and more transparent option than Excel.
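As a rough illustration of the multi-table case, the following sketch links two hypothetical tables with a foreign key; all names, fields, and the license are invented, and the exact descriptor a given project needs will of course differ.

```python
# A hypothetical two-table research Data Package: rows in "measurements"
# point back to rows in "samples" via a foreign key, a relationship that a
# lone CSV file cannot express on its own.
descriptor = {
    "name": "example-lab-data",
    "licenses": [{"name": "CC-BY-4.0"}],
    "resources": [
        {
            "name": "samples",
            "path": "samples.csv",
            "schema": {
                "fields": [
                    {"name": "id", "type": "integer"},
                    {"name": "collected_on", "type": "date"},
                ],
                "primaryKey": "id",
            },
        },
        {
            "name": "measurements",
            "path": "measurements.csv",
            "schema": {
                "fields": [
                    {"name": "sample_id", "type": "integer"},
                    {"name": "value", "type": "number"},
                ],
                "foreignKeys": [
                    {
                        "fields": "sample_id",
                        "reference": {"resource": "samples", "fields": "id"},
                    }
                ],
            },
        },
    ],
}
```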

Why does open, granular metadata matter?

With our “Tabular” Data Packages, we focus on packaging data that naturally exists in “tables”—for example, CSV files—a clear area of importance to researchers, as illustrated by guidelines issued by the Wellcome Trust’s publishing platform Wellcome Open Research. The guidelines mandate:
Spreadsheets should be submitted in CSV or TAB format; EXCEPT if the spreadsheet contains variable labels, code labels, or defined missing values, as these should be submitted in SAV, SAS or POR format, with the variable defined in English.

Guidelines like these typically mandate that researchers submit data in non-proprietary formats; SPSS, SAS, and other proprietary data formats are accepted only because they provide important contextual metadata that hasn’t been supported by a standard, non-proprietary format. The Data Package specifications—in particular, our Table Schema specification—provide a method of assigning functional “schemas” for tabular data. This information includes the expected type of each value in a column (“string”, “number”, “date”, etc.), constraints on the value (“this string can only be at most 10 characters long”), and the expected format of the data (“this field should only contain strings that look like email addresses”). The Table Schema can also specify relations between tables, strings that indicate “missing” values, and formatting information. This information can prevent incorrect processing of data at the loading step.

In the absence of these table declarations, even simple datasets can be imported incorrectly in data analysis programs given the heuristic (and sometimes, in Excel’s case, byzantine) nature of automatic type inference. In one example of such an issue, Zeeberg et al. and later Ziemann, Eren and El-Osta describe a phenomenon where gene expression data was silently corrupted by Microsoft Excel:
A default date conversion feature in Excel (Microsoft Corp., Redmond, WA) was altering gene names that it considered to look like dates. For example, the tumor suppressor DEC1 [Deleted in Esophageal Cancer 1] [3] was being converted to ’1-DEC.’ [16]
These errors didn’t stop at the initial publication. As these Excel files were uploaded to other databases, the errors could propagate through data repositories, an example of which took place in the now-replaced “LocusLink” database. At a time when data sharing and reproducible research are gaining traction, the last thing researchers need is file formats leading to errors.
Much like Boxed Water, Packaged Data is better because it is easier to move.
Zeeberg’s team described various technical workarounds to avoid Excel problems, including using Excel’s text import wizard to manually set column types every time the file is opened. However, the researchers acknowledge that this requires constant vigilance to prevent further errors, attention that could be spent elsewhere. Instead, a simple, open, and ubiquitous method to unambiguously declare types in column data—stating that columns containing gene names (e.g. “DEC1”) are strings, not dates, and that “RIKEN identifiers” (e.g. “2310009E13”) are strings, not floating point numbers—paired with an Excel plugin that reads this information, may be able to eliminate the manual steps outlined above.
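To sketch what such a declaration could look like, the fragment below marks the gene-name and RIKEN-identifier columns as strings and shows how a schema-aware importer could use it. The column and file names are invented for illustration; this is not code from the papers cited above.

```python
import pandas as pd

# A hypothetical Table Schema fragment for a gene expression table. Declaring
# the identifier columns as strings removes the ambiguity that invites date
# or number coercion (e.g. "DEC1" becoming "1-DEC").
gene_table_schema = {
    "fields": [
        {"name": "gene_symbol", "type": "string"},       # e.g. "DEC1"
        {"name": "riken_id", "type": "string"},          # e.g. "2310009E13"
        {"name": "expression_level", "type": "number"},
    ],
    "missingValues": ["", "NA"],
}

# A schema-aware importer (an Excel plugin, or a script feeding a data frame
# library) can load the columns with explicit types instead of guessing them:
string_columns = {
    field["name"]: str
    for field in gene_table_schema["fields"]
    if field["type"] == "string"
}
df = pd.read_csv("expression.csv", dtype=string_columns)  # hypothetical file
```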

Granular Metadata Standards Allow for New Tools & Integrations

By publishing this granular metadata with the data, both users and software programs can use it to import data into Excel automatically and correctly, and this benefit also accrues when similar integrations are created for other data analysis software packages, like R and Python. Further, these specifications (and specifications like them) allow for the development of whole new classes of tools to manipulate data without the overhead of Excel, while still including data validation and metadata creation.

For instance, the Open Data Institute has created Comma Chameleon, a desktop CSV editor. You can see a talk about Comma Chameleon on our Labs blog. Similarly, Andreas Billman created SmartCSV.fx to solve the issue of broken CSV files provided by clients. While initially this project depended on an ad hoc schema for data, the developer has since adopted our Table Schema specification. Other approaches that bring spreadsheets together with Data Packages include Metatab, which aims to provide a useful standard, modeled on the Data Package, for storing metadata within spreadsheets. To solve the general case of reading Data Packages into Excel, Nimble Learn has developed an interface for loading Data Packages through Excel’s Power Query add-in.

For examples of other ways in which Excel mangles good data, it is worth reading through Quartz’s Bad Data guide and checking over your data. Also, see our Frictionless Data Tools and Integrations page for a list of integrations created so far. Finally, we’re always looking to hear more user stories for making it easier to work with data in whatever application you are using.

This post was adapted from a paper we will be presenting at the International Digital Curation Conference (IDCC), where our Jo Barratt will present our work to date on Making Research Data Quality Visible.

International Data Week: From Big Data to Open Data

- October 11, 2016 in Frictionless Data, Open Data, Open Knowledge, Open Research, Open Science, Open Standards, Small Data

Report from International Data Week: Research needs to be reproducible, data needs to be reusable, and Data Packages are here to help. International Data Week has come and gone. The theme this year was ‘From Big Data to Open Data: Mobilising the Data Revolution’. Weeks later, I am still digesting all the conversations and presentations (not to mention bagels) I consumed over its course. For a non-researcher like me, it proved to be one of the most enjoyable conferences I’ve attended, with an exciting diversity of ideas on display. In this post, I will reflect on our motivations for attending, what we did, what we saw, and what we took back home.

Three conferences on research data

International Data Week (11-17 September) took place in Denver, Colorado, and consisted of three co-located events: SciDataCon, International Data Forum, and the Research Data Alliance (RDA) 8th Plenary. Our main motivation for attending these events was to talk directly with researchers about Frictionless Data, our project oriented around tooling for working with Data Packages, an open specification for bundling related data together using a standardized JSON-based description format. The concepts behind Frictionless Data were developed through efforts at improving workflows for publishing open government data via CKAN. Thanks to a generous grant from the Sloan Foundation, we now have the ability to take what we’ve learned in civic tech and pilot this approach within various research communities. International Data Week provided one of the best chances we’ve had so far to meet researchers attempting to answer today’s most significant challenges in managing research data. It was time well spent: over the week I absorbed interesting user stories, heard clearly defined needs, and made connections which will help drive the work we do in the months to come.

What are the barriers to sharing research data?

While our aim is to reshape how researchers share data through better tooling and specifications, we first needed to understand what non-technical factors might impede that sharing. On Monday, I had the honor to chair the second half of a session co-organized by Peter Fitch, Massimo Craglia, and Simon Cox entitled Getting the incentives right: removing social, institutional and economic barriers to data sharing. During this second part, Wouter Haak, Heidi Laine, Fiona Murphy, and Jens Klump brought their own experiences to bear on the subject of what gets in the way of data sharing in research.

Mr. Klump considered various models that could explain why and under what circumstances researchers might be keen to share their data—including research being a “gift culture” where materials like data are “precious gifts” to be paid back in kind—while Ms. Laine presented a case study directly addressing a key disincentive for sharing data: fears of being “scooped” by rival researchers. One common theme that emerged across talks was the idea that making it easier to credit researchers for their data, via an enabling environment for data citation, might be a key factor in increasing data sharing. An emerging infrastructure for citing datasets via DOIs (Digital Object Identifiers) might be part of this. More on this later.

“…making it easier to credit researchers for their data via an enabling environment for data citation might be a key factor in increasing data sharing”

What are the existing standards for research data?

For the rest of the week, I dove into the data details as I presented at sessions on topics like “semantic enrichment, metadata and data packaging”, “Data Type Registries”, and the “Research data needs of the Photon and Neutron Science community”. These sessions proved invaluable as they put me in direct contact with actual researchers, where I learned about the existence (or in some cases, non-existence) of community standards for working with data as well as some of the persistent challenges. For example, the Photon and Neutron Science community has a well-established standard in NeXus for storing data; however, some researchers highlighted an unmet need for a lightweight solution for packaging CSVs in a standard way. Other researchers pointed out the frustrating inability of common statistical software packages like SPSS to export data into a high quality (e.g. with all relevant metadata) non-proprietary format as encouraged by most data management plans. And, of course, a common complaint throughout was the amount of valuable research data locked away in Excel spreadsheets with no easy way to package and publish it. These are key areas we are addressing now and in the coming months with Data Packages.

Themes and take-home messages

The motivating factor behind much of the infrastructure and standardization work presented was the growing awareness of the need to make scientific research more reproducible, with the implicit requirement that research data itself be more reusable. Fields as diverse as psychology and archaeology have been experiencing a so-called “crisis” of reproducibility. For a variety of reasons, researchers are failing to reproduce findings from their own or others’ experiments. In an effort to resolve this, concepts like persistent identifiers, controlled vocabularies, and automation played a large role in much of the current conversation I heard.

“…the growing awareness of the need to make scientific research more reproducible, with the implicit requirement that research data itself be more reusable”


Persistent Identifiers

Broadly speaking, persistent identifiers (PIDs) are an approach to creating a reference to a digital “object” that (a) stays valid over long periods of time and (b) is “actionable”, that is, machine-readable. DOIs, mentioned above and introduced in 2000, are a familiar approach to persistently identifying and citing research articles, but there is increasing interest in applying this approach at all levels of the research process from researchers themselves (through ORCID) to research artifacts and protocols, to (relevant to our interests) datasets. We are aware of the need to address this use case and, in coordination with our new Frictionless Data specs working group, we are working on an approach to identifiers on Data Packages.
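To give a flavor of the direction, one possible, purely illustrative shape would be to carry a dataset-level identifier such as a DOI directly in the descriptor; the property name and value below are placeholders rather than the working group’s settled design.

```python
# Purely illustrative: a descriptor carrying a dataset-level persistent
# identifier alongside the usual metadata. The "id" property and the DOI
# below are placeholders, not a finalized part of the specification.
descriptor = {
    "name": "example-research-dataset",
    "id": "https://doi.org/10.0000/example",   # placeholder DOI, not real
    "title": "Example research dataset",
    "resources": [{"name": "measurements", "path": "measurements.csv"}],
}
```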

Controlled Vocabularies

Throughout the conference, there was an emphasis on ensuring that records in published data incorporate some idea of semantic meaning, that is, making sure that two datasets that use the same term or measurement actually refer to the same thing by enforcing the use of a shared vocabulary. Medical Subject Headings (MeSH) from the United States National Library of Medicine is a good example of a standard vocabulary that many datasets use to consistently describe biomedical information. While Data Packages currently do not support specifying this type of semantic information in a dataset, the specification is not incompatible with this approach. As an intentionally lightweight publishing format, our aim is to keep the core of the specification as simple as possible while allowing for specialized profiles that could support semantics.

Automation

There was a lot of talk about increasing automation around data publishing workflows. For instance, there are efforts to create “actionable” Data Management Plans that help researchers walk through describing, publishing and archiving their data. A core aim of the Frictionless Data tooling is to automate as many elements of the data management process as possible. We are looking to develop simple tools and documentation for preparing datasets and defining schemas for different types of data so that the data can, for instance, be automatically validated according to defined schemas.
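For a sense of what that automation can look like, here is a deliberately minimal sketch of checking a CSV against a declared schema; real tooling such as Good Tables covers far more types, formats, and constraints, and the field and file names in the usage comment are invented.

```python
import csv
from datetime import datetime


def _is_float(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def _is_date(value):
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# Checkers for a handful of Table Schema types; real tooling covers many more.
CHECKS = {
    "integer": lambda v: v.lstrip("-").isdigit(),
    "number": _is_float,
    "date": _is_date,
    "string": lambda v: True,
}


def validate(csv_path, fields):
    """Yield (row number, column, value) for every cell that fails its declared type."""
    with open(csv_path, newline="") as f:
        for row_num, row in enumerate(csv.DictReader(f), start=2):
            for field in fields:
                value = row.get(field["name"], "")
                if value and not CHECKS[field["type"]](value):
                    yield row_num, field["name"], value


# Hypothetical usage in an automated publishing workflow:
# errors = list(validate("survey.csv", [{"name": "visit_date", "type": "date"}]))
# if errors:
#     raise SystemExit(f"{len(errors)} validation error(s) found")
```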

Making Connections

Of course, one of the major benefits of attending any conference is the chance to meet and interact with other research projects. For instance, we had really great conversations with the Mackenzie DataStream project, an amazing initiative for sharing and exploring water data in the Mackenzie River Basin in Canada. The technology behind this project already uses the Data Packages specifications, so look for a case study on the work done here on the Frictionless Data site soon.

There is never enough time in one conference to meet all the interesting people and explore all the potential opportunities for collaboration. If you are interested in learning more about our Frictionless Data project or would like to get involved, check out the links below. We’re always looking for new opportunities to pilot our approach. Together, hopefully, we can reduce the friction in managing research data.