
The UK must not be left behind on the road to a more open society

- August 3, 2020 in Open Data, Open Knowledge Foundation, Open Legislation, Policy

The United Kingdom still doesn’t have a National Data Strategy. The idea has been stuck in development hell for years, and the delay has already had an impact. Had a strategy been in place before the coronavirus pandemic, there would have been rules and guidelines to help organisations such as the Department of Health and Social Care and the NHS share data and information. A recent opinion poll for the Open Knowledge Foundation found that nearly two-thirds of people in the UK believe a government data strategy would have helped in the fight against COVID-19.

Just over a year ago, we made a written submission to the UK Government’s consultation on the National Data Strategy, which can be read here. We stressed that the UK National Data Strategy must emphasise the importance and value of openly sharing more, better-quality information and data in order to make the most of the world-class knowledge created by our institutions and citizens. Without this, we warned, businesses, individuals and public bodies would not be able to play a full role in the interconnected world of today and tomorrow. Allowing people to make better decisions and choices informed by data will boost the UK’s economy through greater productivity, but not without the necessary investment in skills. Our proposals included:
  • A data literacy training programme open to local communities to ensure UK workers have the skills for the technological jobs of the future.
  • Greater use of open licences, granting the general public rights to reuse, distribute, combine or modify works that would otherwise be restricted under intellectual property laws.
With a clear commitment from the Government, the UK has an opportunity to be at the forefront of a global future that is fair, free and open. Inevitably, the coronavirus pandemic has disrupted the work of government. But a parliamentary question from Labour MP Ian Murray, Shadow Secretary of State for Scotland, has revealed that the government still ‘aims’ to publish the strategy in 2020. It is disappointing that this is not a cast-iron commitment, although it is certainly a target we hope will be achieved, not least because the Brexit transition period comes to an end at the close of this year and there are serious questions to be addressed about the post-Brexit landscape in the UK.

Last year, an updated EU directive on open data and the re-use of public sector information entered into force. As part of this directive, EU member states – which at the time included the UK – agreed that a list of ‘high-value’ datasets would be drawn up to be provided free of charge. These high-value datasets will fall into the following categories:
  • Geospatial
  • Earth observation and environment
  • Meteorological
  • Statistics
  • Companies and company ownership
  • Mobility
A research team is currently working to create this list of high-value datasets, with the aim of publishing a draft report by September 2020. An Implementing Act is due to be placed before the European Commission for approval in 2021, and EU member states have until July 2021 to make sure that these datasets are available as open data and published via APIs.

What we don’t know is whether the UK Government will adopt these same datasets to help business and civil society create new opportunities post-Brexit, and in a COVID-19 landscape. Another parliamentary question from Ian Murray asked exactly this, but the answer does not commit the government to following suit. The question was answered by the Minister of State for Media and Data, but it was announced earlier this month that the Prime Minister has taken responsibility for the government’s use of data away from the Department for Digital, Culture, Media and Sport and handed it to the Cabinet Office. What happens next will therefore be of huge interest to all of us who work to promote open data.

This week the European Commission published a roadmap on the digital economy and society. It is vital that the UK is not left behind on the road to a more open society.

Exposing legacy project datasets in Digital Humanities: King’s Digital Lab experience

- July 22, 2020 in Labs, Open Research

This is a repost of a blog published by Arianna Ciula on the King’s Digital Lab blog on July 7, 2020. In this blogpost we share our experience at King’s Digital Lab (KDL) of using the Open Knowledge Foundation’s open source data portal platform CKAN to catalogue and make visible some of our legacy projects’ data. While we can call the process a success overall (and you can read more about it in this article and in the summary of our current archiving and sustainability approach), the road has been bumpy and we stumbled across some interesting challenges along the way.
KDL adopted CKAN following an assessment of the institutional repository in place at the time, as well as comparisons of research data management platforms in the literature (e.g. on ‘data FAIRification’ see van Erp, J. A. et al. 2018). While this solution might change over time (including data migration or mapping to and aggregation in other repositories), at the moment it is fit for purpose: it provides a metadata catalogue to store, or to point to, some of our legacy projects’ datasets – and associated contextual documentation – which were not accessible before, substantially expanding the potential for data and resources to be discovered, re-used and critiqued.
First things first, a step back to what KDL is about and what the data we inherited and produced entail. KDL builds on a recent yet relatively long history – for the field of Digital Humanities – of creating tools and web resources in collaboration with researchers in the arts and humanities as well as the cultural heritage sector. While KDL started operation as a team of Research Software Engineers within the Faculty of Arts and Humanities at King’s College London (UK) in 2015, some of the projects we inherited were developed 5, 10 or even 20 years before the Lab’s existence. Of the ca. 100 legacy projects, some started in the late 1990s or early 2000s as collaborative projects led or co-led by the Department of Digital Humanities (DDH). The tools and resources KDL inherited span a wide spectrum, from text analysis and annotation tools, digital corpora of texts, images and musical scores to digital editions, historical databases and layered maps.
The resources you will find in the KDL CKAN instance aren’t numerous yet, but our plan is to increase their number as further support is obtained for a project undertaken by KDL (in particular with the involvement of Samantha Callaghan, Paul Caton, Arianna Ciula, Neil Jakeman, Brian Maher, Pam Mellen, Miguel Vieira and Tim Watts) in collaboration with colleagues and students at the Department of Digital Humanities (Paul Spence, Kristen Shuster, Minmin Yu). This work was a continuation of the wider archiving and sustainability effort described in the Smithies et al. 2019 article and was made possible by a seed fund grant offered by DDH, complemented by a student internship on the MA in Digital Humanities.
The datasets and resources we collected and catalogued range from summarised (so-called ‘calendared’) editions of medieval documents to collections of modern correspondence, and from ontologies adapted to express complex entities and relations in medieval documents to the corresponding guides for encoding and data modelling. The default CKAN mask, mapped to international cataloguing standards, allows the capture of important dataset information such as creator and maintainer details, version, etc. However, given that our legacy datasets are mainly project-based, we also decided to enhance the catalogue with project metadata (see related code at our github repository), ranging from information about the collaborative teams (typically including academics, archivists, designers, software engineers and analysts) to details on funders and period of activity. This slight modification of the data ingest form was then re-used in a currently active project – MaDiH (مديح): Mapping Digital Cultural Heritage in Jordan – which is scoping the landscape of Jordanian cultural heritage datasets and also opted for a KDL-hosted CKAN instance as its core solution architecture (the code associated with other MaDiH-specific CKAN extensions, including detailed tagging for time periods and data types, is available at this other github repository).
The CKAN mask for KDL instance (part 1)

The CKAN mask for KDL instance (part 2)
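The ingest form shown above can also be driven programmatically. As a purely illustrative sketch, the snippet below creates a catalogue entry on a CKAN instance via the Action API and attaches project-level metadata as extras. It is a rough sketch under stated assumptions: the instance URL, API key, dataset name and extra field names are invented for the example and are not the fields defined in the KDL schema.

```python
# Illustrative only (not KDL's actual code): create a CKAN dataset with
# project-level metadata via the Action API. All values below are placeholders.
import requests

CKAN_URL = "https://ckan.example.org"        # hypothetical KDL-hosted instance
API_KEY = "xxxx-xxxx"                        # an API token with create rights

dataset = {
    "name": "example-medieval-charters",     # hypothetical dataset slug
    "title": "Example Medieval Charters (calendared edition)",
    "notes": "Summarised ('calendared') edition of medieval documents.",
    "license_id": "cc-by",
    # project metadata of the kind added to the default ingest form;
    # the exact keys here are assumptions, not the KDL field names
    "extras": [
        {"key": "project_team", "value": "academics; archivists; software engineers"},
        {"key": "funder", "value": "Example Research Council"},
        {"key": "project_period", "value": "2006-2010"},
    ],
}

response = requests.post(
    f"{CKAN_URL}/api/3/action/package_create",
    json=dataset,
    headers={"Authorization": API_KEY},
)
response.raise_for_status()
print(response.json()["result"]["id"])
```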
What follows is the workflow we implemented for our cataloguing project:
  1. Dataset and resource selection
  2. Preliminary data entries by analyst and/or supervised student
  3. Internal peer review
  4. Communications with partners providing project overview, outline of benefits, some technical details (information on CKAN, list of resources to be exposed, license for the data, preview details) and requesting consent
  5. Data publication (if consent obtained)
  6. Public comms and dissemination (e.g. on social media)
  7. Creation of Digital Object Identifier at dataset level via DataCite membership of King’s College London library
  8. Update of citation field once the DOI is obtained (see the sketch after this list)
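The sketch below illustrates what step 8 could look like if automated against the CKAN Action API: once DataCite has minted a DOI, the corresponding catalogue record is patched so its citation points to it. This is not KDL’s actual tooling; the instance URL, API key, dataset identifier, ‘citation’ field name and DOI are all placeholder assumptions.

```python
# Hypothetical automation of step 8: write the DataCite DOI back into the CKAN record.
# URL, API key, dataset id, the 'citation' extra and the DOI itself are placeholders.
import requests

CKAN_URL = "https://ckan.example.org"   # assumed CKAN instance, not the real KDL address
API_KEY = "xxxx-xxxx"                   # an API token with edit rights on the dataset

payload = {
    "id": "example-medieval-charters",  # dataset name or id in CKAN
    # note: package_patch replaces top-level fields wholesale, so in practice any
    # existing extras would need to be fetched first and re-supplied alongside this one
    "extras": [
        {"key": "citation",
         "value": "Author(s) (2020). Example dataset. King's Digital Lab. https://doi.org/10.xxxx/example"},
    ],
}

response = requests.post(
    f"{CKAN_URL}/api/3/action/package_patch",
    json=payload,
    headers={"Authorization": API_KEY},
)
response.raise_for_status()
```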
With respect to points 1 and 2, the cataloguing information provided for each project (equivalent to a dataset in CKAN parlance) is rather high level; however, even at this minimal level, more often than not, digging into legacy documentation is not trivial and requires making tacit knowledge within the lab explicit, or contacting partners to elicit further context, information and the rationale for resource selection and ingestion. For example, despite KDL legacy projects being informed by best practices in digital humanities, such as the use of standards and a general openness to data re-use, licences were not always agreed at the time of data creation, in some cases leaving room for interpretation or substantial discussion regarding data ownership and exposure. In addition, in academic research, even when projects are long completed and unfunded, the collected and created data often continue to be manipulated and analysed to inform further research and new arguments. While we had agreed to expose data which were considered ‘complete’, often multiple versions of the ‘same’ resource co-exist (adequately time-stamped or contextualised in narrative form) to showcase the constructed nature of these materials and their workflows.
Data exposure and publication has now become a key element in King’s Digital Lab’s approach to project development, as well as to our archiving and sustainability model. Dataset deposit within the Lab, as part of institutional technical systems, or in external repositories is an option assessed at several stages of a project lifecycle, from initial conversations with project partners when discussing a new project idea to the post-funding phase and maintenance of legacy projects (see more on our approach in this guidance to research data management). Data publication on the KDL CKAN instance mainly addresses the issue of hidden datasets for our legacy projects at the moment; however, cataloguing project metadata and exposing project datasets via CKAN is one of the options KDL now also offers to new project partners.
Not only can shifting from systems to data ease the maintenance burden of many long-running projects, but it also opens up possibilities for data re-use, verification and integration beyond siloed resources. Data exposure is, however, not enough to ensure access, and should not mask the need for attention to standards, workflows, systems and services (see the recent ALLEA report on “Sustainable and FAIR Data Sharing in the Humanities”). Tailoring project solutions to specific research questions and domains, while at the same time aligning with existing community standards within the Linked Open Data paradigm, continues to be a challenging yet fruitful area of research and ongoing activity at KDL. For example, our research software engineers are currently working towards integrating the web framework most used in KDL’s technical stack – Django – with relevant APIs to align with specific standards (e.g. bibliographic RDF data models; Linked Open Data resources for people and location entities) or to extend them as needed, with project code published on relevant software repositories (see https://github.com/kingsdigitallab/) under an open licence. Sleeves up, as there is a lot of work still to be done…
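To make the Linked Open Data alignment mentioned above a little more concrete, here is a small, purely illustrative sketch (not KDL’s implementation) that models a person record as RDF and links it to an external authority identifier using the rdflib Python library. The local namespace, the person and the VIAF URI are all invented for the example.

```python
# Illustrative only: expressing a local 'person' entity as RDF and aligning it with a
# Linked Open Data authority record. All URIs and names below are placeholders.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, OWL, RDF

KDL = Namespace("https://example.org/kdl/entities/")   # hypothetical local namespace

g = Graph()
person = KDL["person/0001"]
g.add((person, RDF.type, FOAF.Person))
g.add((person, FOAF.name, Literal("Example Scribe")))
# link the local record to an external authority (placeholder VIAF identifier)
g.add((person, OWL.sameAs, URIRef("https://viaf.org/viaf/000000000")))

print(g.serialize(format="turtle"))
```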

OKI wins funds from ODI to create Open Data publication toolkit

- October 31, 2017 in Data Quality, Frictionless Data, News, ODI

Open Knowledge International (OKI) has been awarded funds by the Open Data Institute (ODI) as part of a project to enhance and increase adoption of tools and services for open data publishers in the private and public sectors, reducing barriers to publication. OKI’s focus in this programme will be to create better open data publication workflows by building on our earlier work on the Frictionless Data initiative.

We will be implementing significant incremental improvements to a range of code libraries and tools that are loosely aligned around our Frictionless Data project, in which we are working to remove the friction in working with data by developing a set of tools, standards, and best practices for publishing data. The work will be presented as part of a new toolkit specifically targeted at both technical and non-technical users of data within the public sector, businesses, and the data community. We will perform additional user research in government and non-governmental contexts, design and enhance user interfaces for non-technical users, and implement integrations of tooling with existing workflows as well as work towards new ones. The reports, research and tools produced will become practical assets that can be used and added to by others, to continue to explore how data can and should work in our societies and economies.

Innovate UK, the UK’s innovation agency, is providing £6 million over three years to the ODI to advance knowledge and expertise in how data can shape the next generation of public and private services, and create economic growth. The work on improving the conditions for data publishing is one of six projects chosen by the ODI in this first year of the funding. Olivier Thereaux, Head of Technology at the ODI, said:
‘Our goals in this project are to truly understand what barriers exist to publishing high quality data quickly and at reasonable cost. We’re happy to be working with OKI, and to be building on its Frictionless Data initiative to further the development of simpler, faster, higher quality open data publishing workflows.’

On announcing the funding on 17th October, Dr Jeni Tennison, CEO at the ODI, said:
‘The work we are announcing today will find the best examples of things working well, so we can share and learn from them. We will take these learnings and help businesses and governments to use them and lead by example.’
A major focus for the Product Team at Open Knowledge International over the last two years has been data quality and the automation of data processing. Data quality is arguably the greatest barrier to useful and usable open data, and we have been addressing it directly via the specifications and tooling of Frictionless Data. Our focus in this project will be to develop ways for non-technical users to employ tools for automation, reducing the potential for manual error and increasing productivity. We see speed of publication and lower publication costs as two areas directly enhanced by better tooling and workflows for quality and automation, and this is something the development of this toolkit will directly address. People are fundamental to quality, curated, open data publication workflows. However, by automating more aspects of the “publication pipeline”, we not only reduce the need for manual intervention, but can also increase the speed at which open data can be published.
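As a concrete example of the kind of automated quality checking this tooling enables, the snippet below validates a CSV file with goodtables, one of the Frictionless Data libraries. It is a minimal sketch: the file name is a placeholder, and the report handling assumes the library’s standard JSON report layout.

```python
# Minimal sketch of automated tabular data validation with goodtables
# (pip install goodtables). "data.csv" is a placeholder file name.
from goodtables import validate

report = validate("data.csv")      # checks structural problems; add a schema to check types too
if report["valid"]:
    print("data.csv passed validation")
else:
    print(f"{report['error-count']} problem(s) found")
    for table in report["tables"]:
        for error in table["errors"]:
            # each error carries a human-readable message plus row/column details
            print(error["message"])
```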

To keep up to date on our progress, join the Frictionless Data Discuss forum, or ask the team a direct question on the Gitter channel.

Frictionless Data Case Study: data.world

- April 11, 2017 in Frictionless Data

Open Knowledge International is working on the Frictionless Data project to remove the friction in working with data. We are doing this by developing a set of tools, standards, and best practices for publishing data. The heart of Frictionless Data is the Data Package standard, a containerization format for any kind of data based on existing practices for publishing open-source software. We’re curious to learn about some of the common issues users face when working with data. In our Case Study series, we are highlighting projects and organisations who are working with the Frictionless Data specifications and tooling in interesting and innovative ways. For this case study, we interviewed Bryon Jacob of data.world. More case studies can be found at http://frictionlessdata.io/case-studies.

How do you use the Frictionless Data specs and what advantages did you find in using the Data Package approach?

We deal with a great diversity of data, both in terms of content and in terms of source format – most people working with data are emailing each other spreadsheets or CSVs, and not formally defining schema or semantics for what’s contained in these data files. When data.world ingests tabular data, we “virtualize” the tables away from their source format, and build layers of type and semantic information on top of the raw data. This allows us to produce a clean Tabular Data Package for any dataset, whether the input is CSV files, Excel spreadsheets, JSON data, SQLite database files – any format that we know how to extract tabular information from – we can present it as cleaned-up CSV data with a datapackage.json that describes the schema and metadata of the contents.
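For readers who have not seen one produced, the sketch below shows roughly how such a Tabular Data Package can be built with the datapackage-py library: the CSV files become resources, and a datapackage.json descriptor captures their inferred schema and metadata. It is a generic illustration, not data.world’s pipeline, and the paths and title are placeholders.

```python
# Rough illustration (not data.world's code) of building a Tabular Data Package:
# infer resources and Table Schemas from CSV files, then write datapackage.json.
# Requires datapackage-py (pip install datapackage); paths and title are placeholders.
from datapackage import Package

package = Package()
package.infer("data/*.csv")          # add each CSV as a resource and infer its Table Schema
package.descriptor["title"] = "Example dataset exported as a Tabular Data Package"
package.commit()                     # apply the descriptor change
package.save("datapackage.json")     # write the descriptor next to the data files
```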

What else would you like to see developed?

Graph data packages, or “Universal Data Packages” that can encapsulate both tabular and graph data. It would be great to be able to present tabular and graph data in the same package and develop tools that know how to use these things together. To elaborate on this, it makes a lot of sense to normalize tabular data down to clean, well-formed CSVs. For data that is more graph-like, it would also make sense to normalize it to a standard format. RDF is a well-established and standardized format, with many serialized forms that could be used interchangeably (RDF/XML, Turtle, N-Triples, or JSON-LD, for example). The metadata in the datapackage.json would be extremely minimal, since the schema for RDF data is encoded into the data file itself. It might be helpful to use the datapackage.json descriptor to catalog the standard taxonomies and ontologies that were in use, for example it would be useful to know if a file contained SKOS vocabularies or OWL classes.

What are the next things you are going to be working on yourself?

We want to continue to enrich the metadata we include in Tabular Data Packages exported from data.world, and we’re looking into using datapackage.json as an import format as well as export.

How do the Frictionless Data specifications compare to existing proprietary and nonproprietary specifications for the kind of data you work with?

data.world works with lots of data across many domains – what’s great about the Frictionless Data specs is that they provide a lightweight content standard that can be a starting point for building domain-specific content standards – it really helps with the “first mile” of standardising data and making it interoperable.

What do you think are some other potential use cases?

In a certain sense, a Tabular Data Package is sort of like an open-source, cross-platform, accessible replacement for spreadsheets that can act as a “binder” for several related tables of data. I could easily imagine web or desktop-based tools that look and function much like a traditional spreadsheet, but use Data Packages as their serialization format.

Who else do you think we should speak to?

Data science IDE (Integrated Development Environment) producers – RStudio, Rodeo (Python), Anaconda, Jupyter – anything that operates on Data Frames as a fundamental object type should provide first-class tool and API support for Tabular Data Packages.

What should the reader do after reading this Case Study?

To read more about Data Package integration at data.world, read our post: Try This: Frictionless data.world. Sign up and start playing with data. Have a question or comment? Let us know in the forum topic for this case study.

Frictionless Data Case Study: John Snow Labs

- April 6, 2017 in Frictionless Data

Open Knowledge International is working on the Frictionless Data project to remove the friction in working with data. We are doing this by developing a set of tools, standards, and best practices for publishing data. The heart of Frictionless Data is the Data Package standard, a containerization format for any kind of data based on existing practices for publishing open-source software. We’re curious to learn about some of the common issues users face when working with data. In our Case Study series, we are highlighting projects and organisations who are working with the Frictionless Data specifications and tooling in interesting and innovative ways. For this case study, we interviewed Ida Lucente of John Snow Labs. More case studies can be found at http://frictionlessdata.io/case-studies.

What does John Snow Labs do?

John Snow Labs accelerates data science and analytics teams, by providing clean, rich and current data sets for analysis. Our customers typically license between 50 and 500 data sets for a given project, so providing both data and metadata in a simple, standard format that is easily usable with a wide range of tools is important.

What are the challenges you face working with data?

Each data set we license is curated by a domain expert and then goes through both an automated DataOps platform and a manual review process. This is done in order to deal with a string of data challenges. First, it’s often hard to find the right data sets for a given problem. Second, data files come in different formats, and include dirty and missing data. Data types are inconsistent across different files, making it hard to join multiple data sets in one analysis. Null values, dates, currencies, units and identifiers are represented differently. Datasets aren’t updated on a standard or public schedule, which often requires manual labor to know when they’ve been updated. And then, data sets from different sources have different licenses – we use over 100 data sources, which means well over 100 different data licenses that we help our clients be compliant with.

How are you working with the specs?

The most popular data format in which we deliver data is the Data Package (see http://frictionlessdata.io/data-packages). Each of our datasets is available, among other formats, as a pair of data.csv and datapackage.json files, complying with the specs at http://specs.frictionlessdata.io. We currently provide over 900 data sets that leverage the Frictionless Data specs.
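As a rough illustration of how a customer might consume one of these data.csv / datapackage.json pairs, the sketch below loads a package with the datapackage-py library and reads its first resource, with values cast according to the Table Schema. The descriptor path is a placeholder and the snippet is not John Snow Labs’ own tooling.

```python
# Minimal sketch of reading a data.csv / datapackage.json pair with datapackage-py
# (pip install datapackage). The descriptor path is a placeholder.
from datapackage import Package

package = Package("datapackage.json")
print(package.descriptor["name"])        # dataset-level metadata from the descriptor
resource = package.resources[0]          # the data.csv resource
rows = resource.read(keyed=True)         # list of dicts, values cast per the Table Schema
print(rows[:3])
```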

How did you hear about Frictionless Data?

Two years ago, when we were defining the product requirements and architecture, we researched six different standards for metadata definition over a few months. We found Frictionless Data as part of that research and, after careful consideration, decided to adopt it for all the datasets we curate. The Frictionless Data specifications were the simplest to implement, the simplest to explain to our customers, and they enable immediate loading of data into the widest variety of analytical tools.

What else would you like to see developed?

Our data curation guidelines have added more specific requirements that are underspecified in the standard. For example, there are guidelines for dataset naming, keywords, length of the description, field naming, identifier field naming and types, and some of the properties supported for each field. Adding these to the Frictionless Data standard would make it harder to comply with, but would also raise the quality bar of standard datasets; so it may be best to add them as recommendations. Another area where the standard is worth expanding is a more explicit definition of the properties of each data type – in particular geospatial data, timestamp data, identifiers, currencies and units. We have found a need to extend the type system and the properties for each field’s type, in order to enable consistent mapping of schemas to the different analytics tools that our customers use (Hadoop, Spark, MySQL, ElasticSearch, etc.). We recommend adding these to the standard.
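To show what such field-level extensions might look like in practice, here is an illustrative Table Schema descriptor (not John Snow Labs’ actual guidelines) with hypothetical extra properties on a currency field, read with the tableschema-py library. The ‘unit’ and ‘currency’ keys are invented examples of the kind of extension discussed above; standard tooling will generally just ignore them.

```python
# Illustrative Table Schema with hypothetical extended field properties ("unit", "currency").
# Requires tableschema-py (pip install tableschema); the field names are made up.
from tableschema import Schema

descriptor = {
    "fields": [
        {"name": "admission_date", "type": "date"},
        {"name": "charge_amount", "type": "number",
         "unit": "USD", "currency": "USD"},   # non-standard properties, ignored by most tools
    ]
}

schema = Schema(descriptor)
print(schema.field_names)                        # ['admission_date', 'charge_amount']
print(schema.cast_row(["2017-04-06", "125.50"])) # values cast to date and Decimal
```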

What are the next things you are going to be working on yourself?

We are working with Open Knowledge International on open sourcing some of the libraries and tools we’re building. Internally, we are adding more automated validations, additional output file formats, and automated pipelines to load data into ElasticSearch and Kibana, to enable interactive data discovery & visualization.

What do you think are some other potential use cases?

The core use case is making data ready for analytics. There is a lot of Open Data out there, but a lot of effort is still required to make it usable. This single use case expands into as many variations as there are BI & data management tools, so we have many years of work ahead of us to address this one core use case.

CSV,Conf is back in 2017! Submit talk proposals on the art of data collaboration.

- January 12, 2017 in Events, Featured

CSV,Conf,v3 is happening! This time the community-run conference will be in Portland, Oregon, USA on the 2nd and 3rd of May 2017. It will feature stories about data sharing and data analysis from science, journalism, government, and open source. We want to bring together data makers/doers/hackers from backgrounds like science, journalism, open government and the wider software industry to share knowledge and stories. csv,conf is a non-profit community conference run by people who love data and sharing knowledge. This isn’t just a conference about spreadsheets. CSV Conference is a conference about data sharing and data tools. We are curating content about advancing the art of data collaboration, from putting your data on GitHub to producing meaningful insight by running large-scale distributed processing on a cluster.

Talk proposals for CSV,Conf close Feb 15, so don’t delay, submit today! The deadline is fast approaching and we want to hear from a diverse range of voices from the data community. Talks are 20 minutes long and can be about any data-related concept that you think is interesting. There are no rules for our talks; we just want you to propose a topic you are passionate about and think a room full of data nerds will also find interesting. You can check out some of the past talks from csv,conf,v1 and csv,conf,v2 to get an idea of what has been pitched before.

If you are passionate about data and the many applications it has in society, then join us in Portland!

Speaker perks:
  • Free pass to the conference
  • Limited number of travel awards available for those unable to pay
  • Did we mention it’s in Portland in the Spring????
Submit a talk proposal today at csvconf.com.  Early bird tickets are now on sale here. If you have colleagues or friends who you think would be a great addition to the conference, please forward this invitation along to them! CSV,Conf,v3 is committed to bringing a diverse group together to discuss data topics. For questions, please email csv-conf-coord@googlegroups.com, DM @csvconference or join the public slack channel. – the csv,conf,v3 team
