
6 lessons from sharing humanitarian data

- September 30, 2015 in Open Knowledge

Cross-posted from scraperwiki.com. This post is a write-up of the talk I gave at Strata London in May 2015 called “Sharing humanitarian data at the United Nations”. You can find the slides on that page. The Humanitarian Data Exchange (HDX) is an unusual data hub. It’s made by the UN, and is successfully used by agencies, NGOs, companies, Governments and academics to share data. They’re doing this during crises such as the Ebola epidemic and the Nepal earthquakes, and every day to build up information in between crises. There are lots of data hubs which are used by one organisation to publish data, but far fewer which are used by lots of organisations to share data. The HDX project did a bunch of things right. What were they? Here are six lessons…

1) Do good design

HDX started with user needs research. This was expensive, and immediately worth it because it stopped a large part of the project which wasn’t needed. That research led to design work which has made the website seem simple and beautiful – particularly unusual for something from a large bureaucracy like the UN.

2) Build on existing software

When making a hub for sharing data, there’s no need to build something from scratch. Open Knowledge’s CKAN software is open source; this stuff is a commodity. HDX has developers who modify and improve it for the specific needs of humanitarian data.
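Because HDX is built on CKAN, it inherits CKAN’s standard action API, so finding and downloading data can be scripted. Here is a minimal sketch; the HDX URL is its public address at the time of writing and the search term is illustrative, so check the instance’s API documentation before relying on it:

```python
import requests

# Any CKAN-based hub (HDX included) exposes CKAN's standard action API.
# Site URL and query are illustrative; adjust for the instance you're using.
CKAN_SITE = "https://data.humdata.org"

def search_datasets(query, rows=5):
    """Search a CKAN instance for datasets matching a free-text query."""
    resp = requests.get(
        f"{CKAN_SITE}/api/3/action/package_search",
        params={"q": query, "rows": rows},
        timeout=30,
    )
    resp.raise_for_status()
    result = resp.json()["result"]
    for pkg in result["results"]:
        # Each package lists downloadable resources (CSV, shapefile, ...).
        formats = {r.get("format", "?") for r in pkg.get("resources", [])}
        print(pkg["title"], "-", ", ".join(sorted(formats)))
    return result

if __name__ == "__main__":
    search_datasets("ebola")
```

The same call works against any CKAN-based hub, which is part of the appeal of building on commodity software.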

3) Use experts

HDX has a great international team – the leader is in New York, most of the developers are in Romania, and there’s a data lab in Nairobi. Crucially, they bring in specific outside expertise: frog design do the user research and design work, and ScraperWiki, experts in data collaboration, provide operational management.

4) Measure the right things

HDX’s metrics cover both sides of its two-sided network. Are users who visit the site actually finding and downloading data they want? Are new organisations joining to share data? They’re avoiding “vanity metrics”, taking inspiration from tech startup concepts like “pirate metrics”.
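As a rough illustration of the kind of non-vanity metric described here – of the people who visit, how many actually get the data they came for – below is a toy calculation over a hypothetical event log. The event names and data are invented for the sketch:

```python
from collections import defaultdict

# Hypothetical event log of (user_id, event) pairs; names are made up purely
# to illustrate a "did they get what they came for" metric.
events = [
    ("u1", "visit"), ("u1", "search"), ("u1", "download"),
    ("u2", "visit"), ("u2", "search"),
    ("u3", "visit"), ("u3", "download"),
]

def download_conversion(events):
    """Share of visiting users who went on to download at least one dataset."""
    by_user = defaultdict(set)
    for user, event in events:
        by_user[user].add(event)
    visitors = [u for u, evs in by_user.items() if "visit" in evs]
    downloaders = [u for u in visitors if "download" in by_user[u]]
    return len(downloaders) / len(visitors) if visitors else 0.0

print(f"download conversion: {download_conversion(events):.0%}")  # 67%
```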

5) Add features specific to your community

There are endless features you can add to data hubs – most add no value and end up as a cost to maintain. HDX adds specific features valuable to its community. For example, much humanitarian data comes as “shape files”, a standard format for geographical information. HDX automatically renders a beautiful map of these – essential for users who don’t have ArcGIS, and a good check for those who do.
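HDX does this rendering server-side. For a rough local equivalent, here is a minimal sketch using geopandas and matplotlib (my choice of libraries, not HDX’s) to preview a downloaded shapefile without ArcGIS; the file name is a placeholder:

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Minimal local preview of a shapefile, for readers without ArcGIS.
# The path is a placeholder; point it at any .shp downloaded from a data hub.
gdf = gpd.read_file("admin_boundaries.shp")

ax = gdf.plot(edgecolor="black", linewidth=0.5, figsize=(8, 8))
ax.set_title("Quick preview of a shapefile")
ax.set_axis_off()
plt.savefig("preview.png", dpi=150)
```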

6) Trust in the data

The early user research showed that trust in the data was vital. For this reason, not just anyone can come along and add data. New organisations have to apply, proving either that they’re known in humanitarian circles or that they have quality data to share. Applications are checked by hand. It’s important to get this kind of balance right – being too ideologically open or closed doesn’t work.

Conclusion

The detail of how a data sharing project is run really matters. Most data in organisations gets lost, left in spreadsheets on dying file shares. We hope more businesses and Governments will build a good culture of sharing data in their industries, just as HDX is building one for humanitarian data.

The best data opens itself on UK Gov’s Performance Platform

- January 20, 2014 in Open Government Data

This is a guest post by Francis Irving (@frabcus), CEO of ScraperWiki, who has made several of the world’s first civic websites such as TheyWorkForYou and WhatDoTheyKnow.

This is the third in a series of posts about the UK Government’s Performance Platform. Part 1 introduced why the platform is exciting, and part 2 described how it worked inside.

The best data opens itself. No need to make Freedom of Information requests to pry the information out of the state. No need to build massive directories as checklists for civil servants to track what they’re releasing. Instead, the data is just there. The code just opens it up naturally as part of what it does. One of the unspoken exciting things about the UK Government’s Performance Platform is that it is releasing a whole bunch of open data. Here are two examples.

1. Licensing performance

This is a graph (with data underneath, of course!) of pet shop licences applied for over time in various counties. It’s part of a larger system which will eventually cover all the different types of licence across the country. You can already find alcohol, food, busking… lots of topics. As always with open data, there’ll be many unpredictable uses, and most users will use it quietly – you will never know they did. Perhaps a manager at Pets at Home can spot changing pet shop market conditions, or a musician carefully examine the busking licence data…

2. Tax disc for vehicles

Basic data about transactional services can potentially tell you a lot about the economy. For example, the graph of vehicle tax disc applications could tell an auto dealer – or a hedge fund! – about car ownership. It is constantly updated, so you’re getting much fresher data than any current national statistics. If you need it, the current number of users online is updated in real time. As the performance platform expands, I’d expect it to offer breakdowns by location and type of vehicle. A charity can learn about digital inclusion from this open data: how many people are applying online as opposed to at a post office?

The future

Already, with the performance platform only in its alpha phase, numerous datasets are being released as a side effect. This will grow for several reasons:
  • GDS aspire to have hundreds of services covered, across the whole range of Government.
  • Service managers in departments can get extra visualisations they need, extending the diversity of data.
  • At some point politicians will start asking for more things to be measured.
  • Maybe in the end activists will make pull requests to improve the data released.
This is great for businesses, charities, citizens, and the Government itself. A fundamentally new kind of open data – that which transactional services can spit out automatically. Making things open makes things better. What data are you looking forward to the performance platform accidentally releasing for you?

9 models to scale open data – past, present and future

- July 18, 2013 in Business, Featured, Ideas and musings, Open Data

The possibilities of open data have been enthralling us for 10 years. I came to it through wanting to make Government really usable, to build sites like TheyWorkForYou. But that excitement isn’t what matters in the end. What matters is scale – which organisational structures will make this movement explode, whether by creating self-growing volunteer communities or by generating flows of money? This post quickly and provocatively goes through some that haven’t worked (yet!) and some that have.

Ones that are working now

1) Form a community to enter new data. Open Street Map and MusicBrainz are two big examples. It works because the community is the originator of the data. That said, neither has dominated its industry as much as I thought they would have by now.

2) Sell tools to an upstream generator of open data. This is what CKAN does for central Governments (and what the new ScraperWiki CKAN tool helps with). It’s what mySociety does when selling FixMyStreet installs to local councils, thereby publishing their potholes as RSS feeds.

3) Use open data (quietly). Every organisation does this and never talks about it. It’s key to quite old data resellers like Bloomberg. It is what most of ScraperWiki’s professional services customers ask us to do. The value to society is enormous and invisible. The big flaw is that it doesn’t help scale the supply of open data.

4) Sell tools to downstream users. This isn’t necessarily open data specific – existing software like spreadsheets and Business Intelligence tools can be used with open or closed data. Lots of open data is on the web, so tools like the new ScraperWiki, which work well with web data, are particularly suited to it.

Ones that haven’t worked

5) Collaborative curation. ScraperWiki started as an audacious attempt to create an open data curation community, based on editing scraping code in a wiki. In its original form (now called ScraperWiki Classic) this didn’t scale. Here are some reasons, in terms of open data models, why it didn’t:
  a. It wasn’t upstream. Whatever provenance you give, people trust data most when they get it straight from its source. Being upstream can also be partial – for example, supplementing scraped data with new data manually gathered by telephone.
  b. It wasn’t private. Although in theory there’s lots to gain by wrangling commodity data together in public, it goes against the instincts of most organisations.
  c. There wasn’t enough existing culture. The free software movement built a rich culture of collaboration, ready to be exploited some 15 years in by the open source movement, and 25 years later by tools like Github. With a few exceptions, notably OpenCorporates, there aren’t yet many open data curation projects.

6) General purpose data marketplaces, particularly ones that are mainly reusing open data, haven’t taken off. They might one day, but I think they need well-adopted higher-level standards for data formatting and syncing first (perhaps something like dat, perhaps something based on CSV files).

Ones I expect more of in the future

These are quite exciting models which I expect to see a lot more of.

7) Give labour/money to upstream to help them create better data. This is quite new. The only, and most excellent, example of it is the UK’s National Archives curating the Statute Law Database. They do the work with the help of staff seconded from commercial legal publishers and other parts of Government. It’s clever because it generates money for upstream, which people trust the most, and which has the most ability to improve data quality.

8) Viral open data licensing. MySQL made lots of money this way, offering proprietary dual licences of GPL’d software to embedded systems makers. In data this could use OKFN’s Open Database License, and organisations would pay when they wanted to mix the open data with their own closed data. I don’t know anyone actively using it, although Chris Taggart from OpenCorporates mentioned this model to me years ago.

9) Corporations release data for strategic advantage. Companies are starting to release their own data for strategic gain. This is very new. Expect more of it.

What have I missed? What models do you see that will scale Open Data, and bring its benefits to billions?

World’s first REAL commercial open data curation project!

- October 3, 2012 in Featured, Legal, Open Data, Policy

The following post is by Francis Irving, CEO of ScraperWiki.

Our laws are still published on calf skin (vellum)

Can you think of an open data curation project where the people who work on it come from multiple commercial companies? In the mid 1990s, as open source code began to boom, the equivalent was commonplace. Geeks working at ISPs would together patch the Apache webserver into shape. Startups like Red Hat would pay for staff to work on lots of projects in order to produce a whole operating system. For years I’ve asked, where are the equivalent projects in open data? Nada. Not one. Until today. I finally found one.

It’s the UK’s Statute Law database, which is maintained by the National Archives. I explained back in 2006 how it used to be proprietary data, and how it was finally opened up in an incomplete form. Briefly, Parliament doesn’t release a usable set of laws. They release Acts, which are changes to laws (patch files, if you’re a geek). These need to be “consolidated” with existing laws into the actual rules we have to obey. Two commercial companies (LexisNexis and Westlaw, so called after centuries of takeovers) do this consolidation themselves. They charge a handsome price. Nobody can compete with them, as would-be competitors don’t have the current laws to start from, even if they had the money to keep up with new changes.

I spent a chunk of yesterday afternoon talking to John Sheridan from the National Archives. He runs the Government’s Statute Law project. Jeni Tennison is his technical mastermind. Last time I spoke to her, a year or two ago, she was worried that they would never finish the work. The sheer volume of new laws and the difficulty of consolidation seemed insurmountable. Would they ever have a complete image of current law?
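Before getting to how they cracked it, here is a toy sketch of what consolidation means in the “patch file” sense above. The section numbers, wording and effect types are invented purely for illustration:

```python
# Toy model of consolidation: an Act amends existing law much like a patch
# amends source code. Section numbers and wording here are invented.
base_law = {
    "s1": "A dog licence costs 37 pence.",
    "s2": "Licences are issued by the local authority.",
}

# Each "effect" substitutes, inserts or repeals a provision.
effects = [
    ("substitute", "s1", "A dog licence costs 5 pounds."),
    ("insert", "s3", "Guide dogs are exempt from the licence fee."),
    ("repeal", "s2", None),
]

def consolidate(law, effects):
    """Apply amending effects in order to get the consolidated text."""
    law = dict(law)
    for kind, section, text in effects:
        if kind in ("substitute", "insert"):
            law[section] = text
        elif kind == "repeal":
            law.pop(section, None)
    return law

for section, text in sorted(consolidate(base_law, effects).items()):
    print(section, "-", text)
```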
Now they’ve cracked it. By forming the world’s first real open data curation project.
I’ll start with a quote from one of the red-in-tooth-and-claw companies who are contributing to this.
I represent the Practical Law Company, one of the private sector organisations involved in the Expert Participation Programme. We’re really excited by these developments and salute John Sheridan and his team for their groundbreaking and elegant work on the API and legislation database. Legislation.gov.uk is the official publishing place for UK legislation and so it is really important work.
The programme is now starting to make a real and visible difference to the status of legislation on the website. By employing people to work with National Archives and as a first step, we’ve been able to ensure that the Companies Act 2006 is now fully consolidated on legislation.gov.uk. This is a particularly important piece of legislation for many of our customers but we intend to carry on the consolidation work on other legislation.
Well done, National Archives. (Source: comment by Elizabeth Woodman)
Truly collaborative

The astonishing process goes roughly like this:
  1. John and Jeni and their team build an amazing web admin interface for skilled users to easily piece together the consolidated law jigsaw from the unconsolidated acts and statutory instruments.

  2. Various organisations, such as the Practical Law Company, the Welsh Government (they want to sort out Welsh language law; nobody commercial can be bothered), the Department for Work and Pensions (they make legal guides for tens of thousands of their staff, and so can’t afford the commercial providers) and a couple of other commercial providers (I’ll let John name names, as some that he mentioned to me aren’t fully announced yet) decide they want to contribute.

  3. They pay for some staff to work on it full time. The staff are trained initially by the National Archives, and work for the contributing organisation. There are currently about 30 in total. For example, Practical Law employ 14 people to do this stuff. There’s a queue; they can’t train new people fast enough to meet demand.

  4. The staff fix up the open data. It appears on legislation.gov.uk, as well as in XML files and via a SPARQL endpoint (see the sketch after this list).

  5. Profit. No really, this is a better business model than stealing underpants. For example, Practical Law release new products built on top of the now lovely clean, free data (such as the Companies Act they mention above).
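To make step 4 concrete: legislation.gov.uk serves each item of legislation in machine-readable form, so the consolidated data really is there to be reused. A hedged sketch of fetching the XML for the Companies Act 2006 mentioned above – the /data.xml URL pattern reflects the site’s published API, but verify it against the current documentation before relying on it:

```python
import requests
import xml.etree.ElementTree as ET

# legislation.gov.uk serves each item of legislation as XML by appending
# /data.xml to its URI; the Companies Act 2006 is ukpga/2006/46.
# (URL pattern as documented at the time of writing; verify before relying on it.)
URL = "https://www.legislation.gov.uk/ukpga/2006/46/data.xml"

resp = requests.get(URL, timeout=60)
resp.raise_for_status()

root = ET.fromstring(resp.content)
# The markup is namespaced (Crown Legislation Markup Language plus Dublin Core),
# so just report what came back rather than assuming element names.
print("root element:", root.tag)
print("bytes of consolidated XML:", len(resp.content))
```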

The National Archives team were marking up 10,000 effects (i.e. patches of one bit of law over another) per year all by themselves. With 15,000 new effects being passed by Parliament each year, they were rapidly getting deeper into debt. Now they’ve improved the process, and have the growing help of industry and other parts of Government; in just one year the basic metadata is done for it all. They aim to have fully caught up by 2015, including secondary legislation. Come the next Parliament, all laws should appear consolidated on the site – and anywhere else that wants it – in real time.

Saves money and improves lives

It’s win win win win. Well, unless you’re one of the two companies with a proprietary version of the database. Although they don’t seem too unhappy about it – for example, Westlaw has contributed electronic versions of pre-war Statutory Instruments that the Government had lost. In the future there will be even more cost savings. For example, tens of millions are spent each year by the Court Service buying back proprietary copies of the laws they have to enforce. That could end when the open statute law database is fully finished in 2015.

However, as ever with public interest activity on the Internet, the real benefit is hidden and subtle. John explained to me that every month about 2 million people land on legislation.gov.uk after searching for things like “allotments act 1950” in search engines. Most of them are non-lawyer professionals – HR, company secretaries, police officers. Better open legal data will help them do their job more effectively and in less time. The next large user base is concerned citizens defending their own rights – for example, a mother fighting with her local authority over statementing of her child. Giving them clear access to the law boosts their credibility with the authorities, and helps to make an otherwise messy dispute rules-based and easier to resolve.

The lesson for open data projects

As well as being just brilliant, this story has torn a blindfold off a once-baffled me. Why why why are there no collaborative open data curation projects? Zarino Zappia, who works for my company ScraperWiki, did a whole thesis at the Oxford Internet Institute hunting for such projects. He couldn’t find any. I now think the problem with the other nascent projects was that they didn’t include the upstream source (i.e. the National Archives in this case). Upstream helps in two ways:

  1. It acts as a strong power to set up the project. Setting it up was both hard and expensive. In theory the Practical Law Company could have done this, but in practice the economic gain for them alone wouldn’t have been enough.
  2. The original source is being fixed. It’s hard to overstate how much better that is than tidying up a downstream copy (I know, from making things like TheyWorkForYou and ScraperWiki). It’s technically and procedurally much less complicated, and it gives a strong provenance and trust that simply cannot be earned any other way.
Open source projects have different needs to get going. Open data curation is truly unique: you need both the data provider and commercial contributors for a sustainable project.

What data next?

I would like to see the same model applied to other open data sets. How about…
  1. Fine grained inflation data. Apparently somebody external offered to help the ONS improve the way they publish it, but was turned down. Perhaps now, with a successful example elsewhere in Government, this can happen.
  2. Department for Transport data, such as public transport timetables. There’s some collaboration around this already, but I would love to see the Government crowd-sourcing accurate fixes so that the data becomes perfect (with Google, Apple and FixMyTransport all contributing!).
  3. Parliamentary debates. I know several organisations (some commercial, some charitable) who curate that data, which is increasingly a commodity. Parliament itself wants to publish it better. A project run between them all would be very powerful.
I’m sure you can think of many more. And here’s the kicker. Jeni has just been appointed Technical Director of The Open Data Institute, where she is going to work out how to kickstart a flurry of such successful open data projects. Today our law. Tomorrow the world.

From CMS to DMS: C is for Content, D is for Data

- March 9, 2012 in Featured, Ideas, Open Standards

This is a joint blog post by Francis Irving, CEO of ScraperWiki, and Rufus Pollock, Founder of the Open Knowledge Foundation. It’s being cross-posted to both blogs.

Content Management Systems, remember those?

It’s 1994. You haven’t heard of the World Wide Web yet. Your brother goes to a top university. He once overheard some geeks in the computer room making a ‘web site’ consisting of a photo tour of their shared house. He thought it was stupid; Usenet is so much better. The question – in 1994, did you understand what a Content Management System (CMS) was? In the intervening years, CMSs have gone through ups and downs. Building massive businesses, crashing in the .com collapse. Then a glut, with web design agencies all building their own CMS in the early noughties. Ending up with the situation now: a mature market, commoditised by open source WordPress. Anyone can get a page on the web using Facebook. There’s still room for expensive, proprietary players; newspapers custom-make their own, and businesses have fancy intranets.

Data Management Systems, time to meet them!

DMSs are also called “data hubs”.

It’s 2012. You’ve just about heard of Open Data. Your nephew researches the Internet at a top university. He says there’s no future in Open Data – no communities have formed round it. Companies aren’t publishing much data yet, and Governments publish the wrong data, reluctantly. The question – what is a Data Management System (DMS)? There isn’t a very good one yet. We’re at about where CMSs were in the mid-1990s. Most people get by fine without them. Just as then we wrote HTML in text files by hand and uploaded it by FTP, now we analyse data on our laptops using Excel and share it with friends by emailing CSV files. But eventually using the filesystem and Outlook as your DMS stretches to breaking point. You’ll need a proper one. Nobody really knows what a proper one will look like yet. We’re all working on it. But we do know what it will enable.

What must a DMS do?

All the things people expect a DMS to do!

A mature DMS will let people do all the following things. Whether as a proprietary monolith, or by slick integration across the web:
  • Load and update data from any source (ETL)
  • Store datasets and index them for querying
  • View, analyse and update data in a tabular interface (spreadsheet)
  • Visualise data, for example with charts or maps
  • Analyse data, for example with statistics and machine learning
  • Organise many people to enter or correct data (crowd-sourcing)
  • Measure and ensure the quality of data, and its provenance
  • Permissions; data can be open, private or shared
  • Find datasets, and organise them to help others find them
  • Sell data, sharing processing costs between users
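As a toy illustration of the first two bullets – loading data from a source and storing it indexed for querying – here is a minimal sketch using Python’s standard library. The file name, table name and the choice of sqlite are assumptions for illustration, not a description of any particular DMS:

```python
import csv
import sqlite3

# Toy ETL: load a CSV and store it in an indexed, queryable form.
# File name and column names are placeholders; assumes a non-empty CSV.
def load_csv_into_sqlite(csv_path, db_path="datahub.db", table="dataset"):
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    columns = list(rows[0].keys())

    con = sqlite3.connect(db_path)
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    con.execute(f'CREATE TABLE IF NOT EXISTS {table} ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    con.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        ([row[c] for c in columns] for row in rows),
    )
    # Index the first column so the stored dataset is cheap to query.
    con.execute(f'CREATE INDEX IF NOT EXISTS idx_first ON {table}("{columns[0]}")')
    con.commit()
    return con

# con = load_csv_into_sqlite("downloaded_open_data.csv")
# print(con.execute("SELECT COUNT(*) FROM dataset").fetchone())
```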
If it sounds like a fat list for a product, that’s because it is. But sometimes the need, the market, pulls you – something simple just won’t do. It has to do or enable, as best it can, everything above. (Compare it to the same list for CMSs.) In short, it’s what the elite data wrangling teams inside places like Wolfram Alpha and Google’s Metaweb do, but made easier and more visible using standardised tools and protocols.

Who’s making a DMS?

More people than I realised. From the largest IT company to the tiniest startup. Here are some I know about; mention more in the comments:
  • Windows / OSX (+ Excel / LibreOffice / …) – the desktop serves as a (good enough so far) DMS
  • CKAN software – started as a data catalog, but has grown into more and powers the DataHub, a community data hub and market. Created by the Open Knowledge Foundation
  • ScraperWiki – coming from the viewpoint of a programmer, good at ETL
  • Infochimps/DataMarket – approaching it as a data marketplace
  • BuzzData – specialising in the social aspects
  • Tableau Public – specialising in visualisation
  • Google Spreadsheets – coming from the web spreadsheet direction
  • Microsoft Data Hub – corporate information management
  • PANDA – making a DMS for newsrooms
They’re all DMSs because they all naturally grow bad versions of each other’s features. Two examples. ScraperWiki is particularly good at complex ETL (loading data into a system), yet every DMS has to have a data ingestion interface, even if it’s just choosing CSV columns. CKAN is particularly good at metadata, usage and provenance, yet every DMS has to have a way for people to find the data stored in it.

So will they be giant monolithic bits of software?

We standardised the shipping container; can we standardise data interoperation?

We hope not! That didn’t turn out great for CMSs, although there are some businesses providing that. CMSs only really came of age when, in the mid-noughties, everyone realised that WordPress (open source blogging software!) was a better CMS than most CMSs. It’s in everyone’s interest that users aren’t locked into one DMS. One of them might have a whizzy content analysis tool that somebody who has data in another DMS wants to use. They should be able to, and easily. OKFN is about to launch a standards initiative to bring together such things. It’s called Data Protocols. So far the clearest needs are twofold and mirror each other – pulling and pushing data:
  a) a data query protocol/format to allow realtime querying, for example for exploring data. Imagine a Google Refine instance live querying a large dataset on OKFN’s Data Hub.
  b) a data sync protocol/format akin to CouchDB’s protocol. It would let datasets get updated in real time across the web. Imagine a set of scrapers on ScraperWiki automatically updating a visualisation on Many Eyes as the data changed.
Later, even more imaginative things… I reckon Google’s Web Intents could be used to make the whole user experience slick when using multiple DMSs at once. And hopefully somebody, somewhere is making a simplified version of SPARQL/RDF, just as XML simplified SGML and then really took off. Enough of me! What do you think? Join in. Make standards. Write code. Leave a comment below, and join the data protocols list.

“Should Britain flog off the family silver to cut our national debt?”

- March 14, 2011 in External, Open Data

The following post is from Francis Irving, CEO of ScraperWiki.

‘Should Britain flog off the family silver to cut our national debt?’ – that’s the question the UK current affairs documentary Dispatches tackled last Monday. ScraperWiki worked with Channel 4 News and Dispatches to make two supporting data visualisations, to help viewers understand what assets the UK Government owns. This blog post tells you a bit about the background to them – where the data came from, what it was like, and how and why we made the visualisations.

1. Asset bubbles

Inspired by Where Does My Money Go’s bubble chart of public spending, the first is a bubble chart of what central Government owns. We couldn’t find any detailed national asset register more recent than 2005 (assembled in the National Asset Register 2007). With a good accounting system, and properly published data all the way through Government, such a thing would constantly update. In some ways there is less need for drill-down than with Government spending. There isn’t the equivalent problem of wanting to know who the contractor for some spending is, or to see the contract. Instead, you want to know assessments of value, and what investments could do to that value, as well as the strategic consequences of losing control of the asset – detailed information that perhaps the authorities themselves often don’t have.

The PDFs were mined by hand (by Nicola) to make the visualisation, and if you drill down you will see an image of the PDF with the source of the data highlighted. That’s quite an innovation – one of the goals of the new data industry is transparency of source. Without knowing the source of data, you can’t fully understand the implications of making a decision based on it.

Julian used RaphaelJS to code the bubbles (source code here). You can think of it as “jQuery for in-browser SVG”. Amazingly, it even works in (most) versions of Internet Explorer (using a compatibility layer via VML). This has some advantages over Flash – you at least get iPad compatibility. It’s also easier for people with other web skills to maintain than Flex, plus people can “view source” and learn from each other just like in the good old days of the web. That said, on the down side, CSS compatibility with the stylesheets of the site it is embedded in was a pain. We had to override a few higher level styles (e.g. background transparency) to get it to work. Perhaps next time we should use an iframe :)

2. Brownfield sites

The second is a map of brownfield land owned by local councils in England. Or at least, that they owned in 2008. There isn’t a more recent version, yet, of the National Land Use Database. One of the main pieces of feedback we got was people frustrated that we didn’t have up-to-date, or always complete, data. There is definitely an expectation in the public that something as basic as what the Government owns should be available in an online, up-to-date fashion.

The dataset is compiled by the Homes and Communities Agency, who have a goal of improving use of brownfield land to help reduce the housing shortage. This makes it reasonably complete, covering the whole of England. That’s important, as it gives everyone a good chance that they will find something near them. The data is prepared by local authorities and sent to the agency as an Excel or GIS file (see the guidance notes linked near the bottom of this page). Depending on where you live, the detail and thoroughness will vary.
The same dataset contains lots of information about privately owned land, but we deliberately only show the local-authority-owned land, as the Dispatches show was about what the state could sell off. It’s quite interesting that a dataset gathered for the purposes of developing housing is also useful, as an aside, for measuring what the state owns. It’s that kind of twist in the use of data that really requires understanding of the data’s source. The actual application is fairly straightforward Google Maps API and jQuery, although as with the asset bubbles, Zarino made it look and behave fantastically. The main innovative thing is that it tells a story about each site, constructed from the dataset. For example, what was originally quite a hard-to-read line in an Excel file comes out as:
JUNCTION OF PARK ROAD, NORTHUMBERLAND STREET Liverpool City Council own this brownfield land. This site was dwellings and is now derelict. It is proposed that it is used for housing. Planning permission is detailed. A developer could build an estimated 14 homes here, selling for £1,820,000 (if they were at £130,000 per home, the median North West price).
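A minimal sketch of how such a sentence can be templated from one row of the land-use data. The column names are assumptions; the 14 homes and the £130,000 median price are taken from the example above:

```python
# Minimal sketch of turning one row of the land-use spreadsheet into a
# readable sentence, as in the example above. Column names are assumptions.
MEDIAN_PRICE = 130_000  # median North West home price used in the example

row = {
    "site_name": "JUNCTION OF PARK ROAD, NORTHUMBERLAND STREET",
    "owner": "Liverpool City Council",
    "previous_use": "dwellings",
    "proposed_use": "housing",
    "planning_status": "detailed",
    "estimated_homes": 14,
}

def site_story(row, median_price=MEDIAN_PRICE):
    value = row["estimated_homes"] * median_price
    return (
        f'{row["site_name"]}: {row["owner"]} own this brownfield land. '
        f'This site was {row["previous_use"]} and is now derelict. '
        f'It is proposed that it is used for {row["proposed_use"]}. '
        f'Planning permission is {row["planning_status"]}. '
        f'A developer could build an estimated {row["estimated_homes"]} homes here, '
        f'selling for £{value:,} (at £{median_price:,} per home).'
    )

print(site_story(row))
```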

Nicola did a lot of testing to make the wording as natural as possible, although we could have done even more. You can see the source code here. We think of these paragraphs as mini constructed stories, local to the viewer – a kind of visualisation as text.

Conclusion

This kind of visualisation, helping a viewer dig into the details of an overall story or theme that they are most interested in, is just the start of how the use of (open!) data can help media organisations. I’d like to see more work to integrate the data early on in the development of stories – so it acts as another source, finding leads in an investigation. And I think there are lots of opportunities for news organisations to build ongoing applications, which build audience, revenue and personal stories even when the story isn’t in the 24-hour news cycle. See also Nicola’s post 600 Lines of Code, 748 Revisions = A Load of Bubbles on the ScraperWiki blog. Related posts:
  1. Hacks and Hackers, Birmingham, 23rd July 2010
  2. UK National Statistics: Are They Open or Not?
  3. Opening up linguistic data at the American National Corpus