You are browsing the archive for OpenRefine.

An Introduction to Mapping Company Networks Using Gephi and OpenCorporates, via OpenRefine

- November 15, 2013 in gephi, Infoskills, OpenCorproates, OpenRefine, recipe

As more and more information about beneficial company ownership is made public under open license terms, we are likely to see an increase in the investigative use of this sort of data. But how do we even start to work with such data? One way is to try to start making sense of it by visualising the networks that reveal themselves as we start to learn that company A has subsidiaries B and C, and major shareholdings in companies D, E and F, and that those companies in turn have ownership relationships with other companies or each other. But how can we go about visualising such networks?! This walkthrough shows one way, using company network data downloaded from OpenCorporates using OpenRefine, and then visualised using Gephi, a cross-platform desktop application for visualising large network data sets: Mapping Corporate Networks – Intro (slide deck version). The walkthrough also serves as a quick intro to the following data wrangling activities, and can be used as a quick tutorial to cover each of them.
  • how to hack a web address/URL to get data-as-data from a web page (doesn’t work everywhere, unfortunately;
  • how to get company ownerships network data out of OpenCorporates;
  • how to download JSON data and get it into a nice spreadsheet/tabular data format using OpenRefine;
  • how to filter a tabular data file to save just the columns you want;
  • a quick intro to using the Gephi netwrok visualisation tool;
  • how to visualise a simple date file containing a list of how companies connect using Gephi;
Download it here: Mapping Corporate Networks – Intro. So if you’ve ever wondered how to download JSON data so you can load it into a spreadsheet, or how to visualise how two lists of things relate to each other using Gephi, give it a go… We’d love to hear any comments you have on the walkthrough too, (what you liked, what you didn’t, what’s missing, what’s superfluous, what worked well for you, what didn’t and most of all – what use you put to anything you learned from the tutorial!:-) If you would like to learn more about working with company network data, see the School of Data blogpost Working With Company Data which links to additional resources. flattr this!

Working With Company Data

- October 31, 2013 in #OGP13, DBpedia, event, Events, HowTo, opencorporates, OpenRefine

We all think we know what we mean by “a company”, such as the energy giants Shell or BP, but what is a company exactly? As OpenOil’s Amit Naresh explained in our OGP Workshop on “Working With Company Data” last week, the corporate structure of many multinational companies is a complex network of interconnected countries domiciled or registered in a wide variety of countries across the world in order to benefit from tax breaks and intricate financial dealings. Given the structure of corporate networks can be so complex, how can we start to unpick and explore the data associated with company networks? The following presentation – available here: School of Data: Company Networks – describes some of the ways in which we can start to map corporate networks using open company data published by OpenCorporates using OpenRefine. Placeholder We can also use OpenRefine to harvest data from OpenCorporates relating to the directors associated with a particular company or list of companies: School of Data: Grabbing Director Dara A possible untapped route to harvesting company data is Wikipedia. The DBpedia project harvests structured data from Wikipedia and makes it available as a single, queryable Linked Data datasource. An example of the sorts of network that can be uncovered from independently maintained maintained Wikipedia pages is shown by this network that uncovers “influenced by” relationships between philosophers, as described on Wikipedia: WIkipedia philosophers influence map See Visualising Related Entries in Wikipedia Using Gephi and Mapping Related Musical Genres on Wikipedia/DBPedia With Gephi for examples of how to generate such maps directly from Wikipedia using the cross-platform Gephi application. For examples of the sorts of data available from DBpedia around companies, see: Using Wikipedia – or otherwise hosted versions of the MediWiki application that Wikipedia sits on top of – there is great potential for using the power of the crowd to uncover the rich network of connections that exist between companies, if we can identify and agree on a set of descriptive relations that we can use consistently to structure data published via wiki pages… flattr this!

Using OpenRefine to Clean Multiple Documents in the Same Way

- July 26, 2013 in OpenRefine, OpenSpending, School_Of_Data

When working with data that is published on a monthly basis according to the same template, it is often the case that we need to apply the same data cleaning rules to the data each time a new file is released. This recipe shows to use OpenRefine to create a reusable script for cleaning data files that get published month on month according to the same schedule. To make things a little more concrete, consider this example. Under UK transparency regulations, local councils publish spending data for amounts over £500 on a monthly basis: IW council transparency data To get to the actual download link for the CSV data files requires another click… (example CSV file, if you want to play along…;-) IW council transparency data download link If we want to get this data into a site such as, we need to do a little bit of tidying of the data to get it into the format that OpenSpending expects (see for example The OpenSpending data format). As each new file is released, we need to to clean it as we have cleaned the files before. If you are a confident programmer, you could write a script to handle this process. But what can the rest of us to to try to automate this process and make life a little easier. One way is to use OpenRefine’s ability to “replay” cleaning scripts that you have generated before. Here’s an example of how to do just that… Let’s start by loading the data in from a CSV file on the Isle of Wight local council website – we need to copy the download link URL ourselves: OPenRefine - import from CSV The recent spending files are in a common format (that is, they have been published according to the same template), which is something we’ll be relying on, so we could load multiple files in at once into one big data file, but in this case I’m going to take each file separately. OpenRefine makes a good guess at the file format. One minor tweak we might make is to ignore any blank lines (OpenSpending doesn’t like blank lines!). openrefien import csv settings Here’s what the data looks like once we import it: preview the datai in openrefine OpenSpending expects the data to be presented in a particular way, which is why we need to clean the data a little before we can upload it. For example, OpenSpending likes column names that are free of spaces and punctuation; it requires an amount column that just contains numbers (so no commas in the number to make it more readable!); it requires dates in the format 2013-07-15 (that is, the yyyy-mm-dd format) (and I think it needs this column to be called time?). Here’s how we can rename the columns: openrefien - rename column Rename each column in turn, as required – for example, remove any punctuation, use camelCase (removeing spaces and using capital letters to make work boundaries), or replace spaces with underscores (_). Let’s look at the amount column – if we select the numeric facet we can see there are lots of things not identified as numbers: open refine - check amount as numeric We can preview what the non-numeric values are so we can set about tidying them up… Openrefine look for non-numerics So commas appear to be the major issue – let’s remove them by transforming cells in that column that contain a comma by removing the comma. openrefine transform We can do this by replacing a comma whenever we see one with nothing (that is, an empty character string) – value.replace(',','') OpenRefine  - remove comma Let’s see what effect that has on the numeric facet view: Open Refine amount after cleaning Well that seems to have worked – no non-numerics any more… (We might also use the sliders in the numeric facet to check the outlying values to see if they are plausible, or look like they may be errors.) As far as a the dates go, we have dates in the form 17.04.2013 rather than 2013-04-17 so let’s transform them into the required format. However, because there is more scope for things going wrong with this transformation, let’s put the transformed data into a new column: open refine  - add column Here’s how we define the contents of that column: oen refine date transfrom That is: value.toDate('').toString('yyyy-mm-dd') The first part – value.toDate('') interprets the string as a date presented in a given format, and then transforms that data into the rewquired date format: .toString('yyyy-mm-dd') We can quickly preview that this step has worked by transforming the column to a date type: open refine trasnform to date and then preview it using a timeline facet to check that all seems in order. OPenrefien timeline facet (If there were non-date or error elements, we could select them and look at the original date and transformed date to see where the problem may lie.) We don’t want the date formatting for out OpenSpending data file, so we can undo the step that transformed the data into the timeline viewable date format: openrefine history So now we have our cleaned data file. How can we apply the same steps to another month? If you look at the history tab, you will see it offers an “Extract” option – this provides a history of the change operations we applied to the dataset. If we copy this history to a text file somewhere, we can then make use of it again and again. open refine extract To see how, open another OpenRefine project and import the data for another month (such as this one). When you have created the project, click on the Undo/Redo history tab and select Apply: open refine  - appply previous transformations Paste in the transformation script we grabbed from the previous project: open refne - extract reuse Here’s the script I used – When we apply the script, the data is cleaned using the same operations as previously: open refine cleanded as if by magic That is, as long as the new data file is in the same format as the previous one, and only requires the same cleaning operations, we don’t really have much to do – we can just reuse the script we generated the first time we cleaned a file of this type. And each time a new file is published, we have a quick recipe for cleaning it up so we can get it into OpenSpending:-) Finally, we can export the data for use as required… open refine export data Magic:-)
flattr this!

OpenRefine/LODRefine – A Power Tool for Cleaning Data

- July 15, 2013 in Data Cleaning, OpenRefine

In Even the Simplest Data Needs Cleaning Sometimes… A (Slightly Broken?!) Example Using Google Spreadsheets, we saw how Google Spreadsheets could be used as a tool for cleaning a simple dataset, a table of literacy rates around the world grabbed from a Wikipedia page. WHilst spreadsheets can be used to tidy a dataset, they’re not necessarily the best tool for it. In this post, we’ll see how a more powerful tool – OpenRefine – can help us clean a data set in a more controlled way. OpenRefine – formerly known as Google Refine – is a power tool for cleaning data. As an opensource project, the code can be reused in other projects. In this example, I’ll be using LODRefine, a version of OpenRefine that has several optional OpenRefine extensions for working with linked data built in to the application (though we won’t be using any of those features in this example; OpenRefine will work just as well). The data set we’ll be using is the same as the dataset used in the aforementioned post, a data table copied from a Wikipedia page on literacy rates. WIkipedia table grab copy If we launch OpenRefine/LODRefine (which I’ll refer to as just “Refine” from now on), we can paste the data copied from the Wikipedia page in to an import form directly: LOD refine paste from clipboard If we now import the data, a wealth of configuration options are presented to us. In this case, by inspection of the data, we need to import the data as tab separated data, and also ignore the first line of data so that the correct column headings are used: LODrefine - create project config If we now create the project, the data is imported as a table: LOD Refine data loaded As we did in the earlier post, lets tidy the data by removing percentage signs from the numerical columns. To do this, we select the Transform tool from the drop down menu in the column we want to tidy: LOD Refine transform The operation we’re going to do is to replace the percentage sign with nothing (that is, and empty string). We use a simple text command to achieve this (value refers to the value of a cell in the selected column; the command is applied to each row in turn). LODRefine value replace Here’s the result when we apply the transformation: Refine aftere the replace We can improve the quality of this data a little more by telling Refine that the entries in the column are numbers, rather than strings of alphanumeric characters: Refine convert to number Colour coding shows us that Refine now recognises the numbers as numbers: Refine as number We can now start to apply a similar cleaning step to the other columns, reusing the transform we have just created: Refine reuse trasnform We can also build in the transformation to a number as an additional command, rather than having to do this step via the menu: Refine replace and cast to number You will notice that there appears to be at least one cell that has additional mess in the cell. Go with it anyway. Where there is an error in a step, the original cell value (the one with the % sign) will be returned. We don’t know if other cells might also have returned an error, so after applying the transform we can filter the column to show only those rows where the column value contains a % sign: Filter on % It seems as if there was just one rogue cell. Let’s fix it directly – edit the cell: refine edit cell Delete the rogue characters: Refine cell edit detail And set as a number type: refine cell edit type cast Let’s now do a bit of tidying around the Criteria column. To do this, we will keep the original column and generate a new column based on it: LOD refine add column In this case, I am going to use a regular expression (as we encountered in the earlier post), to edit a cell’s contents by deleting a particular pattern set of characters. Note that in the replace command we use slashes ( / .. / ), rather than quotes, to identify the pattern we want to replace: LODRefine regex What this expression does is look for the pattern can read and write followed by any number of characters, and if it finds that pattern it replaces it with an empty string (i.e. with nothing; it deletes the matched pattern). LOD Refine not quite right.. Okay – so this has mostly worked… But there are some rows where there was no pattern match and the original value has been copied over. One way round this would be to filter the original Criteria column to show only the rows that do match the pattern, and then apply the transformation: If we select the Filter option from the menu at the top of the Criteria column, we can then create a filter to just pull out the pattern matching rows: LOD Refine filter before create column If we now generate the new column on this filtered subset, then remove the filter, we see we have only processed the matched rows: LOD REfine filtered column replace We can process the Criteria column in other ways too. For example, we might want to create a new column that contains the date around which the data was collected: LOD refine start to look at extracting date Or we might want to create a column that contains just the age, as a number, derived from our newly created Age2 column. (Can you see how to do that? You need to clean the column to just leave the number, then cast it to a number type.) As well as providing these powerful data cleaning options (as well as many more!), Refine also keeps tracks of the steps you have followed: Refine history If you make a mistake as you work through the dataset, you can always step back. When you save a Refine project, the original data and the history of changes will be available to you. Of course, you also want to be able to get you cleaned data set out – the Export menu is the place to do that. Refine export It gives you a wealth of options for exporting the data, including save to file (in a variety of formats) and upload to Google Spreadsheets. OpenRefine (and extended clones such as LODRefine) offers a powerful tool for cleaning messy datasets. Whilst it may appear complex to use at first, it offers far more control than more general purpose tools, such as spreadsheets or text editors. flattr this!

Analysing UK Lobbying Data Using OpenRefine

- June 4, 2013 in Data Cleaning, OpenRefine

Being able to spot when we might might be able to turn documents into datasets is a useful skill for any data journalist or watchdog to develop, In this (rather long!) practical walkthrough post, we’ll see how we can start to use OpenRefine to turn a set text based forms into data. Specifically, given a piece of text that details the benefits awarded to a political group from a set of possible lobbying interests, how can we pull out the names of some of the lobbiests involved, along with some of the financial amounts they have donated, and turn it into data we can start to work with? For example, in a piece of text that has the form:
Lobby Group X (a not-for-profit organisation) acts as the group’s secretariat. Company A paid £3200 towards the cost of a seminar held in May 2012 and Another Company paid £1200 towards the cost of a reception held in January 2013 (registered April 2013).

how could we pull out the name of the organisation providing secretariat support, or the names of the financial benefactors, the amounts they provided and the reason for the donation? In this post, we’ll see how we can use OpenRefine to start pulling out some of this information and put it into a form we can use to start looking for connected or significant interests. The screenshots used to illustrate this post can be found in this photo set: School of Data – Open Refine – All Party Groups The context for this investigation is a minor political scandal that broke in the UK over the last few days (Patrick Mercer resignation puts spotlight on lobbying). In part, the event put the spotlight onto a set of political groupings known as All-Party Groups, informal cross-party or subject or topic specific interest groups made up from members of both houses of the UK parliament (see for example the list of All Party Groups). Many All Party Groups are supported by lobby groups, or other organisations with a shared interest. Support may take the form of providing secretariat services, making financial donations to support the group’s activities, or covering the costs of associated travel and accommodation expenses. As such, there has always been the risk that the groups might play a role in lobbying scandal (for example, APPGs – the next Westminster scandal?). The UK Parliament website publishes transparency information that details the officers of the group and the names of twenty founding members (though the group may have many more members), along with disclosures about benefits received by the group. The information is published as a set of web pages (one per group) as well as via a single PDF document. example EPG page A newly formed website, Allparty, has started teasing out some of this data, including trying to break out some of the benefits information and organising groups according to meaningful subject tags, although as yet there is no API (that no, no programmable way) of accessing or querying their datasets. A scraper on scraperwiki – David Jones’ All Party Groups – makes the raw data for each group available, as well as detailing group membership for each MP or Lord. APG scraper on scraperwiki The Scraperwiki API allows us to interrogate this data using structured queries of the described in the School of Data post Asking Questions of Data – Some Simple One-Liners. However, it does not break down the benefits information into a more structured form. So how might we start to pull this data out? From the download link on the Scraperwiki scraper, we can get a link to a CSV file containing the data about the All Party Groups. We can use this link as a data source for a new OpenRefine project: openrefine import data csv If the CSV format isn’t detected automatically, we can configure the import directly before we create the project. openrefine csv import config Having got the data in, let’s start by trying to identify the groups that provide secretariat support. If we don’t get them all, it doesn’t matter so much for now. The aim is to get enough data to let us start looking for patterns. We can see which rows have a benefit relating to secretariat support by filtering the Benefits column. open refine text filter If we look at some of the descriptions, we see there is a whole host of variations on theme… how many ways of providing secretariat So how might we go about creating a column that contains the name of the organisation providing secretariat support? open refine add column based on column “Parsing” Data using GREL Replace Expressions When it comes to defining the contents of the new column, we might notice that the Benefits descriptions often start with the name of the organisation providing secretariat services. We could use the GREL expression language to simplify the text by spotting certain key phrases and then deleting content from there on. For example, in the sentence: Confederation of Forest Industries (a not-for-profit organisation) acts as the groups secretariat. we could just delete ” acts as the groups secretariat.”. The GREL expression value.replace(/ acts as the groups secretariat./,'') replaces the specified phrase with nothing (that is, an empty string). (Note that the “.” represents any single character, not just a full stop.) By recognising patterns in the way Benefits paragraphs are structured, we can start to come up with a crude way of parsing out who provides the secretariat. openrefine secretariat This leaves us with a bit of a mess if there is text following the deleted phrase, so we can instead delete any number of characters (.) following the phrase using value.replace(/ acts as the groups secretariat./,''). openrefine secretariat 2 We also notice that there are other constructions that we need to account for… We can start to knock these out in a similar way by adding additional replace elements, as this construction shows: openrefine secretariat 3 That is, value.replace(/ act as the groups secretariat./,'').replace(/ provides secretariat./,'') If we look through some of the paragraphs that aren’t tidied up, we notice in some cases there are constructions that are almost, but don’t quite, match constructions we have already handled. For example, compare these two: acts as the groups secretariat.
to act as the groups secretariat.
We can tweak the replace expression to at least identify either “act” or “acts” by telling it to look for the word “act” optionally followed by the character s (s?), that is, value.replace(/ acts? as the groups secretariat./,'').replace(/ provides secretariat./,'') openrefine secretariat 4 Let’s stick with this for now, and create the column, Secreatariat, looking at the unique values using a text facet. If we sort by count we see that we have already started to identify some groups that support several of the groups. starting to explore the secretariat If we look through the other values, we see there is still quite a bit of work to be done. tidying work to be done If we go back to the Benefits column and create another new column based on it, Secretariat v2, we can reuse the replace expression we started to build up previously and work on a new improved secretariat column. openrefine grel reuse Alternatively, we can rethink our replace strategy by looking for key elements of the construction. There are various constructions based on “act” or “acts” or “provide” for example, so we can try to knock those out in a rather more aggressive way, also getting rid of any ” to” statement at the end of the phrase: value.replace(/ act.secretariat./,'').replace(/ provide.secretariat./,'').replace(/ to/,'') simplified replace Looking through some of the phrases we are left with, there is a noticeable number of the form Company A (a consultancy) is paid by its client, Company B, to which are derived from phrases such as Company A (a consultancy) is paid by its client, Company B, to act as the groups secretariat. We could create a second column from the Secretariat v2 column that contains this information. The first thing we’d need to do is identify who’s paying: value.replace(/.* is paid by its clients?,? /,'') We can tighten this up a little by splitting it into two parts to cope with the optional “its clients?” statement: value.replace(/.* is paid by /,'').replace(/its clients?,? /,'') The first part of this command gets rid of everything up to (.*) and including is paid by; the second part deletes its client with an optional s (s?) followed by an optional comma and then a space. To only copy third party funders names across to the new column, we test to make sure that the replace was triggered, by testing to see if the proposed new column entry is different to the contents of the original cell, adding a further replace to tidy up any trailing commas if(value.replace(/.* is paid by /,'').replace(/its clients?,? /,'')!=value,value.replace(/.* is paid by /,'').replace(/its clients?,? /,''),'').replace(/,$/,'') secretariat funders (I wonder if there is a tidier way of doing this?) Let’s now create another column, Secretariat v3, based on Secretariat v2 that gets rid of the funders: value.replace(/,? is paid by.*/,'') delete funders If we generate a text facet on this new column, we can display all the rows that have something set in this column: open refine non-blank cells We also notice from the text facet counts that some funders appear to support multiple groups – we can inspect these directly: problem - dupe data Hmm… some of the groups have similar looking names – are these the results of a name change, perhaps, leaving stale data in the original dataset – or are they really different groups? Such is the way of working with data that has now been obtained directly from a database! There’s often always more tidying to do! If we look to see if there are any groups that appear to offer a lot of secretariat support, we notice one in particular – Policy Connect: popular secretariat We can also see that many of the so-supported groups have Barry Sheerman as a contact, and a couple declare Nik Dakin; at least two of them employ the same person (John Arnold). The point to make here is not so much that there may be any undue influence, just that this sort of commonality may go unnoticed when scattered across multiple documents in an effectively unstructured way. Whilst there is still work that could be done to further tidy the data set (for example, pulling out “on behalf of” relationships as well as “paid by” relationships), we have got enough data to start asking some “opening” structured questions, such as: are there companies that provide secretariat services for more than one group (answer: Yes, though we need to check fo duplicate group names…); what companies or groups fund other parties to provide secretariat services? (Note we may only have partial data on this so far, but at least we have some to start working with, and by inspecting the data we can see how we may need to clean it further). For example, here is a copy of a CSV file containing the data as tidied using the above recipe, and here is the OpenRefine project file. We can load the CSV file into a Google spreadsheet and start to interrogate it using the approach described in the School of Data post Asking Questions of Data – Garment Factories Data Expedition. “Parsing” Data using Jython (Python) Regular Expressions
In the previous section we saw how we could use GREL expressions to start parsing out data from a text paragraph. In this section, we’ll see how we can use Jython (that is, Python) for a similar purpose. To start with, here’s a quick example of how to pull out the names of groups providing secretariat services, as we did using the GREL script: import re
tmp=re.sub(r'(.) ( to )?(provide|act)+[s]?.secretariat.',r'\1',tmp)
) (is paid|on behalf of).*',r'\1',tmp)
if value==tmp:tmp=''
return tmp
Jython secretariat The first line (import re) loads in the required regular expression library. For convenience, we assign the contents of the cell to a tmp variable, then we look for a string that has the following structure:
  • .* – any number of characters
  • ( to )? – followed by a space character and optionally the word to
  • (provide|act)+[s]? – followed by either the word provide or act with an optional s
  • .*secretariat.* – followed by any number of characters, then the word secretariat then any number of characters.
The re.sub() function has the form re.sub( r'pattern to match',r'result of this expression',STRING_INPUT). In the script above, if the pattern is matched, tmp is set to the value of whatever is contained in the first pair of brackets of the expression being matched (\1). We can also use regular expressions to pull out the funding groups (Python regular expressions – documentation). So for example, do the initial configuration: import re
In the following expression, we take whatever matches in the third set of brackets. This broadly matches patterns of the form “X is paid by Y to provide the secretariat” and allows us to extract Y. tmp = re.sub(r'(.) (is paid by|on behalf of)(.) to (provide|act).secretariat.',r'\3', tmp) This results in some strings of the form: “its client, Transplant 2013,” which we can tidy as follows:
tmp = re.sub(r'its client[s,]+ (.*)[,]+$',r'\1', tmp) There are some constructions that are not captured, eg that take the form “X provides secretariat support on behalf of Y.” The following structure grabs Y out for these cases. tmp = re.sub(r'.secretariat.(on behalf of )([^.]).',r'\2', tmp) If we haven’t matched any payers (so the tmp string is unchanged), return a blank if value==tmp: tmp=''
return tmp
Using Jython to grab funder names Here’s how the different techniques compare: Comparing performance Remember that what we’re trying to do is structure the data so that we can start to run queries on it. Looking at the benefits, we notice that some groups have received financial support from different groups. For example, the All-Party Parliamentary Gas Safety Group has declared the following benefits:
Policy Connect (a not-for-profit organisation) provides secretariat services to the group. 3000 from the Council of Gas Detection and Environmental Monitoring; 7500 from the Energy Network Association; 6000 from Energy UK; 9000 from the Gas Industry Safety Group; 10,000 from the Gas Industry Safety Group (registered October 2011).
One of the things we might want to do in this case is pull out the amounts awarded as well as the organisations making the donation. Then we can start to count up the minimum benefits received (if we miss some of the data for some reason, we won’t have the total record!) as well as the organisations financially supporting each APG. If we look at the benefits, we see they tend to be described in a common way – N,NNN from the Company X; MMMM from the Charity Y. Here’s one way we might start pulling out the data: import re
m=re.findall(r'\d[,0-9]+ from the [^;.]*',tmp)
return m
*Note: it may be better to omit the the from that pattern… For example, our expression will miss “5000 from Bill & Melinda Gates Foundation (registered July 2012)”. Group payments A good example to test would be the one from the All Parliamentary Group on Global Health, which also includes constructions such as 5000 from the University of Manchester and 3000 from the Lancet (registered August 2012), which the expression above will identify as a single item, and 5000 from each of the following: Imperial College, Kings Health Partners, London School of Hygiene and Tropical Medicine, Cambridge University Health Partners (registered May 2012). which (if we fix the the problem would confuse it thoroughly! If we return the items from the list as a single string, we can then split this list across multiple rows – change the last line to return '::'.join(m): Joined items With the data in this form, we can generate multiple rows for each group, with one record per payment. Let’s join the separate items from the list using an easy to identify separator (::): Split multi cells Using a similar approach to one we used before, use a text facet to only show rows with a value in the Group Payments column: filter on non-blank again We can then fill values down on required columns to regenerate complete rows: fill down a column Note that if filled down the columns on the whole dataset, we would incorrectly start filling values in on rows that were not generated from splitting the payments column across multiple rows. We can also create two new columns from the group payments column, one identifying the amount the other the name of the organisation that made the payment. Let’s go back to GREL to do this – first let’s pull out the amounts, and then remove any commas from the amount so all we’re left with is a number – value.replace(/ from the.*/,'').replace(',','') payment extract Let’s naively try to pull out the organisations responsible – value.replace(/.* from the /,'') payer extract NOTE: this actually causes an error… the failure to pull out atomic statements relating to payments means that if we assume sentences are of the simple form 5000 from the Essar Group (registered April 2013) rather than the more complex 7625 from the Royal Academy of Engineering, 2825 from the University of Warwick, 5000 from the Essar Group (registered April 2013) , we end up parsing out the wrong thing. error in the parsing This is one of the reasons why it’s much easier of people publish data as such! However, let’s proceed, remembering that we have broken the data, but pressing on regardless to see what we might be able to do with it if we go back and correct our errors! For example, let’s tidy up the data a bit more, casting the Payment Amount column to be a numeric type: convert to number We can then use a numeric filter to identify payments within a particular range, for example (remembering that our data may be meaningless!): numeric facet filter It’s also worth remembering the other problems and omissions we have introduced into this data – the the inclusion in the pattern recogniser for the amounts was over-zealous and ruled out a match on several organisations, for example; and as well as causing numerical errors, we have missed out information about separate payments that were anded together when they were declared in the same month. The moral of this story is – look for what the exceptions might be and try to trap them out rather than letting them contaminate your data! Like I didn’t… doh! That said, at least we’ve made a start, and can work iteratively to improve the quality of the data we have extracted – if we think it is worth spending the time doing so; we also have some data to hand that let’s us start exploring possible structured queries over the data, such as how much funding particular secretariats appear to have managed to bring in across several groups, for example, or how many benefits have been received by groups with a particular MP specified as a contact. (From the original data on Scraperwiki, we can also see which MPs or Lords are on the list of the 20 declared names for each group: do some only appear to join groups that declare benefits?!) In other words, even though we should not at this point trust any of the results of queries, because of the errors that were introduced, we can still start to explore the sorts of queries we might be able to write, which in turn can help us decide whether or not it is worth the effort cleaning the data any further… …or maybe we should have considered what sort of data model we would be able to get out and interrogate in the first place?! flattr this!