You are browsing the archive for tech.

Structuring a Global Online Survey – A Question Engine for Open Data Surveys!

- January 17, 2017 in code, Global Open Data Index, godi, local index, Open Data Index, open data survey, tech

The Global Open Data Index (GODI) is one of our core projects at Open Knowledge International. The index measures and benchmarks the openness of government data around the world. Brook Elgie shares a behind-the-scenes look at the technical design of how we gather the data for the Index through our extensive Open Data Survey, and how other organisations can use the survey codebase for their own purposes.

The Global Open Data Index Survey is an annual survey of the state of government open data around the world. It asks a series of questions about the availability and quality of a set of key datasets. As well as providing a valuable snapshot of the state of open data around the world, it also promotes discussion and engagement between government and civil society organisations. This year Open Knowledge International made changes to the methodology and structure of the survey, and it was an ideal opportunity to revisit the way questions are handled technically within the survey codebase. As well as the survey for the Global Open Data Index, the same codebase hosts surveys for ‘local’ sites, for example an individual country or city administration.

Previously, the questions presented for each dataset were a hard-coded feature of the survey codebase. These questions were inflexible and couldn’t be tailored to the specific needs of an individual site. So, while each local site could customise the datasets they were interested in surveying, they had to use our pre-defined question set and scoring mechanisms. We also wanted to go beyond simple ‘yes/no’ question types. Our new methodology required a more nuanced approach and a greater variety of question types: multiple-choice, free text entry, Likert scales, etc.

Also important is the entry form itself. The survey can be complex, but we wanted the process of completing it to be as clear and simple as possible. We wanted to improve the design and experience to guide people through the form and provide in-context help for each question.

Question Sets

The previous survey hard-coded the layout order of questions and their behaviour as part of the entry form. We wanted to abstract these details out of the codebase and into the CMS, to make the entry form more flexible. We needed a data structure to describe not just the questions, but their order within the entry form and their relationships with other questions, such as dependencies. So we came up with a schema, written in JSON. Take this simple set of yes/no questions:
  1. Do you like apples?
  2. Do you like RED apples? (initially disabled, enable if 1 is ‘Yes’)
  3. Have you eaten a red apple today? (initially disabled, enable if 2 is ‘Yes’)
We want to display questions 1, 2, and 3 from the start, but questions 2 and 3 should be disabled by default; they are enabled once certain conditions are met. Here is what the form looks like:

[Animated demo: the apples example form]

The Question Set Schema describes the relationships between the questions and their position in the form. Each question has a set of default properties, and optionally an ifProvider structure that defines conditional, dependent features. Each time a change is made in the form, each question’s ifProvider should be checked to see if its properties need to be updated. For example, question 2, apple_colour, is initially visible but disabled. It has a dependency on the like_apples question (the ‘provider’). If the value of like_apples is Yes, apple_colour’s properties will be updated to make it enabled.
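To make this concrete, here is a minimal sketch of what such a question set might look like. It is written as a JavaScript literal so it can be annotated, and the field names are illustrative assumptions; only the like_apples/apple_colour dependency and the ifProvider idea come from the description above.

  // Illustrative sketch only – the field names are assumptions, not the
  // survey's actual schema (which is defined in JSON in the CMS).
  var questionSet = [
    {
      id: "like_apples",
      question: "Do you like apples?",
      type: "yes_no",
      position: 1,
      defaultProperties: {visible: true, enabled: true}
    },
    {
      id: "apple_colour",
      question: "Do you like RED apples?",
      type: "yes_no",
      position: 2,
      defaultProperties: {visible: true, enabled: false},
      // enabled only once the 'provider' question has the required value
      ifProvider: {
        providerId: "like_apples",
        value: "Yes",
        properties: {enabled: true}
      }
    },
    {
      id: "eaten_red_apple",
      question: "Have you eaten a red apple today?",
      type: "yes_no",
      position: 3,
      defaultProperties: {visible: true, enabled: false},
      ifProvider: {
        providerId: "apple_colour",
        value: "Yes",
        properties: {enabled: true}
      }
    }
  ];

Whatever the exact field names, the important pieces are the defaults for each question, its position in the form, and an ifProvider block naming the provider question, the value to watch for, and the properties to apply when that value is set.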

React to the rescue

The form is becoming a fairly complex little web application, and we needed a front-end framework to help manage the interactions on the page. Quite early on we decided to use React, a ‘JavaScript library for building user interfaces’ from Facebook. React allows us to design simple components and compose them into a more complex UI.

React encourages a one-way data flow: from a single source of truth, passed down into child components via properties. Following this principle helped identify the appropriate location in the component hierarchy for maintaining state: the top-level QuestionForm component.

[Screenshot: the entry form with its components highlighted]

Component hierarchy for the entry form:
  1. QuestionForm (red)
  2. QuestionField (orange)
  3. Sub-components: QuestionInstructions, QuestionHeader, and QuestionComments (green)
Changing values in the QuestionFields will update the state maintained in the QuestionForm, triggering a re-render of child components where necessary (all managed by React). This made it easy for one QuestionField to change its visible properties (visibility, enabled, etc.) when the user changes the value of another field, as determined by our Question Set Schema; a minimal sketch of this flow follows the list below. You can see the code for the entry form React UI on GitHub. Some other benefits of using React:
  • it was fairly easy to write automated tests for the entry form, using Enzyme
  • we can render the initial state of the form on the server and send it to the page template using our web application framework (Express)
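To illustrate that flow, here is a minimal, hypothetical sketch – not the actual survey code, which is in the GitHub repository mentioned above – of a QuestionForm that owns the answer values and derives each field’s properties from the question-set rules. It uses React.createElement rather than JSX so it runs without a build step, and it reuses the illustrative schema shape sketched earlier.

  // Hypothetical sketch of the one-way data flow: QuestionForm owns the state,
  // QuestionField just renders what it is given. Not the real survey components.
  var React = require('react');

  // Merge a question's default properties with its ifProvider overrides,
  // based on the current answer values (same shape as the schema sketch above).
  function resolveProperties(question, values) {
    var props = Object.assign({}, question.defaultProperties);
    var rule = question.ifProvider;
    if (rule && values[rule.providerId] === rule.value) {
      props = Object.assign(props, rule.properties);
    }
    return props;
  }

  // A deliberately tiny QuestionField: a label plus a yes/no select.
  function QuestionField(props) {
    return React.createElement('label',
      {style: {display: props.properties.visible ? 'block' : 'none'}},
      props.question.question,
      React.createElement('select', {
        disabled: !props.properties.enabled,
        value: props.value || '',
        onChange: function (e) { props.onChange(e.target.value); }
      },
        React.createElement('option', {value: ''}, '-'),
        React.createElement('option', {value: 'Yes'}, 'Yes'),
        React.createElement('option', {value: 'No'}, 'No')
      )
    );
  }

  class QuestionForm extends React.Component {
    constructor(props) {
      super(props);
      this.state = {values: {}}; // single source of truth for all answers
    }
    handleChange(id, value) {
      var values = Object.assign({}, this.state.values);
      values[id] = value;
      this.setState({values: values}); // triggers a re-render of the fields
    }
    render() {
      var self = this;
      return React.createElement('form', null,
        this.props.questionSet.map(function (q) {
          return React.createElement(QuestionField, {
            key: q.id,
            question: q,
            value: self.state.values[q.id],
            // re-computed on every change, so dependent questions enable/disable
            properties: resolveProperties(q, self.state.values),
            onChange: function (value) { self.handleChange(q.id, value); }
          });
        })
      );
    }
  }

Rendering the initial state of such a form on the server (the Express point above) can then be done with ReactDOMServer.renderToString, although the production code will of course differ in its details.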

Developing in the Open

As with all of Open Knowledge International’s projects, the Open Data Survey is developed in the open and available as open source software: Open Data Survey on GitHub.

Mobile data collection

- December 16, 2014 in data-collection, mobile, Skillhare, tech

This blog post is based on the School of Data skillshare I hosted on mobile data collection. Thanks to everyone who took part in it!
Recently, mobile has become an increasingly popular method of data collection, done through an application or electronic form on a mobile device such as a smartphone or a tablet. These devices offer innovative ways to gather data regardless of the time and location of the respondent. The benefits of mobile data collection are obvious, such as quicker response times and the possibility of reaching previously hard-to-reach target groups. In this blog post I share some of the tools that I have been using, and developing applications on top of, for the past five years.
  1. Open Data Kit
Open Data Kit (ODK) is a free and open-source set of tools which help researchers author, field, and manage mobile data collection solutions. ODK provides an out-of-the-box solution for users to:
  • Build a data collection form or survey;
  • Collect the data on a mobile device and send it to a server; and
  • Aggregate the collected data on a server and extract it in useful formats.
ODK allows data collection using mobile devices and data submission to an online server, even without an Internet connection or mobile carrier service at the time of data collection. ODK, which uses the Android platform, supports a wide variety of questions in the electronic forms, such as text, number, location, audio, video, image and barcodes.
  2. CommCare
CommCare is an open-source mobile platform designed for data collection, client management, decision support, and behavior change communication. CommCare consists of two main technology components: CommCare Mobile and CommCareHQ. The mobile application is used by client-facing community health workers/enumerators during visits as a data collection and educational tool, and includes optional audio, image, GPS and video prompts. Users access the application-building platform through the CommCareHQ website, which is operated on a cloud-based server. CommCare supports J2ME feature phones, Android phones, and Android tablets, and can capture photos and GPS readings. It supports multiple languages and non-Roman scripts, as well as the integration of multimedia (image, audio, and video). The CommCare mobile versions allow applications to run offline, and collected data can be transmitted to CommCareHQ when mobile (GPRS) or Internet (Wi-Fi) connectivity becomes available.
  3. GeoODK
GeoODK provides a way to collect and store geo-referenced information, along with a suite of tools to visualize, analyze and manipulate ground data for specific needs. It enables an understanding of the data for decision-making, research, business, disaster management, agriculture and more. It is based on the Open Data Kit (ODK), but has been extended with offline/online mapping functionality, the ability to add custom map layers, and new spatial widgets for collecting points, polygons and GPS traces.
This one blog post cannot cover each and every tool for mobile data collection, but other tools that can be used for the task, each with their own unique features, include OpenXData and EpiSurveyor.
Why Use Mobile Technology in Collecting Data
There are several advantages to using mobile technology in collecting data, some of which include:
  • it is harder to skip questions;
  • immediate (real-time) access to the data from the server, which also makes data aggregation and analysis very rapid;
  • a smaller workforce and hence a lower cost of data collection, by cutting out data entry personnel;
  • enhanced data security through data encryption;
  • the ability to collect many data types, such as audio, video, barcodes and GPS locations;
  • increased productivity by skipping the data entry middleman;
  • savings on the printing, storage and management of documents associated with paper-based data collection.

Latin American experiences in data platforms: inegifacil.com

- June 5, 2014 in community, HowTo, School_Of_Data, tech

At Escuela de Datos we collect stories on the use and appropriation of data-related projects in Latin America and Spain. Today, we share Boris Cuapio Morales’ story – he is the developer behind INEGI Fácil, a tool that allows easy searching of the data offered by the not-very-user-friendly web service of Mexico’s National Statistics Institute. We thank Boris for his time and help with this interview, which was carried out at Data Meetup Puebla. This post was originally published in Spanish by Mariel García, community manager at Escuela de Datos.


“When I was in university, I was always struck by the fact that media like The New York Times always linked to the websites of US-based information systems. I always asked myself: why is it that we don’t have such systems in Mexico, too?”
Boris Cuapio is an illustrator turned programmer who lives in Puebla, Mexico. In late 2012, he intended to use data from the National Statistics Institute (INEGI) and found their web service. But he wasn’t qualified enough to use it. End of the story.

That is, until late 2013. Boris had spent some time working for Canadian clients that requested products incorporating the APIs of social networks like Twitter or Flickr, which forced him to learn what he needed in order to use the web service. His workmates encouraged him to start personal programming projects in his free time in order not to lose practice, and so he thought of a new one: to try to display INEGI data in an easier, more accessible way.

That is how INEGI Fácil (Easy INEGI) was born. It is a website that queries the web service of inegi.gob.mx and shows the results in tables and graphics. Is there value in the fact that a citizen, rather than the government, was behind this project? Boris thinks the speed of institutional processes would not allow the system to undertake the technological adoptions that are necessary in services of this sort. For example: while INEGI provides data in XML (a heavy format that has been gradually abandoned in other services), INEGI Fácil provides data in JSON, and with additional markers. INEGI has a tutorial that is rather difficult to grasp, whereas INEGI Fácil has a PHP library that makes the task simpler. In terms of user experience there is no comparison – thanks to Hugo, the mastermind behind the design and interaction of the site.

In reality, the government and Boris are not competing. INEGI Fácil launched around July 2013, and in January 2014 Boris was contacted by someone at INEGI. In short, they were surprised that someone was actually using the web service. When the service increases from its two current data sources to the hundred they expect to have, Boris will be a beta tester.
This project has allowed him to learn a lot about JS, XML, PHP, databases and dataviz – how to make graphics, how to export data. And he likes this type of work; he wants it to stop being a hobby and become his main project instead. He wants to sell the product to universities, which are the institutions that use INEGI data the most. But, meanwhile, he hopes all the indexes will be searchable by the end of this month, and that INEGI Fácil will soon be accessible from mobiles. In a year, if he can find financing, he hopes INEGI Fácil will become a web service parallel to that of INEGI itself; in other words, one used by media outlets through site embeds. His dream: he wants his own university to start doing information design, graphics, instructional catalogues, educational texts and other materials based on the data they can extract through INEGI Fácil.
Tip from Boris (which, according to him, he gives to everybody, even though he doesn’t know anyone besides himself who has followed it): “Gather your money and buy SafariBooks when you know what it is you want to learn. I have learned most things through them!”
In Mexico, platforms based on open data are being developed. Do you know others in Latin America? We would love to share their stories at Escuela de Datos.

Data Roundup, 16 April

- April 16, 2014 in air, Cities, Data Roundup, deaths, England, Google, InfoAmazonia, International journalism festival, Landline, Lobbying, pollution, ProPublica, resilient, Stateline, tech, top tweets, world

[Image: “saudades da Amazônia” by Ana_Cotta]

Tools, Events, Courses

On Wednesday the 30th, the eighth edition of the International Journalism Festival will take place in Perugia. The event has become one of the most important of its kind in Europe, and it will host hundreds of journalists from all over the world. The festival will also be the location of the 2014 School of Data Journalism, now in its third edition, jointly organized by the European Journalism Centre and the Open Knowledge Foundation. The School will start on May 1st and will see the participation of 25 instructors from world-leading newspapers, universities, and think tanks.

ProPublica just announced the release of two JavaScript libraries. The first one, Landline, helps developers turn GeoJSON data into SVG browser-side maps. The second, Stateline, is built on top of it and facilitates the process of creating US choropleth maps.

Data Stories

Chris Michael from the Guardian Data Blog recently published a short article listing the world’s most resilient cities. Michael extracted data from a study by Grosvenor, a London-based company which measured resilience by assigning a value to cities’ vulnerability to environmental changes and their capacity to face political or economic threats.

British citizens might be interested in the quality of the air they breathe every day. Those who are worried about air pollution should take a look at George Arnett’s interactive choropleth map showing the percentage of deaths caused by particulate air pollution in England.

What’s the role of the world’s tech giants in politics? Tom Hamburger and Matea Gold tried to explain it in this article in the Washington Post by observing the evolution of Google’s lobbying activities at the White House. Google’s political influence has increased enormously since 2002, making the company the second-largest spender on lobbying in the US.

Are all conservatives conservative in the same way, or is there a certain degree of moderation among them and towards different issues? On his newly relaunched FiveThirtyEight, Nate Silver tackles the question by displaying data on the “partisan split” between the two US parties on several main topics.

If you are Catholic, or maybe just curious, you should be very interested in The Visual Agency’s latest infographic, which represents, through a series of vertical patterns, the number, geographical area, and social level of the professions of all Catholic saints.

Gustavo Faleiros, an ICFJ Knight International Journalism Fellow, is about to present InfoAmazonia to the public: a new data journalism site which will be monitoring environmental changes in the southern part of South America using both satellite and on-the-ground data. In addition, as environmental changes increase, so does the number of deaths of environmental and land defenders. The Global Witness team has just published its latest project, Deadly Environment, a 28-page report containing data and important insights on the rise of this phenomenon, which is expanding year by year, especially in South America.

Data Sources

Michael Corey is a news app developer who was involved in building the National Public Radio mini-site Borderland. In this post, he analyses the main features of the geographical digital tools he used to collect and display data on the US–Mexico border, which helped him correctly locate the fences built by the US government along the line that separates the two countries.
The data-driven journalism community is expanding rapidly, especially on Twitter. If you want a useful recap of what has been tweeted and retweeted by data lovers, the Global Investigative Journalism Network’s #ddj top ten is what you need.

Data Roundup, 12 March

- March 12, 2014 in acquisition, ampp3d, Companies, Data Roundup, data visualization, forbes, Google Analytics, knowledge is beautiful, lunch time, mccandless, NYT, R, record, tech, Upshot, weather radials, womentechafrica

[Image: “Code” by mutednarayan]

Tools, Events, Courses

Don’t miss the opportunity to design one of the pages of Knowledge is Beautiful, the next book from David McCandless. The challenge is open until March 24 and is well rewarded, with prizes totalling five thousand dollars. Ampp3d, the Trinity Mirror-owned data journalism site, has launched its own competition too. Aspiring journalists have to develop a mobile-friendly data visualization which will be published on the Ampp3d website; the winner gets a hundred-pound prize.

R is one of the top choices when it comes to programming languages for data visualization. Here you may find a tutorial from Daniel Waisberg on how to display Google Analytics data with it.

The New York Times is about to reveal Upshot, its new data-driven website focused on politics and economics, which will replace Nate Silver’s FiveThirtyEight. Read some updates here.

Data Stories

This week we would like to start by presenting a series of infographics that are detailed as well as interesting. The funniest one is surely “Twelve world records you can break during your lunch hour”, posted by ChairOffice on Visual.ly. Big tech companies mean big business transactions: watch this interactive explanation from Simplybusiness on the history of the biggest tech giant acquisitions. Among the others mentioned above, we strongly recommend you see Weather Radials, a poster representing all the climate changes that occurred in 35 cities around the world last year, which is also a data visualization masterpiece to admire. For a deeper understanding of visualization, take a moment to read this article written by Dorie Clark on the Forbes website, which reminds us why “Data Visualization is the Future”.

Data Sources

See how tech enterprises and organizations are spreading across Africa in this map on WomenTechAfrica. The toolkit of a data addict is growing every day, and sometimes you have to choose the right tool for your own project. Here is a short list from Jerry Vermanen of software and programs that can be used for data extraction, filtering, and visualization.

An Introduction to Cargo-Culting D3 Visualizations

- September 2, 2013 in d3, HowTo, javascript, tech, Tinkering, underscore, visualisation



D3 is a fantastic framework for visualization – besides its beautiful outputs, the framework itself is elegantly designed. However, its elegance and its combination of several technologies (JavaScript + SVG + DOM manipulation) make D3 very hard to learn. Luckily, there is a huge range of D3 examples out there.

Cargo cults to the rescue… Cargo cults were the way Melanesians responded when they were suddenly confronted with military forces in the Second World War: strangers arrived wearing colorful clothes, marched up and down strips of land, and suddenly parachutes with cargo crates were dropped. After the strangers left, the islanders were convinced that, if they just behaved like the strangers, cargo would come to them. Like the islanders, we don’t have to understand the intricate details of D3 – we just have to perform the right rituals to be able to do the same visualizations… Let’s, for example, try to re-create a Reingold–Tilford tree of the Goldman Sachs network.
Things I’ll work with:
  • the Reingold–Tilford tree example
  • the Goldman Sachs network over at OpenCorporates (you can get the data in their JSON format)

The first thing I always look at is how the data is formatted for the example. Most of the examples provide data with them that helps you to understand how the data is laid out for this specific example. This also means that if we bring our data into the same shape, we’ll be able to use our data in the example. So let’s look at the “flare.json” format that is used in the example. It is a very simple JSON format where each entry has two attributes: a name and an array of children – this forms the tree for our tree layout. The data we have from OpenCorporates looks different: there we always have a parent and a child – but let’s bring it into the same form.

First, I’ll get the JavaScript bits I’ll need: D3 and Underscore (I like Underscore for its functions that make working with data a lot nicer). Then I’ll create an HTML page similar to the example, check in the code where the “flare.json” file is loaded, and put my own JSON in its place (the one from OpenCorporates linked above) (commit note: that changes later on, since OpenCorporates does not do CORS…). I then convert the data using a JavaScript function (in our case a recursive one that crawls down the tree):
  // Recursively build a flare.json-style tree ({name, children}) from the
  // flat parent/child records we got from OpenCorporates.
  var getChildren = function(r, parent) {
    var c = _.filter(r, function(x) { return x.parent_name == parent; });
    if (c.length) {
      return {
        name: parent,
        children: _.map(c, function(x) { return getChildren(r, x.child_name); })
      };
    } else {
      return { name: parent };
    }
  };
  root = getChildren(root, "GOLDMAN SACHS HOLDINGS (U.K.)");
Finally, I adjust the layout a bit to make space for the labels etc. The final result is on GitHub (full repository).

Exploring IATI funders in Kenya, part I – getting the data

- August 21, 2013 in scraping, tech, Tinkering

The International Aid Transparency Initiative (IATI) collects data from donor organizations on various projects done within countries. Whom donors fund is often an interesting question to ask. In this little two-part project I will try to explore the donors who publish their data on IATI, taking a close look at funding activities in Kenya. First we will scrape and collect the data. I will use Python as a programming language – but you could do it in any other programming language. (If you don’t know how to program, don’t worry.) I’ll use Gephi to analyze and visualize the network in the next step. Now let’s see what we can find…

The IATI’s data is collected on their data platform, the IATI registry. This site mainly links to the data in XML format. The IATI standard is an XML-based standard that took policy and data wonks years to develop – but it’s an accepted standard in the field, and many organizations use it. Python, luckily, has good tools to deal with XML. The IATI registry also has an API, with examples on how to use it. It returns JSON, which is great, since that’s even simpler to deal with than XML. Let’s go!

From the examples, we learn that we can use the following URL to query for activities in a specific country (Kenya in our example):
http://www.iatiregistry.org/api/search/dataset?filetype=activity&country=KE&all_fields=1&limit=200
If you take a close look at the resulting JSON, you will notice it’s an array ([]) of objects ({}) and that each object has a download_url attribute. This is the link to the XML report. First we’ll need to get all the URLs:
>>> import urllib2, json #import the libraries we need

>>> url="http://www.iatiregistry.org/api/search/" + "datasetfiletype=activity&country=KE&all_fields=1&limit=200"
>>> u=urllib2.urlopen(url)
>>> j=json.load(u)
>>> j['results'][0]
That gives us:
{u'author_email': u'iatidata@afdb.org',
 u'ckan_url': u'http://iatiregistry.org/dataset/afdb-kenya',
 u'download_url': u'http://www.afdb.org/fileadmin/uploads/afdb/Documents/Generic-Documents/IATIKenyaData.xml',
 u'entity_type': u'package',
 u'extras': {u'activity_count': u'5',
  u'activity_period-from': u'2010-01-01',
  u'activity_period-to': u'2011-12-31',
  u'archive_file': u'no',
  u'country': u'KE',
  u'data_updated': u'2013-06-26 09:44',
  u'filetype': u'activity',
  u'language': u'en',
  u'publisher_country': u'298',
  u'publisher_iati_id': u'46002',
  u'publisher_organization_type': u'40',
  u'publishertype': u'primary_source',
  u'secondary_publisher': u'',
  u'verified': u'no'},
 u'groups': [u'afdb'],
 u'id': u'b60a6485-de7e-4b76-88de-4787175373b8',
 u'index_id': u'3293116c0629fd243bb2f6d313d4cc7d',
 u'indexed_ts': u'2013-07-02T00:49:52.908Z',
 u'license': u'OKD Compliant::Other (Attribution)',
 u'license_id': u'other-at',
 u'metadata_created': u'2013-06-28T13:15:45.913Z',
 u'metadata_modified': u'2013-07-02T00:49:52.481Z',
 u'name': u'afdb-kenya',
 u'notes': u'',
 u'res_description': [u''],
 u'res_format': [u'IATI-XML'],
 u'res_url': [u'http://www.afdb.org/fileadmin/uploads/afdb/Documents/Generic-Documents/IATIKenyaData.xml'],
 u'revision_id': u'91f47131-529c-4367-b7cf-21c8b81c2945',
 u'site_id': u'iatiregistry.org',
 u'state': u'active',
 u'title': u'Kenya'}
So j['results'] is our array of result objects, and we want to get the
download_url properties from its members. I’ll do this with a list comprehension.
>>> urls=[i['download_url'] for i in j['results']]
>>> urls[0:3]
[u'http://www.afdb.org/fileadmin/uploads/afdb/Documents/Generic-Documents/IATIKenyaData.xml', u'http://www.ausaid.gov.au/data/Documents/AusAID_IATI_Activities_KE.xml', u'http://www.cafod.org.uk/extra/data/iati/IATIFile_Kenya.xml']
Fantastic. Now we have a list of URLs of XML reports. Some people might say, “Now
you’ve got two problems” – but I think we’re one step further. We can now go and explore the reports. Let’s start to develop what we want to do
with the first report. To do this we’ll need lxml.etree, the XML library. Let’s use it to parse the XML from the first URL we grabbed.
>>> import lxml.etree
>>> u=urllib2.urlopen(urls[0]) #open the first url
>>> r=lxml.etree.fromstring(u.read()) #parse the XML
>>> r
<Element iati-activities at 0x2964730>
Perfect: we have now parsed the data from the first URL. To understand what we’ve done, why don’t you open the XML report
in your browser and look at it? Notice that every activity has its own
tag iati-activity and below that a recipient-country tag which tells us which country gets the money. We’ll want to make sure that we only
include activities in Kenya – some reports contain activities in multiple
countries – and therefore we have to select for this. We’ll do this using the
XPath query language.
>>> rc=r.xpath('//recipient-country[@code="KE"]')
>>> activities=[i.getparent() for i in rc]
>>> activities[0]
<Element iati-activity at 0x2972500>
Now we have an array of activities – a good place to start. If you look closely
at activities, there are several participating-org entries with different
roles. The roles we’re interested in are “Funding” and “Implementing”: who
gives money and who receives money. We don’t care right now about amounts.
>>> funders=activities[0].xpath('./participating-org[@role="Funding"]')
>>> implementers=activities[0].xpath('./participating-org[@role="Implementing"]')
>>> print funders[0].text
>>> print implementers[0].text
Special Relief Funds
WORLD FOOD PROGRAM - WFP - KENYA OFFICE
Ok, now let’s group them together, funders first, implementers later. We’ll do
this with a list comprehension again.
>>> e=[[(j.text,i.text) for i in implementers] for j in funders]
>>> e
[[('Special Relief Funds', 'WORLD FOOD PROGRAM - WFP - KENYA OFFICE')]]
Hmm. There’s one bracket too many. I’ll remove it with a reduce…
>>> e=reduce(lambda x,y: x+y,e,[])
>>> e
[('Special Relief Funds', 'WORLD FOOD PROGRAM - WFP - KENYA OFFICE')]
Now we’re talking. Because we’ll do this for each activity in each report, we’ll
put it into a function.
>>> def extract_funders_implementers(a):
...     f=a.xpath('./participating-org[@role="Funding"]')
...     i=a.xpath('./participating-org[@role="Implementing"]')
...     e=[[(j.text,k.text) for k in i] for j in f]
...     return reduce(lambda x,y: x+y,e,[])
...
>>> extract_funders_implementers(activities[0])
[('Special Relief Funds', 'WORLD FOOD PROGRAM - WFP - KENYA OFFICE')]
Now we can do this for all the activities!
>>> fis=[extract_funders_implementers(i) for i in activities]
>>> fis
[[('Special Relief Funds', 'WORLD FOOD PROGRAM - WFP - KENYA OFFICE')], [('African Development Fund', 'KENYA  NATIONAL  HIGHWAY  AUTHORITY')], [('African Development Fund', 'MINISTRY OF WATER DEVELOPMENT')], [('African Development Fund', 'MINISTRY OF WATER DEVELOPMENT')], [('African Development Fund', 'KENYA ELECTRICITY TRANSMISSION CO. LTD')]]
Yiihaaa! But we have more than one report, so we’ll need to create a
function here as well to do this for each report….
>>> def process_report(report_url):
...     try:
...         u=urllib2.urlopen(report_url)
...         r=lxml.etree.fromstring(u.read())
...     except urllib2.HTTPError:
...         return []  # return an empty array if something goes wrong
...     activities=[i.getparent() for i in
...                 r.xpath('//recipient-country[@code="KE"]')]
...     return reduce(lambda x,y: x+y,[extract_funders_implementers(i) for i in activities],[])
...
>>> process_report(urls[0])
Works great – notice how I removed the additional brackets by using another
reduce? Now guess what? We can do this for all the reports! – ready? Go!
>>> fis=[process_report(i) for i in urls]
>>> fis=reduce(lambda x,y: x+y, fis, [])
>>> fis[0:10]
[('Special Relief Funds', 'WORLD FOOD PROGRAM - WFP - KENYA OFFICE'), ('African Development Fund', 'KENYA  NATIONAL  HIGHWAY  AUTHORITY'), ('African Development Fund', 'MINISTRY OF WATER DEVELOPMENT'), ('African Development Fund', 'MINISTRY OF WATER DEVELOPMENT'), ('African Development Fund', 'KENYA ELECTRICITY TRANSMISSION CO. LTD'), ('CAFOD', 'St. Francis Community Development Programme'), ('CAFOD', 'Catholic Diocese of Marsabit'), ('CAFOD', 'Assumption Sisters of Nairobi'), ('CAFOD', "Amani People's Theatre"), ('CAFOD', 'Resources Oriented Development Initiatives (RODI)')]
Good! Now let’s save this as a CSV:
>>> import csv
>>> f=open("kenya-funders.csv","wb")
>>> w=csv.writer(f)
>>> w.writerow(("Funder","Implementer"))
>>> for i in fis:
...     w.writerow([j.encode("utf-8") if j else "None" for j in i])
... 
>>> f.close()
Now we can clean this up using Refine and examine it with Gephi in part II.

Scraping PDFs with Python and the scraperwiki module

- August 16, 2013 in pdf, Python, scraping, tech


While for simple single- or double-page tables Tabula is a viable option, if you have PDFs with tables over multiple pages you’ll soon grow old marking them. This is where you’ll need some scripting. Thanks to the scraperwiki library (pip install scraperwiki) and its included function pdftoxml, scraping PDFs has become a feasible task in Python. At a recent Hacks/Hackers event we ran into a candidate that was quite tricky to scrape, so I decided to document the process here.

First import the scraperwiki library and urllib2 – since the file we’re using is on a webserver – then open and parse the document…

import scraperwiki, urllib2

u=urllib2.urlopen("http://images.derstandard.at/2013/08/12/VN2p_2012.pdf") #open the url for the PDF
x=scraperwiki.pdftoxml(u.read()) # interpret it as xml
print x[:1024] # let's see what's in there abbreviated...
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="0.22.5">
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
    <fontspec id="0" size="8" family="Times" color="#000000"/>
    <fontspec id="1" size="7" family="Times" color="#000000"/>
<text top="42" left="64" width="787" height="12" font="0"><b>TABELLE VN2Ap/1                         
                                                  30/07/13  11.38.44  BLATT    1 </b></text>
<text top="58" left="64" width="718" height="12" font="0"><b>STATISTIK ALLER VORNAMEN (TEILWEISE PHONETISCH ZUSAMMENGEFASST, ALPHABETISCH SORTIERT) FÜR NEUGEBORENE KNABEN MIT </b></text>
<text top="73" left="64" width="340" height="12" font="0"><b>ÖSTERREICHISCHER STAATSBÜRGERSCHAFT 2012  - ÖSTERREICH </b></text>
<text top="89" left="64" width="6" height="12" font="0"><b> </b></text>
<text top="104" left="64" width="769" height="12" font="0"><b>VORNAMEN                  ABSOLUT      
%   
As you can see above, we have successfully loaded the PDF as xml (take a look at the PDF by just opening the url given; it should give you an idea of how it is structured). The basic structure of a pdf parsed this way will always be page tags followed by text tags containing the information, positioning and font information. The positioning and font information can often help to get the table we want – however not in this case: everything is font="0" and left="64". We can now use xpath to query our document…

import lxml.etree
r=lxml.etree.fromstring(x)
r.xpath('//page[@number="1"]')
[<Element page at 0x31c32d0>]
and also get some lines out of it:

r.xpath('//text[@left="64"]/b')[0:10] #array abbreviated for legibility
[<Element b at 0x31c3320>,
 <Element b at 0x31c3550>,
 <Element b at 0x31c35a0>,
 <Element b at 0x31c35f0>,
 <Element b at 0x31c3640>,
 <Element b at 0x31c3690>,
 <Element b at 0x31c36e0>,
 <Element b at 0x31c3730>,
 <Element b at 0x31c3780>,
 <Element b at 0x31c37d0>]
r.xpath('//text[@left="64"]/b')[8].text
u'Aaron *                        64       0,19       91               Aim\xe9                        
1       0,00      959 '
Great – this will help us. If we look at the document, you’ll notice that all the boys’ names are on pages 1-20 and the girls’ names on pages 21-43 – let’s get them separately…

boys=r.xpath('//page[@number<="20"]/text[@left="64"]/b')
girls=r.xpath('//page[@number>"20" and @number<="43"]/text[@left="64"]/b')
print boys[8].text
print girls[8].text
Aaron *                        64       0,19       91               Aimé                            1
   0,00      959 
Aarina                          1       0,00    1.156               Alaïa                           1
   0,00    1.156 
Fantastic – but you’ll also notice something: the columns are all there, separated by whitespace. And also Aaron has an asterisk – we want to remove it (the asterisk is explained in the original doc). To split it up into columns, I’ll create a small function using regexes.

import re

def split_entry(e):
    return re.split("[ ]+",e.text.replace("*","")) # we're removing the asterisk here as well...

Now let’s apply it to boys and girls:

boys=[split_entry(i) for i in boys]
girls=[split_entry(i) for i in girls]
print boys[8]
print girls[8]
[u'Aaron', u'64', u'0,19', u'91', u'Aim\xe9', u'1', u'0,00', u'959', u'']
[u'Aarina', u'1', u'0,00', u'1.156', u'Ala\xefa', u'1', u'0,00', u'1.156', u'']
That worked! Notice the empty string u'' at the end? I’d like to filter it out. I’ll do this using the ifilter function from itertools.

import itertools
boys=[[i for i in itertools.ifilter(lambda x: x!="",j)] for j in boys]
girls=[[i for i in itertools.ifilter(lambda x: x!="",j)] for j in girls]
print boys[8]
print girls[8]
[u'Aaron', u'64', u'0,19', u'91', u'Aim\xe9', u'1', u'0,00', u'959']
[u'Aarina', u'1', u'0,00', u'1.156', u'Ala\xefa', u'1', u'0,00', u'1.156']
Worked – this cleaned up our boys and girls arrays. We still want to split them up properly, though: each row holds two columns, each four fields wide. I’ll do this with a little function.

def take4(x):
    if (len(x)>5):
        return [x[0:4],x[4:]]
    else:
        return [x[0:4]]

boys=[take4(i) for i in boys]
girls=[take4(i) for i in girls]
print boys[8]
print girls[8]
[[u'Aaron', u'64', u'0,19', u'91'], [u'Aim\xe9', u'1', u'0,00', u'959']]
[[u'Aarina', u'1', u'0,00', u'1.156'], [u'Ala\xefa', u'1', u'0,00', u'1.156']]
Ah, that worked nicely! Now let’s make sure it’s one array with both options in it – for this I’ll use reduce.

boys=reduce(lambda x,y: x+y, boys, [])
girls=reduce(lambda x,y: x+y, girls,[])
print boys[10]
print girls[10]
['Aiden', '2', '0,01', '667']
['Alaa', '1', '0,00', '1.156']
Perfect – now let’s add a gender to the entries.

for x in boys:
    x.append("m")
for x in girls:
    x.append("f")

print boys[10]
print girls[10]
['Aiden', '2', '0,01', '667', 'm']
['Alaa', '1', '0,00', '1.156', 'f']
We got that! For further processing I’ll join the arrays up.

names=boys+girls
print names[10]
['Aiden', '2', '0,01', '667', 'm']
Let’s take a look at the full array…

names[0:10]
[['TABELLE', 'VN2Ap/1', '30/07/13', '11.38.44', 'm'],
 ['BLATT', '1', 'm'],
 [u'STATISTIK', u'ALLER', u'VORNAMEN', u'(TEILWEISE', 'm'],
 [u'PHONETISCH',
  u'ZUSAMMENGEFASST,',
  u'ALPHABETISCH',
  u'SORTIERT)',
  u'F\xdcR',
  u'NEUGEBORENE',
  u'KNABEN',
  u'MIT',
  'm'],
 [u'\xd6STERREICHISCHER', u'STAATSB\xdcRGERSCHAFT', u'2012', u'-', 'm'],
 ['m'],
 ['VORNAMEN', 'ABSOLUT', '%', 'RANG', 'm'],
 ['VORNAMEN', 'ABSOLUT', '%', 'RANG', 'm'],
 ['m'],
 ['INSGESAMT', '34.017', '100,00', '.', 'm']]
Notice there is still quite a bit of mess in there: basically all the lines starting with an all-caps entry, "der", "m" or "f". Let’s remove them….

names=itertools.ifilter(lambda x: not x[0].isupper(),names) # remove allcaps entries
names=[i for i in itertools.ifilter(lambda x: not (x[0] in ["der","m","f"]),names)]
# remove all entries that are "der","m" or "f"
names[0:10]
[['Aiden', '2', '0,01', '667', 'm'],
 ['Aiman', '3', '0,01', '532', 'm'],
 [u'Aaron', u'64', u'0,19', u'91', 'm'],
 [u'Aim\xe9', u'1', u'0,00', u'959', 'm'],
 ['Abbas', '2', '0,01', '667', 'm'],
 ['Ajan', '2', '0,01', '667', 'm'],
 ['Abdallrhman', '1', '0,00', '959', 'm'],
 ['Ajdin', '15', '0,04', '225', 'm'],
 ['Abdel', '1', '0,00', '959', 'm'],
 ['Ajnur', '1', '0,00', '959', 'm']]
Woohoo – we have a cleaned-up list. Now let’s write it as csv….

import csv
f=open("names.csv","wb") #open file for writing
w=csv.writer(f) #open a csv writer
w.writerow(["Name","Count","Percent","Rank","Gender"]) #write the header
for n in names:
    w.writerow([i.encode("utf-8") for i in n]) #write each row
f.close()

Done – we’ve scraped a multi-page PDF using Python. All in all, this was a fairly quick way to get the data out of a PDF using the scraperwiki module.

Using Redis as a rapid discovery and prototyping tool.

- April 11, 2013 in tech


Inspired by Rufus’ SQL for Data Analysis post, Iain Emsley wrote up his experience of using Redis for quick inspection of data. He writes:

Recently I was looking at some data about tweets and wanted to get a better idea of the number of users and the number of messages per user, and to store the texts so I could run ad-hoc queries and prototype a web interface. I felt that using a relational database would prove too unwieldy if I wanted to make quick changes, so I thought I would try Redis. I had an idea of what I was looking for, but was not entirely sure that the structures I had in mind would answer the query, and using SQL potentially meant having to create or alter tables. Equally, I wanted something which could be persistent, rather than having to recalculate the raw data each time I wanted to look at it.

Having fetched the JSON data from the public search, I saved it to a local directory, which meant that I could comment out the URL handling and re-run the code with the same data each time.

I had used Redis before for other projects and knew that it was more than just a key-value server. It supports sets, sorted sets, lists and hashes out of the box, with a simple set of commands. It can also persist data as well as store it in memory. The Redis website is an invaluable and well-written resource for further Redis commands. Armed with this and some Python, I was ready to begin diving into the data.

First, I created a set called ‘comments’ which contained all the keys relevant to the dataset, such as names, using the SADD command. By doing this, I can query what the names are at a later date, or query the number of members the set has. As each tweet was parsed, I added the name to the set, which meant that I could capture every name but without duplicates.

Using the user’s name as a key, I would then build up a set of counts and store the tweets for that correspondent. As each tweet was parsed, I created a key for the count, such as count::<name>. Against this I asked Redis to increase the count using INCR. Rather than having to check if the key exists and then increment the count, Redis will increase the count, or just start it if the key doesn’t exist yet.

As each count was created, I then stored the text in a simple list against the correspondent, using the RPUSH command to add the new data to the end of the list. Using a key of the form tweets::<name> meant that I could store the tweets and present them on a web page at a later date. By storing the time in the value, I could run some very basic time queries, but it also meant that I could re-run other queries to look at books mentioned, or any mentions of other Twitter users (Twitter’s internal representation of these does not appear to be complete – something I discovered doing this work).

  import json
  import redis
  import glob
  import unicodedata 
  from urllib2 import urlopen, URLError
  
  rawtxt = '/path/to/data/twitter/'
  tag = '' #set this to be the tag to search: %23okfest
  
  for i in range(1,23):
      mievurl = 'http://search.twitter.com/search.json?q='+ tag +'&page='+str(i)
  
      turl = urlopen(mievurl)
      fh = open(rawtxt +str(i)+'.txt', 'wb').write(turl.read())
  
  r = redis.StrictRedis(host='localhost', port=6379, db=0)
  
  txtf = glob.glob(rawtxt +'*.txt')
  for ft in txtf:
      fh = open(ft).read()   
      
      data = json.loads(fh)
  
      for d in data['results']:
          if not d.get('to_user',None): d['to_user']  = ''
          
          #ensure the text is normalised unicode
          d['text'] = unicodedata.normalize('NFKD', d['text'])
        
          r.sadd('comments', d['from_user']) #add user to the set
          r.incr('count::'+ str(d['from_user'])) #count how may times they occur
          r.rpush('tweets::'+ str(d['from_user']), str(d['created_at']) + '::'+unicode(d['text'])) #store the text and time
    
          #store any mentions in the JSON
          if 'None' not in str(d['to_user']):
              r.rpush('mentions::'+str(d['from_user']), str(d['to_user']))

  members = r.smembers('comments') #get the all the users from the set

  people=[m for m in members]
  counts=[r.get('count::'+member) for member in people]
  tweets=[r.lrange('tweets::'+m,0,-1)[0].split("::")[1] for m in people]

  #dump the raw counts to look at the data
  print counts
  print people
  print tweets


Having used a simple script to create the data, I then used some of the command-line functions in Redis to view the data, and also wrote a very simple website to prototype how it might look. As I had used Redis and stored the raw data, I was able to go back and rewrite or alter the queries easily, to view more data and improve the results with a minimum of trouble.

By using sets, I could keep track of which keys were relevant to this dataset. The flexibility of keys allows you to slice the data to explore it and query it, and even to add to it against different keys or even different data structures as needs change. Rather than having to know the schemas or rewrite SQL queries, Redis only really demands that you know the data structures you want to use. Even if you get these wrong, changing them is a very quick job. It also means that when you retrieve the data, you can manipulate and re-present it at the code level, rather than potentially having to make large changes to a database each time.

Due to its simplicity, I was able to “slice and dice” the data as well as create a quick website to see if the visualisations might work. It has been a huge help in getting some ideas off the page and into code for some future projects. I’ll be keeping these tools in my tool set for the future.

Recently did a data project and learned something new? Contact us and share it with our community!