You are browsing the archive for Adam Kariv.

Introducing Version 1 of the Fiscal Data Package specification

- May 28, 2018 in fiscal data package, Fiscal transparency, Open Fiscal Data, Open Spending

The Fiscal Data Package is a lightweight and user-oriented format for publishing and consuming fiscal data. Fiscal Data Packages are made of simple and universal components, are extremely flexible, can be produced with ordinary spreadsheet software and used in any environment. This specification started about five years ago with a first version (then known as the “Budget Data Package”). Since then we’ve made quite a few iterations, until a fairly stable version was reached, which we named ‘version 0.3’. This version was field-tested in various use cases and scenarios – most prominently by the Government of Mexico, which adopted the Fiscal Data Package for publishing its official budget data. For the past six months we’ve been hard at work reshaping this specification to make it simpler to use and easier to adopt, while improving its flexibility and extensibility – thus making it relevant for more users. In many ways, this new version is the result of the collected experience and lessons learned over the past few years, working with partners and understanding what works and what doesn’t.

So what is the Fiscal Data Package philosophy?

The basic motivation behind Fiscal Data Package is to create a specification which is open by nature – based on other open standards, supported by open tools and software, modular, extensible and promoted transparently by a large community. The Fiscal Data Package is designed to be lightweight and simple to use – providing a small but flexible set of features, based on real-world requirements rather than theoretical ones. All the while, the built-in extensibility allows this spec to adapt to many different use cases and domains. It is also possible to gradually use more and more parts of this specification – progressive enhancement – thus making it easier to implement with existing data while slowly improving data quality. A main concern we wanted to tackle was the ability to work with data as it currently exists, without forcing publishers to modify the contents or structure of their current data files in order to “adapt” them to the specification. This is a big deal, as publishers often publish data that’s the output of existing internal information systems, and requiring them to do any sort of data cleaning or wrangling prior to uploading is a major source of friction for adoption.

And what is it not?

With that in mind, it’s also important to understand what this specification doesn’t handle. This specification is, by design, non-opinionated about which data should be published by publishers – which datasets, which fields, and the internal processes these reflect. The only things Fiscal Data Package is concerned with are how fiscal data should be packaged and providing the means for publishers to best convey the meaning of the data – so it can be optimally used by consumers. In addition, it provides details regarding file formats, data types, metadata and structuring the data in files.

What we learned

As previously mentioned, via a wide range of technical implementations, partner piloting, and fiscal data projects with other civic-tech and data-journalism partners, we’ve learned a lot about what works in Fiscal Data Package v0.3, and what does not. We want to take these learnings and make a more robust and future-proof v1.0 of the specification. One of the first things we noticed wasn’t working was fiscal modelling. Version 0.3 of the specification contained an elaborate system for the modelling of fiscal data. In practice, this system turned out to be too complicated for normal users and error-prone (inconsistent models could be created). On top of that, modelling was not versatile enough to account for the very different source files found among real users, nor was it expressive enough to convey the specific semantics required by these users. A few examples of this strictness include:
  • The predefined set of classifications for dimensions. This hard-coded list did not capture the richness of fiscal data ‘in the wild’, as it contained too few and too broad options.
  • Measure columns were assumed to be of a specific currency, disregarding datasets in which the currency is provided in a separate column (or non-monetary measures).
  • Measure columns were assumed to be of a specific budgeting phase (out of 4 options) and of a single direction (income/expenditure), ignoring datasets which have different phases, or in which the phase or direction is provided in a separate column – or datasets which are not related to budgets at all.
Another lesson learned is about file formats. Contrary to what its name might suggest, the world of fiscal data files is a wild jungle – every sort and form of file exists there (if you just look hard enough). Now, while machines will always prefer to read data files in their denormalised (or unpivoted) form – as it’s the most verbose and straightforward one – publishers will often choose a more compact, pivoted form – and as the proverb goes, there is more than one way to pivot a table. Other publishers would extract some of the data from the file and append it as a separate code-list file, or split large files by year, budget direction or department. Version 0.3 of the specification assumed data files would only be provided in one very specific pivoted form – which might apply in some cases, but in practice failed on many of the other variations we’ve encountered.
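To make the pivoted/unpivoted distinction concrete, here is a small sketch using pandas; the table, column names and figures are invented for illustration only:

```python
import pandas as pd

# A pivoted budget table: one row per line item, one column per phase.
pivoted = pd.DataFrame({
    "admin_classification": ["Health", "Education"],
    "approved": [120, 200],
    "executed": [115, 190],
})

# Unpivoting ("melting") yields the denormalised form machines prefer:
# one row per (line item, phase) pair, with the phase made explicit.
unpivoted = pivoted.melt(
    id_vars=["admin_classification"],
    var_name="phase",
    value_name="amount",
)
print(unpivoted)
```

The pivoted form is what a publisher’s spreadsheet often looks like; the melted form is what a machine wants to consume, and there are many intermediate shapes in between.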

Many different variations

What new features does Fiscal Data Package v1.0 provide?

First of all, it introduces a novel and simple way of supporting a wide variety of data file structures – pivoted and unpivoted, with code lists and without them, provided in a single file or spanning multiple files. To do that, we’ve added three different devices:
  • We added the concept of ‘constant fields’: while still supporting any form of metadata added to the Fiscal Data Package descriptor, adding a field with some constant data is often a cleaner and more complete way of adding missing information to the dataset.
  • We added a built-in facility for ‘unpivoting’ (or denormalising) the source data: data is no longer expected to be provided in a very specific pivoted form – any structure of the data is now supported.
  • Use of foreign keys allows code lists to be used as part of the specification.
Knowing the structure of the data allows us to bring all datasets into a single structure. This is crucial for comparisons – how can we compare two datasets when their structures differ? When the structure is known, it’s easier to ask questions about the data and to refer to a single data point in it (e.g. “what was the allocated budget for this contract in 2016?”).
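As a rough illustration of how the first and third devices might appear in a package descriptor (the "foreignKeys" shape follows the Table Schema spec, while the "constant" property and all names here are our assumptions for illustration, not confirmed v1 keywords):

```python
# A sketch of a descriptor combining a constant field and a foreign key.
# Key names are illustrative; consult the published spec for the exact
# vocabulary.
descriptor = {
    "name": "example-budget",
    "resources": [{
        "name": "budget",
        "path": "budget.csv",
        "schema": {
            "fields": [
                {"name": "admin_classification", "type": "string"},
                {"name": "amount", "type": "number"},
                # Device 1: a 'constant field' supplies information that
                # is missing from the source file (every row is in EUR).
                {"name": "currency", "type": "string", "constant": "EUR"},
                # A code column whose labels live in a separate code list.
                {"name": "func_code", "type": "string"},
            ],
            # Device 3: a foreign key links the code column to a
            # code-list resource shipped alongside the data file.
            "foreignKeys": [{
                "fields": "func_code",
                "reference": {"resource": "functional-codes",
                              "fields": "code"},
            }],
        },
    }],
}

# Device 2 (unpivoting) would likewise be declared on the resource,
# mapping the source file's pivoted columns onto these logical fields.
```

The point is that all three devices live in the descriptor, so the source files themselves never need to be modified.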

Denormalisation

The second big feature of Version 1 is the introduction of ColumnTypes. ColumnTypes are a lightweight taxonomy for describing the columns of a fiscal data file – that is, not the concepts but their representations. For example, these types are not concerned with ‘Deficit’, ‘Supplier’ or ‘Economic classification’ – these are fiscal concepts. However, when put into a data file, columns such as ‘Supplier last name’ or ‘Title of 2nd level of func. class. in Dutch’ might be used. ColumnTypes are concerned with the data files themselves – and provide a way to extract the concept out of the columns. ColumnTypes can be combined into taxonomies of similarly-themed types. In these taxonomies, it’s possible to define relationships between different types – for example, to indicate that a few ColumnTypes are parts of a more abstract concept. It’s also possible to assign data types and validation rules to a ColumnType, and more. Alongside this specification we’re also releasing two fiscal taxonomies, which serve as standards for publishing budget files and spending files. These can be found here:
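A minimal sketch of the ColumnTypes idea (not the published taxonomies: every type identifier below is made up for illustration): a mapping from physical column headers to types, from which the underlying concept can be recovered.

```python
# Hypothetical ColumnType identifiers for some physical column headers.
# The real taxonomies ship alongside the specification.
COLUMN_TYPES = {
    "Supplier last name": "supplier:name:last",
    "Supplier first name": "supplier:name:first",
    "Title of 2nd level of func. class. in Dutch":
        "functional-classification:level2:label",
    "Approved amount": "value",
}

def concepts(headers):
    """Map a file's physical headers to the abstract concept each one
    represents (here, the first segment of its ColumnType)."""
    return {h: COLUMN_TYPES[h].split(":")[0]
            for h in headers if h in COLUMN_TYPES}

print(concepts(["Supplier last name", "Approved amount"]))
# {'Supplier last name': 'supplier', 'Approved amount': 'value'}
```

However a publisher names or localises a column, the taxonomy lets consumers recognise which fiscal concept it represents.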

What’s next?

This announcement is of a release candidate – we’re looking forward to feedback and to collaborating with the open-data and fiscal-standards communities. We’re planning to update existing tools (such as OpenSpending) and to build new tools to support this specification and provide integrations for other systems. Lastly – all this work wouldn’t have been possible without the support of and collaboration with our partners – chief among them GIFT – Global Initiative for Fiscal Transparency, as well as the International Budget Partnership, Omidyar Network, google.org, The World Bank, the Government of Mexico and many other pilot governments. We thank them all for their generous support in making this work possible. We really believe that Fiscal Data Package is an opportunity for governments and organisations that see the benefit in publishing budgets to foster transparency as part of a liberal democracy. You are invited to join us on this journey, which many government partners such as Croatia, Guatemala, Burkina Faso and Mexico have already started.
It is needed more than ever.

Some Misconceptions about Data Journalism

- October 27, 2016 in advocacy, Data Journalism, Hacktivism, journalism, Open Data

This blog originally appeared on Medium, https://medium.com/@adam.kariv/some-misconceptions-about-data-journalism-8c911e743ef8#.begd19gf4

 

In the past few years, a new discipline in journalism has slowly been gaining more and more followers — a discipline commonly known as ‘data journalism’. These so-called ‘data journalists’ are usually envisioned as the younger, tech-savvy journalists, ones who are not afraid to analyse data, understand how computer code works and simply love those colourful and detailed visualisations.

On the other end of the scale are the non-data-journalists. We usually imagine them still using a phone and Rolodex, as they simply don’t get email — and the last technological leap they made was when mechanical typewriters were replaced by computerised word processors.

Moving away from these simplistic (even stereotypical) dichotomies towards a better understanding of what a data journalist actually looks like will do justice to the actual hard-working data journalists out there, as well as take this movement forward and make it more open and inclusive.

The Python vs. Rolodex dilemma

Let’s begin with the ground truth about the journalism trade: Journalism is all about telling a story, and the best stories are ones that revolve around humans, not numbers.

This basic fact was true a hundred years ago, and is not about to change — even if technology does. For this reason, the best journalists will always be the masters of words; those who have the best understanding of people and what makes them tick. It is the unfortunate truth that the benefit of knowing how to work with data will always come after that.

Don’t get me wrong, there’s certainly a place for all the ‘visualisation-oriented journalists’ (or “visi-journalists”). That’s because sometimes the data is the story. Sometimes, the fact that some new data is available to the public is newsworthy. Sometimes, some hard-to-find, hidden links in a large dataset are the scoop. Sometimes, a subject is so technical and complex that a super-interactive visualisation is the only way to actually explain it. But most times, this is not the case.

So we have, on one end of the spectrum, that old-school journalist with her Rolodex, holding a precious network of high-ranking sources. On the other extreme, a journalist who also codes and wrangles data, trying to find a corruption case by sifting through publicly available data using a custom-made Python script. But in between these two extremes lies a vast range of hard-working journalists, reporting on the day-to-day happenings in politics, the economy, foreign affairs and domestic issues. These journalists don’t have any sources in high places, and have never heard of Python.

Yet, this majority of journalists is mostly ignored by the data journalism movement — which is a shame, as these are the ones most likely to benefit from it and advance it the most.

A website is not a source

Flashback to five years ago — I’m one of the few founding volunteers of an open-data NGO in Israel, “The Public Knowledge Workshop”. One of our first projects was called “The Open Budget” — a website that took the publicly available (but hard-to-understand) national budget data and presented it in a feature-rich, user-friendly way.

At that time, we tried to meet with as many journalists as we could to tell them about the new budget website — and not many would spare an hour of their busy schedules for some geeks from an unknown NGO. We would show them how easy it was to find information and visualise it in an instant. Then we would ask them whether they might consider using our website by themselves for their work.

A common answer, which took me quite by surprise, always went along the lines of “That is very nice indeed but I don’t need your website as I have my sources in the Ministry of Finance and they get me any data I need”. The fact that the data was lying there, within a mouse-click’s reach, and they still wouldn’t use it — simply baffled me. It took me some time to understand why it made perfect sense.

Nevertheless, we would offer ourselves to these journalists as domain experts in understanding and analysing government data (or even knowing where to find that data) — and as volunteer ‘data wranglers’. In theory, it was supposed to be a mutually beneficial relationship: they needed help with getting the right data into their stories, and we were a young NGO, hungry for some media spotlight. In practice, this situation resulted in too many articles where we would do the work but would not be credited for it. Journalists would ask for some budget-related data analysed for an article with a tight deadline. We would do our part, only to find the data attributed in the printed paper to the Ministry of Finance. As annoying as it was, they would always claim that they could not give us credit as “No one knows who you are. We need someone with some credibility”…

Getting an answer is a human thing

So what is the reason, really, that journalists will not use an official government open-data website to get data and to check facts?

I remember one time a journalist calling me with a very simple question:

– ‘Can you tell me the total size of this year’s national budget?’
– ’Sure, but did you try our website? It’s the one single big number right there on the homepage.’
– ‘Umm… there are a few other numbers there. Can you please copy-paste the correct one and send it to me in an email?’

And so I did.

Was that reporter lazy? Perhaps. But it wasn’t just that. As it turns out, it’s not just a matter of credibility — it’s also a matter of attribution. Journalistic reporting is a delicate art of telling a narrative using only “facts”, not the journalist’s own personal opinions. Journalistic facts (which may be just someone else’s opinion) need to always be attributed to someone, be it a person or an organisation.

So you’d get sayings similar to this: ‘according to this NGO, spending on health in the national budget is 20%’. This sort of wording leaves room for other parties to claim the analysis was wrong and the actual number is different. It keeps journalists free from biases — and from accusations of such biases — while still promoting a specific world view.

The only catch is that this only works if they are solely reporting these interpretations — not making them.

Getting the right answer is also a human thing

As time passed and the number of journalists seeking our help constantly grew, a new understanding slowly emerged. We were no longer just the geeks with the best budget data in town, but we became also the geeks that know the most about the intricacies of the budgeting cycle, tenders and procurement processes.

Geeks in action

All of a sudden we were able to answer more vague questions from journalists. Take this question as an example: “how much money is a specific company getting from the government?”. To answer that, you first need to know what the options are to ‘get money from the government’ (there are at least three or four). Then you need to know how to query the data correctly to find the actual data rows that answer the question. You might find that a single company is in fact more than one legal entity. You could discover that it goes by different names in different data sources. Some data sources might contain partly overlapping data. And after all that work you still need to produce an answer that is (most likely) correct and that you can wholeheartedly stand behind.

Getting to such a level of expertise is not something that happens in a day. This is another reason why open-data portals are simply not that useful for journalists. Even if the journalist has a clue as to which dataset contains the answer to her question — which is rarely the case, nor is it guaranteed that a single dataset holds the answer — it’s not enough to see the data; you need to make sense of it. You need to understand the context. You need to know what it really means — and for that, you need an expert.

When Open Data takes the Lead

With deep knowledge of the data come interesting findings. Most are standard cases of negligence with public funds. Some are interesting insights regarding money flows that are only visible when analysing the ‘big picture’. Only rarely do you find small acts of corruption. We believed that each of these findings was newsworthy, and we would try to find journalists who might take our leads and develop them into a complete story.

But hard as we tried, our efforts were in vain — none of the methods we tried seemed to work. We tweeted our findings, wrote about them on our blog, pushed them hard through Facebook — we even got a Telegram bot pushing algorithmically detected suspicious procurements in real time! But journalists were not impressed.

On other occasions, we managed to get a specific journalist interested in a story. The only problem was that sometimes they would hold on to that piece of information for weeks without doing anything with it until it became irrelevant — thus losing our chance to use it anywhere else.

At that point we decided to get some help from an expert, and hired a PR manager to help our efforts to get the message across. Seeing him work with journalists left me in awe: his ability to match the right story to the right person, to ensure we were always credited properly, and to get stories written promptly was something we’d never seen. And the best part was how he leveraged his many connections to make journalists come to us for the next story instead of the other way round.

But he also made us change our ways a little bit — as good leads needed to be kept secret until a good match was found. Exclusivity and patience bought us larger media coverage and a wider reach — but at the price of compromising on our open-data and transparency ideologies.

Data is a Source

Back to present day.

We still meet journalists on a regular basis, and although it’s now easier to get their attention, most of them still start our meetings with a skeptical approach. They look as if they’re wondering ‘what are they trying to sell me?’ and ‘how on earth could these geeks have anything to do with my work?’.

But then we start talking — first we tell them about our different projects and areas of expertise, and the conversation flows to what they’re interested in: what are the ideas they’re trying to promote? Which big projects have they always dreamt of doing but never had the data for? They tell us about all their attempts to get data from the government through FOIA requests that ended in hitting brick walls.

That’s usually the point where I take out my laptop. They seem baffled when I start typing a few SQL commands into my terminal, and utterly surprised when, after two or three minutes, I present them with a graph of what they were looking for. “Wow, I didn’t know it was even possible… and all of that just from data that’s out there?” they say, with a smile and a new sparkle in their eyes. And that’s when I know — a new data journalist has been born.

Every once in a while, a beautifully interactive data visualisation project is published by one of the media outlets. Everybody applauds the “innovative use of the medium” and the “fine example of data journalism” — and I’m impressed too! — but to me this simply overlooks all the other journalists who made that leap into the world of data.

These journalists understand that leads come not just from sources in the government, but also from algorithms analysing CSV files. They cautiously learn to link to the government data portals as proof for their claims. They take data and make it a part of their story.

These are the true heroes of the data-journalism revolution. And the motto of this revolution cannot be ‘Visualise More!’ or ‘Use Big Data!’ — it must be: ‘Data is a Source’.


Thanks to Paul Walsh for the encouragement and to Nir Hirshman for being that awesome PR guy…