De-anonymising Ukraine university entrance test results

Vadym Hudyma - May 26, 2017 in Data Blog, Ukraine

Authors: Vadym Hudyma, Pavlo Myronov. Part 1 of a series on Ukrainian student data.

Introduction

External Independent Evaluation Testing (EIT) is a single exam used nationwide for admission to all public universities. As detailed in our previous article, the release of a poorly anonymised dataset by the organisation in charge of the EIT resulted in serious risks to the privacy of Ukrainian students. One of those was the risk of unwanted mass disclosure of personal information with the help of a single additional dataset. We detail below how we reached our results. The EIT dataset contains the following dimensions:
  • Unique identifier for every person
  • Year of birth
  • Sex
  • Test scores for every subject taken by the student (exact to decimals for those who got 95% or more of the possible points)
  • Place where the tests were taken
On the other hand, the dataset we used to de-anonymise the EIT results was collected from the website vstup.info, and it gives us access to the following elements:
  • family name and initials of the applicant (also referred to below as the name)
  • university where the applicant was accepted
  • the combined EIT result scores per required subject, with a multiplier applied to each subject by the universities, depending on their priorities.
At first glance, since every university uses its own list of subject-specific multipliers to create the combined EIT results of applicants, it should be impossible to know their exact EIT scores or to find matches with exact scores in the EIT dataset. The only problem with that reasoning is that the law requires all the multipliers to be published on the same website as part of a corruption prevention mechanism. And this is good. But it also gives attackers enough data to recalculate the combined results and find exact matches between the two datasets.
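To make the role of the published multipliers concrete, here is a minimal sketch of how a combined rating could be recomputed from raw subject scores. It assumes the combined rating is a plain weighted sum of subject scores; the real formula may include further components. The subject names, scores and multiplier values are invented for illustration and do not come from either dataset.

```python
from decimal import Decimal


def combined_rating(scores, multipliers):
    """Weighted sum of subject scores using one programme's multipliers.

    Returns None when the applicant did not take a required subject.
    Decimal avoids float rounding noise, which matters for exact matching.
    """
    if not set(multipliers) <= set(scores):
        return None
    total = sum(
        (Decimal(str(scores[subject])) * Decimal(str(weight))
         for subject, weight in multipliers.items()),
        Decimal(0),
    )
    return total.quantize(Decimal("0.001"))


# Hypothetical test-taker and hypothetical programme multipliers.
scores = {"math": 180.5, "ukrainian": 172.0, "physics": 165.0}
multipliers = {"math": 0.5, "ukrainian": 0.3, "physics": 0.2}

print(combined_rating(scores, multipliers))
```

Because both the multipliers and the final ratings are public, anyone holding the raw scores can recompute the same number for every programme and look for identical values.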

How we did it

Our calculations were based on the assumption that every EIT participant applied to universities in their home region. Of course, this assumption may not hold for every participant, but it is usually the case, and it is also one of the easiest ways to reduce the complexity of the calculations.

For every Ukrainian region, we isolated in the EIT dataset (A1) a subset of local test-takers and calculated the EIT ratings they would have had if they had applied for every speciality at local universities. Then we merged this dataset of “potential enrollees” with the dataset of real enrollees from vstup.info (B1), which contains the enrollees’ real names and their final ratings (that is, scores multiplied by subject- and university-specific multipliers), joining on university, speciality and rating. By joining these datasets for every region, we obtained the first set of pairs in which test-takers’ ids correspond to enrollees’ names (dataset A1B1). In the resulting set, the number of EIT participants who correspond to only one name, i.e. those who can be unambiguously identified, is 20 637 (7.7% of all participants).

To expand the scope of our de-anonymization, we used the fact that most enrollees try to increase their chances of being accepted by applying to several universities. We therefore tested all pairs from the first merged dataset (A1B1) against the whole dataset of enrollees (B1), counting the number of matches by final rating for every pair. Then we kept the pairs that were matched on at least two unique values of the EIT rating. If the same match occurs in two cases where different university or speciality coefficients were used to form the aggregate EIT rating, it is much less likely to be a “false positive”. This gave us a dataset in which each EIT participant’s id corresponds to one or more names, with the number of unique EIT rating values recorded for every correspondence (C1).

In this dataset, the number of EIT participants (unique identifiers from A1) who correspond to only one name with more than one unique aggregate rating is 50 845 (18.97%).

We also noticed a possible source of false positives: the same family name and initials from the enrollees dataset (B1) may correspond to several ids from the EIT participants dataset (A1). This doesn’t necessarily mean we guessed the test-taker’s family name wrongly, especially in the case of a rather common family name. The more widespread a name is, the higher the probability that we have correctly identified several EIT participants with the same name. Still, it leaves the possibility of some false positives. To separate the most reliable results from the others, we identified correspondences with unique names and counted the records where a unique id corresponds to a unique name. The results of our de-anonymization are summarised in the following table.
Assumptions | De-anonymized EIT participants with unique names | De-anonymized EIT participants (regardless of name uniqueness)
1) Every enrollee applied to at least one university in his/her region. | 8 231 (3.07%) | 20 637 (7.7%)
1) + Every enrollee applied to at least two specialities with different coefficients. | 31 418 (11.42%) | 50 845 (18.97%)
In each row, false positives can occur only if some of the enrollees broke the basic assumption(s).

So far we have been speaking about unambiguous identification of test-takers. But even narrowing the results to a small number of candidates makes subsequent identification trivial for anyone with background knowledge or other available datasets. In the end, we narrowed the candidates down to 10 or fewer possible names for 43 825 EIT participants. Moreover, we established only 2 possible names for 19 976 test-takers.

Our method provides an assumed name (or names) for every EIT participant who applied to a university in the region where they took their tests, and who applied to at least two specialities with different multipliers. Though not 100% free from false positives, the results are precise enough to show that the external testing dataset provides all the identifiers necessary to de-anonymize a significant share of test-takers.

Of course, those who have a personal or business interest, rather than a purely research interest, in test-takers’ identities or enrollees’ external testing results would find multiple ways to make de-anonymization even more precise and wider in scope. (NOTE: For example, one can cluster the rating coefficients of each speciality to decrease the number of calculations while dropping our basic assumption. It is also possible to take into account the locations of EIT centres and assume that test-takers would probably try to enrol at universities in nearby regions, or to estimate the real popularity of names among enrollees using the API of the social network “Vkontakte”, and so on.)

Using comparatively simple R algorithms and an old HP laptop, we found more than 20 637 exact matches (7.7% of all EIT participants), re-identifying the individuals behind anonymized records. And more than 40 thousand participants were effectively de-anonymised with less than perfect precision, but more than good enough for a motivated attacker.
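The matching procedure described in this section can be sketched roughly as follows. This is a simplified illustration in Python, not the R code we actually ran: all records, names, universities and multipliers are synthetic, the field names are hypothetical, and the second step is reduced to counting distinct matching ratings within a single region.

```python
from collections import defaultdict


def potential_ratings(test_takers, programmes):
    """Yield ((university, speciality, rating), id) for every local programme."""
    for taker in test_takers:
        for prog in programmes:
            if prog["region"] != taker["region"]:
                continue  # basic assumption: applicants stay in their region
            if not set(prog["multipliers"]) <= set(taker["scores"]):
                continue  # a required subject was not taken
            rating = round(sum(taker["scores"][s] * m
                               for s, m in prog["multipliers"].items()), 3)
            yield (prog["university"], prog["speciality"], rating), taker["id"]


def match(test_takers, programmes, enrollees):
    """Map each test-taker id to {name: set of distinct ratings that matched}."""
    names_by_key = defaultdict(set)
    for e in enrollees:
        key = (e["university"], e["speciality"], round(e["rating"], 3))
        names_by_key[key].add(e["name"])
    matches = defaultdict(lambda: defaultdict(set))
    for key, taker_id in potential_ratings(test_takers, programmes):
        for name in names_by_key.get(key, ()):
            matches[taker_id][name].add(key[2])
    return matches


def unambiguous(matches, min_ratings=1):
    """Ids left with exactly one name backed by at least `min_ratings` ratings."""
    result = {}
    for taker_id, names in matches.items():
        strong = [n for n, ratings in names.items() if len(ratings) >= min_ratings]
        if len(strong) == 1:
            result[taker_id] = strong[0]
    return result


# Synthetic example data (not taken from either real dataset).
test_takers = [
    {"id": 1, "region": "R1", "scores": {"math": 180.0, "ukr": 170.0}},
    {"id": 2, "region": "R1", "scores": {"math": 170.0, "ukr": 180.0}},
]
programmes = [
    {"region": "R1", "university": "U1", "speciality": "CS",
     "multipliers": {"math": 0.5, "ukr": 0.5}},
    {"region": "R1", "university": "U2", "speciality": "Law",
     "multipliers": {"math": 0.25, "ukr": 0.75}},
]
enrollees = [
    {"name": "Petrenko O.I.", "university": "U1", "speciality": "CS", "rating": 175.0},
    {"name": "Petrenko O.I.", "university": "U2", "speciality": "Law", "rating": 172.5},
    {"name": "Kovalenko M.V.", "university": "U1", "speciality": "CS", "rating": 175.0},
    {"name": "Kovalenko M.V.", "university": "U2", "speciality": "Law", "rating": 177.5},
]

pairs = match(test_takers, programmes, enrollees)
print(unambiguous(pairs))                 # one matching rating is enough
print(unambiguous(pairs, min_ratings=2))  # require two distinct ratings
```

In this toy example both test-takers reach the same rating at one programme, so a single match leaves each id with two candidate names. Requiring two distinct matching ratings resolves the ambiguity, which is the effect the second row of the table captures.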

What could be done about it?

After conducting our initial investigation, we reached out to CEQA for comments. This was their response:

“Among other things, Ukraine struggles with high level of public distrust to government institutions. By publishing information about standardized external assessment results and the work we deliver, we try to lead by example and show our openness and readiness for public scrutiny… At the same time, we understand that Ukraine has not yet formed a mature culture of robust data analysis and interpretation. Therefore, it is essential to be aware of all risks and think in advance about ways to mitigate adverse impact on individuals and the education system in general.”

So what could be done better with this particular dataset to mitigate at least the above-mentioned risks while preserving its obvious research value? Well, a lot.

First of all, a part of the problem that is easy to fix is the exact test scores. Simply rounding them and bucketing them into small ranges (for example, 172 for the range from 171 to 173, 155 for the range from 154 to 156, and so on) would make them reasonably k-anonymous. Whilst this wouldn’t make massive de-anonymization impossible, it could seriously reduce both the number of possible attack vectors and the precision of these breaches. “Barnardisation” (randomly adding 1 or -1 to each score) would also do the trick, though it should be combined with other anonymisation techniques.

The problem with background knowledge (as in the “nosy neighbour” scenario) is that it is impossible to mitigate without removing a huge number of outliers and specific cases, such as small schools or uncommon test subjects in small communities, and without using large steps when bucketing scores or generalising test locations. Some educational experts have raised concerns about the resulting loss in precision.
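The rounding and Barnardisation ideas mentioned above can be sketched in a few lines. This is only an illustration under our own assumptions: the bucket grid (3-point buckets starting at multiples of 3), the field names and the sample records are hypothetical, and a real release would need a proper disclosure-risk assessment rather than this simple check.

```python
import math
import random
from collections import Counter


def bucket_score(score, width=3):
    """Replace a score by the centre of its bucket, e.g. 171..173 -> 172."""
    lower = math.floor(score / width) * width
    return lower + width // 2


def barnardise(score, rng):
    """Randomly add 1 or -1 to a score."""
    return score + rng.choice((-1, 1))


def smallest_group(records, keys):
    """Size of the rarest combination of quasi-identifiers.

    A dataset is k-anonymous for these keys when this value is at least k.
    """
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return min(counts.values())


# Synthetic records, for illustration only.
records = [
    {"birth_year": 1999, "sex": "F", "score": 171},
    {"birth_year": 1999, "sex": "F", "score": 172},
    {"birth_year": 1999, "sex": "F", "score": 173},
    {"birth_year": 1999, "sex": "F", "score": 173},
]
keys = ("birth_year", "sex", "score")
coarse = [{**r, "score": bucket_score(r["score"])} for r in records]

print(smallest_group(records, keys))  # exact scores: some records are unique
print(smallest_group(coarse, keys))   # bucketed scores: all four share a group

rng = random.Random(0)
print([barnardise(r["score"], rng) for r in records])
```

With exact scores, two of the four records are unique on (birth year, sex, score). After bucketing, all four fall into the same group, so none of them can be singled out by score alone.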
Still, CEQA could consider releasing a dataset with generalised data and some added noise, and giving researchers more detailed information under a non-disclosure agreement. This “partial release/controlled disclosure” scheme could also help to deal with the alarming problem of school ratings. For example, generalising the testing location from exact places to school districts or even regions would probably help. Usually, local media wouldn’t be interested in comparing EIT results outside their audience locations, and national media is much more reluctant to publish stories about differences in educational results between regions, for obvious discrimination and defamation concerns.

This kind of attack is not very dangerous at this particular moment in Ukraine: we don’t have a huge data-broker market (as in the US or UK), and our HR and insurance companies do not (yet) use sophisticated algorithms to determine the fate of people’s job applications or the final cost of their life insurance. But the situation is changing quickly, and this kind of sensitive personal data, which isn’t worth much at this point, could easily be exploited at any moment in the near future. And both the speed and the low cost of this kind of attack make this dataset very low-hanging fruit.

Conclusions

The current state of affairs in personal data protection in Ukraine, as well as the workload of the responsible staff in government, doesn’t leave much hope for a swift change in any of the already released datasets. Still, this case clearly demonstrates that anonymisation is a really hard problem to tackle, and that the benefits of microdata disclosure can quite easily be outweighed by the risks of unwanted personal data disclosure. So all open data activists advocating for the disclosure of as much information as possible, as well as the government agencies responsible for releasing such sensitive datasets, should put serious effort into figuring out the possible privacy-related risks.

We hope that our work will be helpful not just for future releases of external testing results, but for the wider open data community, both in Ukraine and throughout the world.

Open Data Index in Brazil launched! by FGV and Open Knowledge Brazil

Open Knowledge Brazil - May 25, 2017 in network, Open Data, Open Data Index

Open Knowledge Brazil and Fundação Getúlio Vargas (FGV), a higher education institution in Brazil, worked together to develop the Brazilian edition of the Open Data Index, which is being used by governments as a tool to enhance public management, and to bring it even closer to Brazil’s reality.

About the Open Data Index

The Brazilian edition of the Open Data Index has been used as a tool to set priorities regarding transparency and open data policies, as well as a pressure mechanism used by civil society to encourage governments to enhance their performance, releasing sets of essential data. The indicator is based on data availability and accessibility across 13 key categories, including government spending, election results, public acquisitions, pollution levels, water quality data, land ownership, and climate data, among others. Submissions are peer reviewed and verified by a local team of data experts and reviewers. Points are assigned based on the conclusions reached through this process.

OK Brazil and FGV Partnership 

Through a series of events, Open Knowledge Brazil (OKBR) and FGV’s Department of Public Policy Analysis (DAPP) jointly launched the Brazilian edition of the Open Data Index (ODI), a civil society initiative designed to assess the state of open government data worldwide. Three assessments were established for Brazil through a joint effort between the two institutions:
  1. Open Data Index (ODI) for Brazil, at the national level
  2. ODI São Paulo, at the municipal level
  3. ODI Rio de Janeiro, also at the municipal level
The last two are part of a pioneering initiative, since these are the first regional ODIs in Brazil, in addition to the nationwide assessment. 
This partnership with OKBr and the development of the Open Data Index complement DAPP’s life-long efforts in the areas of political and budget transparency, featuring widely recognised tools such as the Budget Mosaic and Transparent Chamber. We believe that public debate can only be qualified through data transparency, social engagement and dialogue within network society –  Marco Aurelio Ruediger, director of DAPP

The two institutions are working to develop the indicator used by governments across 122 countries as a tool to enhance public management and bring it even closer to Brazil’s reality. The goal is for data disclosure to promote institutional development by encouraging transparency within the government’s foundations, achieved both through constant scrutiny by civil society and improvements implemented by administrators regarding the quality and access to information.
Among the practical results of this new effort for society is the possibility of using results to develop and monitor public policies regarding transparency and open data – Ariel Kogan, CEO of OKBR

Open Data Index for Brazil 

The Open Data Index for Brazil, launched on April 27 in Brasilia, revealed that the country is in 8th place in the world ranking, tied with the United States and Latvia, and that it leads among its neighbours in Latin America. In total, 15 dimensions related to themes such as public spending, environment and legislation were analysed. However, the overall score of 64% indicates that there is still a lot of room for improvement. Only six of the index’s dimensions (40%) received the full score, that is, they were considered totally open: Public Budget, Electoral Results, National Maps, Socioeconomic Statistics, Laws in Force and Legislative Activity. On the other hand, no public databases were found for three of the dimensions surveyed: Locations, Water Quality and Land Ownership.

Open Data Index for Cities – São Paulo

The ODI São Paulo, launched two days earlier, had a similar result. In the overall assessment, the municipality had a positive result in the index, with 75% of the total score. Within the index analysis dimensions, 7 of the 18 evaluated databases obtained a maximum score: this means that 38% of the databases for the city were considered fully open. On the other hand, the Land Ownership dimension was evaluated with 0%, due to the unavailability of data; and another four had a score lower than 50% (Business Register, Water Quality and Weather Forecast).

Open Data Index for Cities – Rio de Janeiro

The ODI Rio de Janeiro [report in Portuguese], released on May 4, showed a slightly different performance. The city of Rio de Janeiro had a high overall score, reaching 80%. The study indicates, however, that only five dimensions (Election Results, City Maps, Administrative Limits, Criminal Statistics and Public Schools) had an individual score of 100%, with only 27% of the databases being considered fully open. Incompleteness of the dataset was flagged six times, i.e. certain information considered essential was not available. The issue of access restriction appears only in the Business Register dimension. The Land Ownership dimension is also considered critical, since there is no data available for carrying out the ODI assessment.

In summary, it is believed that the information can be useful for an open data policy at the municipal and federal level, providing paths for the replication of good practices and the correction of points of attention. The benefits of an open data policy are innumerable and include increasing management efficiency, creating an instrument for holding public administration to account for its results, promoting accountability and social control, engaging civil society with public management and improving the public image, with the potential of becoming an international reference.

Photographs of Sea Stars (1917)

Adam Green - May 24, 2017 in Ludwig Heinrich Philipp Döderlein, marine life, ocean, sea, sea stars, Siboga Expedition, starfish

Strangely alluring images from a report by German zoologist Ludwig Heinrich Philipp Döderlein on starfish collected during the Siboga Expedition around Indonesia.

Datensummit: Advancing open data in Germany

Lieke Ploeger - May 24, 2017 in OK Germany

Last month Open Knowledge Germany hosted the first Datensummit, a two-day festival for those who shape development within the fields of open data, transparency, data literacy and civic tech. With OK Germany having existed for over five years, it was a good moment both to look back on developments in open data, civic tech, transparency and civil participation in Germany and, more importantly, to bring the community together and inspire ideas on how to advance open data in Germany even further.

Datensummit 2017 - Day 1 at the BMVI (Photo: Leonard Wolf)

During the first day, at the German Federal Ministry of Transport and Digital Infrastructure (BMVI, also a sponsor of the event), the focus was on fostering interdisciplinary exchange with politicians and public administration, with talks by OK Germany staff and international speakers. The second day was structured in an unconference format, with opportunities to exchange ideas and to develop and plan new open data projects in barcamps and workshops. The impressive programme of the first day attracted nearly 300 participants. Nadine Stammen, part of the organising team of the Datensummit, shares more information on the speakers:

The location of the German Ministry was strategically chosen, precisely to encourage further collaboration between government and CSOs advocating for open data. As Christian Heise, Chairman of the Board of OK Germany, stated:
The Datensummit 2017 has shown how government and civil society can work together to demonstrate why open data and open knowledge are useful to society as a whole, and that intransparent governmental and administrative action is no longer an option.

Elisa Lindinger, a member of the OK Germany team, talks about the contact with the Ministry, and the strength of the German open data community:

The organising team also cleverly encouraged participants from different backgrounds to mix and talk to each other: during registration, all participants received three coloured bracelets based on the type of organisation they work for (for example, pink for NGOs and green for government representatives). Whenever you talked to someone, you could swap bracelets, with the aim, of course, of ending up with as many different colours as possible.

OK Germany showcased the breadth of the open data field that they are working on, with staff presenting their work on projects around freedom of information and politics (such as FragdenStaat.de, a platform through which people can easily submit FOIA requests in Germany), the economic potential of open data and a summary of the current state of the Code for Germany community (which brings together developers, designers and those interested in open data in 25 local groups across Germany). Under ‘Civic Tech inspirations’, winners of the first round of the Prototype Fund (a publicly funded programme for non-profit software in civic tech, data literacy, and data security in Germany) showcased their projects, and the Datenschule, the German branch of the School of Data, brought together representatives from the international School of Data Network to discuss data literacy approaches and digital NGO projects. Elisa Lindinger shared her thoughts on the current state of open data in Germany:
Datensummit 2017 - Day 2 at betahaus

In addition, invited external speakers added valuable perspectives on data: from insights around ethical data handling (Zara Rahman – About people, data and good intentions), engaging volunteers in analysing data on human rights violations (Milena Marin on the Amnesty Decoders project) and the value of a German transparency register for investigating tax evasion and money laundering (Vanessa Wormer on her work on the Panama Papers) to the beauty and potential of hand-drawn data visualisations to make data more accessible and understandable (Stefanie Posavec – Reflections on Dear Data). You can watch all talks of the first day on this YouTube channel; the German blog report of the event is available from the Open Knowledge Germany blog.

Measuring the Openness of Government Data in the Balkans

Blina Meta - May 24, 2017 in Global Open Data Index

Open Data Kosovo is a civic-tech organization that uses technology to contribute towards social good. The organization has created an exciting network of partners, both local and international, while working on projects related to visualizing procurement data, mapping satellite imagery for human rights violations, data collection and entry of 112 emergency calls, countering violent extremism online, providing digital solutions to public institutions, index measurement of the degree of openness of public institutions, visualizing election data, growth of the female coders community, and more. This portfolio made us a trustworthy candidate for the next task from Open Knowledge International: measuring the state of openness of government data for the countries in South Eastern Europe: Bulgaria, Macedonia, Serbia, Kosovo, Croatia, Albania, Slovenia, Bosnia and Herzegovina, Romania, Montenegro.

We agreed to the task, and thereby the journey of measuring the openness of the South Eastern European countries began. We had a two-month submission period, which at first glance looked like enough time, but that’s always a tricky perspective. The first weeks went relatively calmly: we dug up some old contacts in various countries and reached out to our partners and friends who would be interested in submitting to the index. We received positive replies from most of them and I felt calm and confident, but I also had an instinct that only comes from experience with crowdsourcing contributions, so obviously I had a plan B. We asked for help from Arianit Dobroshi, a longstanding friend of Open Data Kosovo who is excited about mapping, openness, and general digital goodness. His task was to help us with the submissions, fill in on whatever country-specific problems might arise, and make sure tasks were completed. Time was passing and pressure was rising, and there were very few submissions on the index.

It was the end of the year, so I started to receive staff emails about planned vacations. This triggered an emergency alert in me: I panicked, and did what a modern woman does when she panics: I took a break and procrastinated even further for an hour or two. Then I pulled myself together and started contacting our friends from the region. First on the list was Zoran Luša, Senior IT Adviser, Ministry of Public Administration of the Republic of Croatia. Zoran was immediately up for the task and invited his colleague Anamarija Musa to join in the efforts. Croatia had never been measured before, so they needed to do it from scratch. Not an easy task, so we asked for some extra help just in case. We contacted Miroslav Schlossberg from CodeForCroatia, who promptly informed us that they were supposed to do a sprint to evaluate local cities, so they included contributions to GODI 2016 in there. The mix was perfect: these people are serious in their digital contributions and the kind of people you want to work with. Croatia was covered.

Parallel to the Global Open Data Index 2016, I was managing an EU-funded project that did thorough index research on the openness of public institutions in the Western Balkan countries. This project is implemented with a regional network of organizations called ACTIONSEE. So I reached out to our friends from this network one by one.
  • In Serbia, we contacted our great friends from the local organization CRTA. We work with them in many exciting projects and they are always very thrilled to be part of initiatives that combine transparency and technology. Pavle Dimitrij was quick to jump on board and promised timely and accurate submissions for Serbia. Slobodan Marković reached out to us and was interested to participate, so we had two parties involved and a team at the office to make sure it goes smoothly: Serbia was covered.
  • When you think internet and government in Macedonia, you think of the Metamorphosis foundation. They are the leaders in their field, so of course we reached out to them. Tamara Resavska and Goran Rizaov rose to the challenge: Macedonia was covered.
  • Next, we contacted our friends in Albania, the organisation MJAFT. We discussed a couple of common national problems in sweet Albanian and agreed that this index submission is important. Ms. Xheni Lame promised to submit, and so she did.
  • Lastly, the Montenegro submission was agreed upon with our friends from CDT, where Milena Gvozdenovic memorably said “I find this Index very interesting and valuable. Therefore, we’ll complete the survey within the deadline.”
The remaining countries were mostly filled out by the team at Open Data Kosovo: that’s how the index submission was completed, and how the community was wrangled.
The results are out today and I can’t help but feel sad about the low score of Kosovo, ranked #56 out of 94 countries with a score of 29%. Currently, we are living with very bad environmental pollution, and having open data related to the environment would surely be a good step towards advocating for improvement. Furthermore, Kosovo does have some budgetary information, but it is presented in low quality and not in an open data format, which further decreased our score. In fact, all the Balkan countries seem to line up together at the bottom of the list, sharing similar openness problems and challenges.

It’s been a great experience working with Open Knowledge International and acting as Community Wrangler. I learned a lot about the state of open data in the region, but I also established a network of like-minded individuals who care about having transparent countries, who are eager to see them rank higher, who thrill at seeing improvement and want to contribute towards it. I am looking forward to being part of it again next year!

Measuring the Openness of Government Data in the Balkans

Blina Meta - May 24, 2017 in Global Open Data Index

Open Data Kosovo is a civic-tech organization that uses technology to contribute towards social good. The organization has created an exciting network of partners both local and international while working on projects related to visualizing procurement data, mapping satellite imagery for human rights violations, data collection and entry of 112 emergency calls, countering violent extremism online, providing digital solutions to public institutions, index measurement of the degree of openness of public institutions, visualizing election data, growth of the female coders community, and more. This portfolio made us a trustworthy candidate for the next task from Open Knowledge International, measuring the state of openness of government data for the countries in South Eastern Europe: Bulgaria, Macedonia, Serbia, Kosovo, Croatia, Albania, Slovenia, Bosnia and Herzegovina, Romania, Montenegro.   We agreed to the task, and thereby the journey of measuring the openness of the Southern Europe countries began. We had a two month period of submissions time, which at first glance looked like enough time but that’s always a tricky perspective. The first weeks went relatively calm: we dug up some old contacts in various countries and reached out to our partners and friends who would be interested in submitting to the index. We received positive replies by most of them and I felt calm and confident, but I also had an instinct that is only created by experience of crowdsourcing contributions, so obviously I had a plan B. We asked for help from Arianit Dobroshi, a longstanding friend of Open Data Kosovo who is excited about mapping, openness, and general digital goodness. His task was to help us with the submissions, fill out on whatever country-specific problems may there arise, and make sure tasks are completed. Time was passing and pressure was rising, and there were very few submissions on the index. 
It was the end of the year so I started to receive staff emails of planned vacations. This triggered an emergency alert on me: I panicked, and did what a modern woman does when they panic: I took a break and procrastinated even further for an hour or two. Then I pulled myself together and started contacting our friends from the region. First on the list was Zoran Luša, Senior IT Adviser, Ministry of Public Administration of the Republic of Croatia. Zoran immediately was up for the task and invited his colleague Anamarija Musa to join in the efforts. Croatia was never measured before so they needed to do it from scratch. Not an easy task, so we asked for some extra help just in case. We contacted Miroslav Schlossberg from CodeForCroatia, who promptly informed us that they were supposed to do a sprint to evaluate local cities so they included contributions to GODI 2016 in there. The mix was perfect: these people are serious in their digital contributions and the kind of people you want to work with. Croatia was covered. Parallel to the Global Open Data Index 2016, I was managing an EU-funded project that did a thorough index research for the openness of public institutions in the Western Balkan countries. This project is implemented with a regional network of organizations called ACTIONSEE. So I reached out to our friends from this network one by one.
  • In Serbia, we contacted our great friends from the local organization CRTA. We work with them in many exciting projects and they are always very thrilled to be part of initiatives that combine transparency and technology. Pavle Dimitrij was quick to jump on board and promised timely and accurate submissions for Serbia. Slobodan Marković reached out to us and was interested to participate, so we had two parties involved and a team at the office to make sure it goes smoothly: Serbia was covered.
  • When you think internet and government in Macedonia, you think of the Metamorphosis foundation. They are the leaders in their field, so of course we reached out to them. Tamara Resavska and Goran Rizaov rose to the challenge: Macedonia was covered.
  • Next, we contacted our friends in Albania, the organisation MJAFT. We discussed a couple of common national problems in sweet Albanian and agreed that this index submission is important. Ms. Xheni Lame promised to submit, and so she did.
  • Lastly, the Montenegro submission was agreed upon with our friends from CDT, where Milena Gvozdenovic memorably said “I find this Index very interesting and valuable. Therefore, we’ll complete the survey within the deadline.” The remaining countries were mostly filled out by the team at Open Data Kosovo: that’s how the index submission was completed, and how the community was wrangled.
The results are out today and I can’t help but feel sad about the low score of Kosovo, ranked #56 out of 94 countries with a score of 29%. We are currently living with very bad environmental pollution, and having open data related to the environment would surely be a good step towards advocating for improvement. Furthermore, Kosovo does have some budgetary information, but it is presented in low quality and not in an open data format, which further decreased our score. In fact, all the Balkan countries seem to line up together at the bottom of the list, sharing similar openness problems and challenges.

It’s been a great experience working with Open Knowledge International and acting as Community Wrangler. I learned a lot about the state of open data in the region, but I also established a network of like-minded individuals who care about having transparent countries, who are eager to see them rank higher, who thrill at seeing improvement and want to contribute towards it. I am looking forward to being part of it again next year!

Balneário Camboriú is the first municipality to sign the Gastos Abertos Commitment Letter

Elza Maria Albuquerque - May 24, 2017 in Destaque, Gastos Abertos

The mayor of Balneário Camboriú signs the Transparency Commitment Letter, a Gastos Abertos initiative. Photo: Prefeitura Balneário Camboriú.

This Tuesday (23/05), the mayor of Balneário Camboriú (SC), Fabrício Oliveira, signed the Transparency Commitment Letter of Gastos Abertos (a movement to connect public money with citizens through training, legal instruments and political engagement). He is the first Brazilian mayor to sign the document. By doing so, he commits to a transparency agenda in practice. This means he will have to carry out concrete actions that give citizens better and broader access to the city’s budget data.

According to mayor Fabrício Oliveira, signing the Commitment Letter will make it possible “to bring together the tools and knowledge of Open Knowledge in handling a large amount of City Hall data, organising it so as to make access easier for citizens through the Transparency Portal, and adapting the municipal government to the new era of total transparency that society demands.”

The initiative was coordinated by Gabriel Pimentel, volunteer local leader of Gastos Abertos at Open Knowledge Brasil. The action is part of the third mission of cycle 1 of Gastos Abertos. “It was very good to take part in Gastos Abertos and to get this response with the signing of the Commitment Letter. I learned a lot. When I started the project, I did not expect it to reach this scale. In this process, the partnership with the Observatório Social de Balneário de Camboriú, with Antônio Cotrim, and the support of Sustenta-habilidade, a Univali extension project, were very important,” says Gabriel.

Thiago Rondon, coordinator of Gastos Abertos, highlights the importance of the action. “The results in Balneário Camboriú are valuable, because together with the other cities taking part in this cycle they are helping to build an increasingly effective and scalable methodology, contributing to a social technology capable of making transparency accessible to everyone.”

Besides Gabriel, the signing of the letter was attended by the Secretary for Government Control and Public Transparency, Victor Hugo Domingues; by Ricardo Stanziola Vieira and Charles Alexandre Souza Armada, professors of Univali’s Extension and Leadership Training Project for Socio-environmental Governance (Projeto Sustenta-Habilidade); and by city councillor Lucas Gotardo, chair of the Commission for Transparency and Public Governance.

The lost privacy of Ukrainian students: a story of bad anonymisation

Vadym Hudyma - May 23, 2017 in Data Blog, Ukraine

Authors: Vadym Hudyma, Pavlo Myronov. Part 1 of a series on Ukrainian student data.

Introduction

Ukraine has long been plagued by corruption in university admissions, owing to a complicated and opaque admission process, especially for state-funded seats. To get one, would-be students needed not just good school grades (which were themselves subject to manipulation), but usually also connections or bribes to university admission boards. Consequently, the adoption of External Independent Evaluation Testing (EIT) as the primary criterion for admission into universities is considered one of a handful of successful anti-corruption reforms in Ukraine. External independent evaluation is conducted once a year for a number of subjects, and anyone with a school diploma can take part. It is supervised by an independent government body (CEQA – Center for Educational Quality Assessment) with no direct links to either the school system or the major universities. All participant names are replaced with a unique code to protect results from forgery. The EIT has not eradicated corruption, but it has reduced it to a negligible level in the university admissions system. While its impact on the school curriculum and evaluation is, and should be, critically discussed, its success in giving bright students a chance to choose between the best Ukrainian universities is beyond doubt. It also provides researchers and the general public with a very good tool to understand, at least on some level, what is going on in secondary education, based on a unique dataset of country-wide results of university admission tests. Obviously, it is also crucial that the results of the admission tests, a potentially life-changing endeavour, are held as privately and securely as possible.
Which is why we were struck when the Ukrainian Center for Educational Quality Assessment (CEQA), which is also responsible for collecting and managing the EIT data, released this August a huge dataset of independent testing results from 2016. In this case, the dataset includes individual records. Although the names and surnames of participants were de-identified using randomly assigned characters, the dataset was still full of other entries that could be linked to specific individuals. Those include the exact scores (with decimals) of every test subject taken, the birth year of each participant, their gender, whether they graduated this year or not and, most damning, the name of the place where each subject of the external examination was taken, which is usually the school at which the participant got their secondary education.

I. Happy Experts

Of course, the first reaction from the Ukrainian Open Data community was overwhelmingly positive, helped by the fact that previous releases of EIT datasets were frustrating in their lack of precision and scope.

A Facebook post announcing the publication: “Here are the anonymized results of IET in csv #opendata”

A Facebook comment reacting to the publication: “Super! Almost 80 thousand entries” (actually more ;)

A tweet discussing the data: “Some highly expected conclusions from IET data from SECA…”

As Igor Samokhin, one of the researchers who used the released EIT dataset in his studies, put it: “[..This year’s] EIT result dataset allows for the first time to study the distribution of scores on all levels of aggregation (school, school type, region, sex) and to measure inequality in scores between students and between schools on different levels.[…] The dataset is detailed enough that researchers can ask questions and quickly find answers without the need to ask for additional data from the state agencies, which are usually very slow or totally unresponsive when data is needed on the level lower than regional.”

Indeed, the dataset made possible some interesting visualisations and analysis.

A simple visualisation showing differences in test results between boys and girls

Quick analysis of birth years of those who took IET in 2016

But that amount of data and the variety of dimensions (characteristics) available carry many risks, unforeseen by the data providers and overlooked by the hyped open data community and educational experts. We’ve made a short analysis of the most obvious threat scenarios.

II. What could go wrong?

As demonstrated by various past cases across the world, microdata disclosure, while extremely valuable for many types of research such as longitudinal studies, is highly susceptible to re-identification attacks. To understand the risks involved, we went through a process called threat modeling. This consists in analysing all the potential weaknesses of a system (here, the anonymisation technique used on the dataset) from the point of view of a potential individual with malicious intentions (called an ‘attacker’). Three threat models emerged from this analysis:

The ‘Nosy neighbour’ scenario

The first and most problematic possibility is called the “nosy neighbour” scenario. This corresponds to an unexpected disclosure of results to relatives, neighbours, school teachers, classmates, or anyone with enough knowledge about an individual described in the dataset to recognise who the data describes, without having to look at the name. The risks involved with this scenario include possible online and offline harassment of people with test results that are too low or too high, depending on context. Unwanted disclosure may happen because people in the subject’s close environment may already have some additional information about the person. If you know that your classmate Vadym was one of the few people in the village to take chemistry in the test, you can easily deduce which line of the data corresponds to him, discovering in the same way all the details of his test results. And depending on what you (and others) discover about Vadym, the resulting social judgment could be devastating for him, all because of an improperly anonymised dataset. This is a well-known anonymisation problem: it is really hard to achieve good anonymity with that many dimensions, in this case the subjects and exact results of multiple tests and their primary examination location. It is an especially alarming problem for schools in small villages or specialised schools, where social pressure and the subsequent risk of stigmatisation are already very high.
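The “nosy neighbour” risk can be estimated before publication by counting how many records share the same combination of indirectly identifying fields. The following is a minimal sketch of such a check, not the method used by CEQA or by the authors. The records, field order and helper names are invented for illustration; only the kinds of fields (birth year, sex, test location, subjects taken) come from the description above.

```python
from collections import Counter

# Hypothetical records: (birth_year, sex, test_location, subjects_taken).
# All values are invented; a real check would read the dataset to be published.
records = [
    (1999, "F", "School A", ("ukrainian", "maths", "history")),
    (1999, "F", "School A", ("ukrainian", "maths", "history")),
    (1999, "M", "School A", ("ukrainian", "maths", "chemistry")),
    (1998, "M", "School B", ("ukrainian", "maths", "history")),
    (1998, "M", "School B", ("ukrainian", "maths", "history")),
    (1998, "M", "School B", ("ukrainian", "maths", "history")),
]


def group_sizes(rows):
    """Count how many rows share each exact combination of field values."""
    return Counter(rows)


def unique_records(rows):
    """Return the rows whose combination of values appears only once."""
    sizes = group_sizes(rows)
    return [row for row in rows if sizes[row] == 1]


def smallest_group(rows):
    """Size of the smallest group; 1 means at least one person stands out."""
    return min(group_sizes(rows).values())


if __name__ == "__main__":
    for row in unique_records(records):
        print("stands out:", row)
    print("smallest group size:", smallest_group(records))
```

In this toy data the only student at School A who took chemistry forms a group of one, which is exactly the situation described for Vadym. Adding exact scores to the tuple would make almost every record unique, which is why coarser values (score bands, region instead of school) are usually needed.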

The ‘Ratings fever’ problem

Map of schools in Kiev, Ukraine’s capital, made by the most popular online media based on EIT results

The second problem with educational data is hardly new, and the release of this dataset just made it worse. With added precision and targeting power, more fervour was granted to the media’s favoured exercise of grading schools according to the successes and failures of their students in the external testing. In previous years, many educational experts criticised ratings made by the media and by different government authorities for incompleteness: they were based either on a full dataset, but for only one test subject, or on heavily aggregated and non-exhaustive data. But such visualisations can have consequences more problematic than misleading news readers about the accuracy of the data. The issue here is the ethical use of the data, something often overlooked by the media in Ukraine, who happily jumped on the opportunity to make new ratings.

As educational expert Iryna Kogut from CEDOS explains: “EIT scores by themselves cannot be considered a sign of the quality of education in an individual school. The new dataset and the subsequent school ratings based on it and republished by CEQA only perpetuate this problem. Public opinion about the quality of teaching and parental choice of school relies on results of the EIT, but the authors of the rating do not take into account parents’ education, family income, the effect of private tutoring and other out-of-school factors which have a huge influence on learning results. Besides, some schools are absolutely free to select better students (usually from families with higher socioeconomic status), and this process of selection into “elite” schools is usually neither transparent nor fair. So they are from the start not comparable with the schools having to teach ‘leftovers’.”

Even as people start understanding the possible harm of the “rate everything” mentality for determining both public policy and individual decisions, almost every local website and newspaper has made or republished school ratings for their cities and regions. In theory, there could be benefits to the practice, such as efforts to improve school governance. Instead, what seems to happen is that more students from higher-income families migrate to private schools, and less wealthy parents are incentivised to use ‘unofficial’ methods to transfer their kids to public schools with better EIT records. Overall, this is a case where the principle “the more informed you are the better” is actually causing harm to the common good, especially when there is no clear agenda or policy in place to create a fairer and more inclusive environment in Ukrainian secondary education.

Mass scale disclosure

The last and most long-term threat identified is the possible future negative impact on the personal life of individuals due to the unwanted disclosure of test results. This scenario considers the possibility of mass-scale unwanted identity disclosure of individuals whose data were included in the recent EIT dataset. As our research has shown, it would be alarmingly easy to execute. The only thing one needs to look at is already-published educational data. To demonstrate the existence of this threat, we only had to use one additional dataset: close to half of the EIT records could be de-anonymised with varying levels of certainty, meaning that we could find the identity of the individual behind the results (or narrow down the possibilities to a couple of individuals) for one hundred thousand individual records. The additional dataset we used comes from another government website, vstup.info, which lists all applicants to every Ukrainian university. The data includes the family names and initials of each applicant, along with the combined EIT result scores. The reason behind publishing this data was to make the acceptance process more transparent and reduce the room for possible manipulation. But with some data wrangling and mathematical work, we were able to join this data with the EIT dataset, allowing a mass-scale de-anonymisation.

So what lessons should be learned from this? First, while publishing microdata may bring enormous benefits to researchers, one should be conscious that anonymisation can be a really hard and non-trivial problem to solve. Sometimes less precision is needed to preserve the anonymity of the persons whose data is included in the dataset. Second, it is important to be aware of any other existing datasets that may be used for de-anonymisation. It is the responsibility of the data publisher to check this before sharing any information. Third, it is not enough just to publish a dataset. It is important to make sure that your data will not be used in an obviously harmful or irresponsible manner, as in the various eye-catching ratings and “comparisons” that are very damaging in the long run.
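The second lesson can be turned into a check that a data publisher runs before release: given the published per-subject multipliers, how many anonymised records reproduce each combined score that is already public elsewhere? The sketch below is an illustration under stated assumptions, not a reconstruction of the authors’ procedure. The record identifiers, subjects, scores and multipliers are all invented, and the combined score is assumed to be a plain weighted sum of subject scores.

```python
from decimal import Decimal

# Hypothetical multipliers for one university programme (invented values).
multipliers = {"ukrainian": "0.3", "maths": "0.5", "history": "0.2"}

# Hypothetical anonymised test records keyed by random identifier (invented values).
test_records = {
    "id-001": {"ukrainian": "180.5", "maths": "192.0", "history": "171.0"},
    "id-002": {"ukrainian": "165.0", "maths": "150.5", "history": "188.0"},
    "id-003": {"ukrainian": "180.5", "maths": "192.0", "history": "171.0"},
}


def combined_score(scores, weights):
    """Weighted sum of subject scores, computed exactly with Decimal."""
    return sum(Decimal(scores[subject]) * Decimal(weight)
               for subject, weight in weights.items())


def candidates(records, weights, published_score):
    """Identifiers of records whose weighted sum equals a published combined score."""
    target = Decimal(published_score)
    return sorted(record_id for record_id, scores in records.items()
                  if combined_score(scores, weights) == target)


if __name__ == "__main__":
    for score in ("162.35", "184.35"):
        print(score, "->", candidates(test_records, multipliers, score))
```

A published score that maps to exactly one record means that person’s full results are linkable to a name; a score that maps to several records narrows the identity down to a small group. Publishing scores with less precision increases the number of candidates per combined score.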

Public and Private Life of Animals (1877)

Adam Green - May 23, 2017 in Balzac, France, grandville, J.J. Grandville, paris, satire

Collection of acerbic animal fables, penned by the likes of Honoré de Balzac and George Sand, and illustrated by the brilliant J. J. Grandville.

Some basic information on Freedom of Information in Sweden

Xiaowei Chen - May 23, 2017 in Open Data

1. Transparent governmental structure: at the demand of the King!
Sweden, with its current population of 9.8 million, used to be a relatively poor country; its wealth did not start to grow until the 1870s. The King had to collect tax from a very limited population to finance the spending of his government as well as his family, which is why he strictly demanded that the tax collected by local government officers and public servants be handed over to the royal household with little loss. This demand from above, not from the bottom of society, resulted in a governmental system of high efficiency and little corruption, and in a tradition of transparency in this country.
Yet this was not an inevitable result, as a comparison with Russia shows. A tradition of making compromises has its roots deep in Swedish history. Kings in Sweden always had limited power. Representatives of the four Estates (Nobility, Clergy, Burghers and Peasants) formed a parliament (the so-called “diet”) back in the 15th century, with which the King had to negotiate on taxation issues. In the courts, too, the jury was made up of representatives of local farmers, businessmen, etc. And even back in Viking times there was a democratic tradition similar to that of Greece. To balance the interests of society, Swedish Kings have mostly lived a decent life, and the country has been lucky enough to avoid dictators.
2. FOI originally as a limit to the King’s power, but later also to the parliament itself
During 1718-1772, Sweden experienced its “Age of Liberty”. In 1766 the Diet passed the Freedom of the Press Act (Tryckfrihetsförordningen, TF for short). It remains part of the Swedish constitution to this day. The “principle of public access” (offentlighetsprincipen) has been in place since 1766; it gave every Swedish citizen the right to inspect public documents that are not secret and the right to attend court proceedings and decision-making political meetings. Strictly speaking, the law requires Swedish citizenship, but this is about to change. In practice it is almost always understood as a right of everyone. Similar laws now exist in over 70 countries, yet 251 years ago Sweden was the very first country in the world to pass such legislation, which is itself a result of the balance of power.
Before this legislation, the Swedish Diet could only react to measures taken by the monarchy. The legislation allowed the Diet to act pre-emptively towards the authorities led by the King. As the monarchy passed its power step by step to modern political offices, the parliament itself, as well as other authorities, became the main targets of the Freedom of the Press Act, and the King now acts only as a representative of the country.
3. Transparent Government with mini-bikini
In Sweden, all official documents are public unless they are marked “classified”, while in many other countries documents are not accessible unless they are published by the government itself.
Every Swedish citizen shall be entitled to have free access to official documents, in order to encourage the free exchange of opinion and the availability of comprehensive information. A citizen makes a request to the government (personal visit, email, letter, phone call…) and gets an answer within a few days. The right of access to official documents may be restricted only if restriction is necessary with regard to:
1. the security of the Realm or its relations with another state or an international organisation;
2. the central fiscal, monetary or currency policy of the Realm;
3. the inspection, control or other supervisory activities of a public authority;
4. the interests of preventing or prosecuting crime;
5. the economic interests of the public institutions;
6. the protection of the personal or economic circumstances of individuals;
7. the preservation of animal or plant species.
(The Constitution of Sweden, The Freedom of the Press Act, Chapter 2. Link: http://riksdagen.se/en/SysSiteAssets/07.-dokument–lagar/the-constitution-of-sweden-160628.pdf/ )
A document is treated as public until it goes through a confidentiality test, which happens when someone requests it. From tax records to the decisions of a public school board, thousands of documents are requested each day, so it is impossible to keep official statistics on the number of FOI cases. Roughly speaking, 95% of documents are public, although they are not automatically published.
To make it even harder for the Swedish Government to hide any documents, each public office is obliged to keep a “transparency diary”, which keeps track of all activities and signed documents, with a clear subject (“regards”) for each document. Even if the Government has the documents classified, the diary remains open, so that the public is still aware of the subjects the government is working on, even when the details are classified. It is a routine job for journalists to check this register and ask for documents according to it. (More details about this mechanism will follow in two weeks after another interview.)
4. Mechanisms: In case of delay or refusal
There is the possibility that requests from citizens get refused by the government. Sweden has four pillars to support citizens’ access to governmental documents: the Parliamentary Ombudsman, the Chancellor of Justice, the courts, and the media.
The Parliamentary Ombudsman and the Chancellor of Justice are responsible for ensuring that public authorities and officials observe the law and fulfil their duties.
The Swedish definition of “delay” is very different from the German one (30 days): a governmental office might be criticised by the Chancellor of Justice for not responding to a question within 2 days(!!!). This is a great benefit for journalists who need governmental information and work under time pressure.
The Chancellor of Justice often openly criticises the work of different departments of the Swedish government: http://www.jk.se/beslut-och-yttranden/ The Chancellor also has the right to prosecute improper acts of the authorities, which is rarely exercised. In 2016 the Chancellor of Justice handled a total of about 7 933 cases. (More details follow after another interview with the Parliamentary Ombudsman.)
If the authority still refuses to give certain information without proper explanation, the citizen can take it to court, FOR FREE!!! The complainant doesn’t have to hire a lawyer, so the cost can be rather limited. It takes about one or two months (in Germany this could be six months or more) until the judge decides fairly according to the law.
In addition, Swedish law protects the “Freedom of Whistleblowing” (Meddelarfrihet), so that civil servants in the public sector are allowed to disclose illegal behaviour in the public sector, even when this involves classified information. The whistleblower is not supposed to talk to a random person but to a journalist, with the purpose of revealing illegal behaviour.
The journalist, together with the “responsible editor” of his news agency, decides whether to report. The legally responsible editor takes full responsibility for the report. The journalist, as well as his sources, is well protected under the freedom of the press: it is illegal for anyone to try to ask the journalist about his informant. The legally responsible editor is supposed to give a second opinion, but the risk he carries forces him to be cautious. This could be seen as a kind of “self-censorship” to a certain degree.
5. Has EU made Sweden a better place?
Documents and communications between different departments of the Swedish government are transparent, but since Sweden became an EU member in 1995, it has had to follow EU regulations when it comes to EU documents; the EU refused to make an exception for Sweden in this matter. The number of classified documents in Sweden has therefore been increasing. This has caused protests among journalists, not strong enough to cause a “Brexit”, but Sweden’s hope that the EU should increase its transparency has to be considered by Brussels.
This report is based mainly on a talk with:
Jonas Nordin (Professor of History, Swedish national library, May 8th)
Also enriched by talks with:
Tove Carlen (Ombudsman of Swedish Journalistförbundet, May 9th)
Johan Hirschfeldt (lawyer, former President of Court of Appeal/Svea Hovrätt, May 18th)
and many thoughts from different sources
A lot of thanks!