
PDF data liberation hackathon

- January 14, 2014 in comunidad, ddj, Evento, hackaton, journalism, okfn, ong, opendata, opengov, pdf, web

On Saturday January 18, starting at 11 am, the data community will take part in a "PDF liberation" hackathon at GarageLab, joining the call issued by the Sunlight Foundation for January 17, 18 and 19 at different venues (Washington, Chicago and San Francisco, among other cities). The shared goal is to "liberate" the datasets trapped in PDF files. The invitation is open to programmers, activists, journalists and anyone interested in working with data. Organizing and driving the event is Manuel Aristarán: "There is an enormous amount of information trapped in PDF files. There are two reasons for that, as I said in the talk I gave at the Hacks/Hackers Buenos Aires MediaParty 2013. The first is ignorance: many people don't know that PDF is a terrible format for sharing information. The second is pure malice: extracting data from PDF files is annoying at best, and many take advantage of that." The developer of Tabula added: "As a 2013 fellow of the Knight-Mozilla OpenNews program, I worked a great deal with datasets in PDF format. The result of that interest was Tabula, a free tool for extracting tables from PDF files that generated considerable enthusiasm in the data journalism and open data communities."
If you feel like spending the afternoon scraping PDFs, or if you have a dataset in PDF form that you want to liberate, fill in the registration form.
We hope to see you there!

Scraping PDFs with Python and the scraperwiki module

- August 16, 2013 in pdf, Python, scraping, tech


While Tabula is a viable option for simple single- or double-page tables, if you have PDFs with tables spanning multiple pages you'll soon grow old marking them up. This is where you'll need some scripting. Thanks to scraperwiki's library (pip install scraperwiki) and its included pdftoxml function, scraping PDFs has become a feasible task in Python. At a recent Hacks/Hackers event we ran into a candidate that was quite tricky to scrape, so I decided to document the process here.

First import the scraperwiki library and urllib2 – since the file we're using is on a webserver – then open and parse the document…

import scraperwiki, urllib2
u=urllib2.urlopen("http://images.derstandard.at/2013/08/12/VN2p_2012.pdf") # open the url for the PDF
x=scraperwiki.pdftoxml(u.read()) # interpret it as xml
print x[:1024] # let's see what's in there, abbreviated...
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="0.22.5">
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
    <fontspec id="0" size="8" family="Times" color="#000000"/>
    <fontspec id="1" size="7" family="Times" color="#000000"/>
<text top="42" left="64" width="787" height="12" font="0"><b>TABELLE VN2Ap/1                         
                                                  30/07/13  11.38.44  BLATT    1 </b></text>
<text top="58" left="64" width="718" height="12" font="0"><b>STATISTIK ALLER VORNAMEN (TEILWEISE PHONETISCH ZUSAMMENGEFASST, ALPHABETISCH SORTIERT) FÜR NEUGEBORENE KNABEN MIT </b></text>
<text top="73" left="64" width="340" height="12" font="0"><b>ÖSTERREICHISCHER STAATSBÜRGERSCHAFT 2012  - ÖSTERREICH </b></text>
<text top="89" left="64" width="6" height="12" font="0"><b> </b></text>
<text top="104" left="64" width="769" height="12" font="0"><b>VORNAMEN                  ABSOLUT      
%   
As you can see above, we have successfully loaded the PDF as XML (take a look at the PDF by opening the url given; it should give you an idea of how it is structured). The basic structure of a PDF parsed this way is always page tags followed by text tags containing the content plus positioning and font information. The positioning and font information can often help to get at the table we want – however not in this case: everything is font="0" and left="64" (a short sketch below shows how positions can help when they do vary). We can now use xpath to query our document…

import lxml.etree
r=lxml.etree.fromstring(x)
r.xpath('//page[@number="1"]')
[<Element page at 0x31c32d0>]
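A quick aside on those positioning attributes. They don't buy us anything for this particular document, but when a table's cells arrive as separate text tags with varying positions, you can cluster them by their top attribute to rebuild rows and sort by left to rebuild columns. A hypothetical sketch, reusing the r tree from above:

from collections import defaultdict

rows=defaultdict(list)
for t in r.xpath('//page[@number="1"]/text'):
    rows[int(t.get("top"))].append(t) # group elements sharing a vertical position into a row
for top in sorted(rows):
    cells=sorted(rows[top], key=lambda c: int(c.get("left"))) # order the cells left to right
    print [c.xpath("string()") for c in cells] # the string value of each cell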
Back to our document – let's get some lines out of it:

r.xpath('//text[@left="64"]/b')[0:10] # array abbreviated for legibility
[<Element b at 0x31c3320>,
 <Element b at 0x31c3550>,
 <Element b at 0x31c35a0>,
 <Element b at 0x31c35f0>,
 <Element b at 0x31c3640>,
 <Element b at 0x31c3690>,
 <Element b at 0x31c36e0>,
 <Element b at 0x31c3730>,
 <Element b at 0x31c3780>,
 <Element b at 0x31c37d0>]
r.xpath('//text[@left="64"]/b')[8].text
u'Aaron *                        64       0,19       91               Aim\xe9                        1       0,00      959 '
Great – this will help us. If you look at the document you'll notice that the boys' names are on pages 1-20 and the girls' names on pages 21-43 – let's get them separately…

boys=r.xpath('//page[@number<="20"]/text[@left="64"]/b')
girls=r.xpath('//page[@number>"20" and @number<="43"]/text[@left="64"]/b')
print boys[8].text
print girls[8].text
Aaron *                        64       0,19       91               Aimé                            1   0,00      959 
Aarina                          1       0,00    1.156               Alaïa                           1   0,00    1.156 
Fantastic – but you'll also notice something: the columns are all there, separated by whitespace, and Aaron has an asterisk we want to remove (the asterisk is explained in the original document). To split each line into columns I'll create a small function using a regex:

import re

def split_entry(e):
    return re.split("[ ]+",e.text.replace("*","")) # we're removing the asterisk here as well...

Now let's apply it to boys and girls:

boys=[split_entry(i) for i in boys]
girls=[split_entry(i) for i in girls]
print boys[8]
print girls[8]
[u'Aaron', u'64', u'0,19', u'91', u'Aim\xe9', u'1', u'0,00', u'959', u'']
[u'Aarina', u'1', u'0,00', u'1.156', u'Ala\xefa', u'1', u'0,00', u'1.156', u'']
That worked! Notice the empty string u'' at the end of each entry? I'd like to filter it out. I'll do this using the ifilter function from itertools:

import itertools

boys=[[i for i in itertools.ifilter(lambda x: x!="",j)] for j in boys]
girls=[[i for i in itertools.ifilter(lambda x: x!="",j)] for j in girls]
print boys[8]
print girls[8]
[u'Aaron', u'64', u'0,19', u'91', u'Aim\xe9', u'1', u'0,00', u'959']
[u'Aarina', u'1', u'0,00', u'1.156', u'Ala\xefa', u'1', u'0,00', u'1.156']
That worked and cleaned up our boys and girls arrays. We still need to split them up properly though: each line holds two columns, each four fields wide. I'll do this with a little function:

def take4(x):
    if (len(x)>5):
        return [x[0:4],x[4:]]
    else:
        return [x[0:4]]

boys=[take4(i) for i in boys]
girls=[take4(i) for i in girls]
print boys[8]
print girls[8]
[[u'Aaron', u'64', u'0,19', u'91'], [u'Aim\xe9', u'1', u'0,00', u'959']]
[[u'Aarina', u'1', u'0,00', u'1.156'], [u'Ala\xefa', u'1', u'0,00', u'1.156']]
Ah, that worked nicely! Now let's flatten the nested lists back into a single array of rows – for this I'll use reduce:

boys=reduce(lambda x,y: x+y, boys, [])
girls=reduce(lambda x,y: x+y, girls, [])
print boys[10]
print girls[10]
['Aiden', '2', '0,01', '667']
['Alaa', '1', '0,00', '1.156']
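As an aside, itertools (already imported above) could have done the same flattening in place of the reduce:

# an equivalent alternative to the reduce above, not an extra step:
boys=list(itertools.chain.from_iterable(boys))
girls=list(itertools.chain.from_iterable(girls))

Either way, the result is the same flat list of rows.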
Perfect – now let's add a gender to the entries:

for x in boys:
    x.append("m")

for x in girls:
    x.append("f")

print boys[10]
print girls[10]
['Aiden', '2', '0,01', '667', 'm']
['Alaa', '1', '0,00', '1.156', 'f']
We got that! For further processing I'll join the two arrays up:

names=boys+girls
print names[10]
['Aiden', '2', '0,01', '667', 'm']
let’s take a look at the full array… names[0:10]
[['TABELLE', 'VN2Ap/1', '30/07/13', '11.38.44', 'm'],
 ['BLATT', '1', 'm'],
 [u'STATISTIK', u'ALLER', u'VORNAMEN', u'(TEILWEISE', 'm'],
 [u'PHONETISCH',
  u'ZUSAMMENGEFASST,',
  u'ALPHABETISCH',
  u'SORTIERT)',
  u'F\xdcR',
  u'NEUGEBORENE',
  u'KNABEN',
  u'MIT',
  'm'],
 [u'\xd6STERREICHISCHER', u'STAATSB\xdcRGERSCHAFT', u'2012', u'-', 'm'],
 ['m'],
 ['VORNAMEN', 'ABSOLUT', '%', 'RANG', 'm'],
 ['VORNAMEN', 'ABSOLUT', '%', 'RANG', 'm'],
 ['m'],
 ['INSGESAMT', '34.017', '100,00', '.', 'm']]
Notice there is still quite a bit of mess in there: basically all the lines starting with an all-caps entry, or with "der", "m" or "f". Let's remove them…

names=itertools.ifilter(lambda x: not x[0].isupper(),names) # remove all-caps entries
names=[i for i in itertools.ifilter(lambda x: not (x[0] in ["der","m","f"]),names)] # remove all entries starting with "der", "m" or "f"
names[0:10]
[['Aiden', '2', '0,01', '667', 'm'],
 ['Aiman', '3', '0,01', '532', 'm'],
 [u'Aaron', u'64', u'0,19', u'91', 'm'],
 [u'Aim\xe9', u'1', u'0,00', u'959', 'm'],
 ['Abbas', '2', '0,01', '667', 'm'],
 ['Ajan', '2', '0,01', '667', 'm'],
 ['Abdallrhman', '1', '0,00', '959', 'm'],
 ['Ajdin', '15', '0,04', '225', 'm'],
 ['Abdel', '1', '0,00', '959', 'm'],
 ['Ajnur', '1', '0,00', '959', 'm']]
Woohoo – we have a cleaned-up list. Now let's write it out as CSV…

import csv

f=open("names.csv","wb") # open the file for writing
w=csv.writer(f) # create a csv writer
w.writerow(["Name","Count","Percent","Rank","Gender"]) # write the header
for n in names:
    w.writerow([i.encode("utf-8") for i in n]) # write each row
f.close()

Done – we've scraped a multi-page PDF using Python. All in all this was a fairly quick way to get the data out of a PDF using the scraperwiki module.
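If you'd rather end up with a queryable table than a CSV file, the scraperwiki module can also persist rows into a local sqlite database. A sketch, assuming the module's sqlite.save helper and reusing the header fields from above:

fields=["Name","Count","Percent","Rank","Gender"]
for n in names:
    # upsert each row into the local sqlite store, keyed on name and gender
    scraperwiki.sqlite.save(unique_keys=["Name","Gender"], data=dict(zip(fields,n)))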

PDF / TIFF / Scan to Text Conversion Service

- April 26, 2011 in pdf, service, text

A dedicated service to do PDF-to-text extraction (or from TIFF or other image scan formats). (Originally: http://trac.okfn.org/ticket/182)
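Such a service would presumably wrap existing converters: poppler's pdftotext handles text-based PDFs, while image scans would additionally need OCR, for example via tesseract. A minimal sketch of the PDF side, with placeholder file names:

import subprocess

# convert a text-based PDF to plain text with poppler's pdftotext
# ("input.pdf" and "output.txt" are placeholders)
subprocess.check_call(["pdftotext", "input.pdf", "output.txt"])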