Extracting Data from PDFs
October 31 @ 3:00 pm - 4:30 pm
Modern Languages Building (MLB), Room 2001A
Do you have useful information locked in a data table inside a PDF? Such tables are notoriously hard to extract with standard PDF text-mining tools. Join us to explore Tabula, a tool for extracting these datasets. Tabula works with text-based PDFs, not scanned documents that might require OCR. The workshop will work through some hands-on examples. Tabula can be used from a number of languages, including Java, Ruby, Node.js, R, and Python. The workshop will be conducted in Python, using Anaconda Python 3.5 and a Jupyter Notebook, and is intended for participants with some programming background.
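For a taste of what this looks like in Python, here is a minimal sketch. It assumes the tabula-py wrapper package (imported as tabula) is installed and a Java runtime is available; the helper name extract_tables and its reader parameter are hypothetical, introduced only for illustration, and the actual workshop material may differ.

```python
from typing import Callable, Optional


def extract_tables(pdf_path: str, pages: str = "all",
                   reader: Optional[Callable[..., list]] = None) -> list:
    """Return the non-empty tables found in a text-based PDF."""
    if reader is None:
        # Assumption: tabula-py is installed and Java is on the PATH.
        import tabula
        reader = tabula.read_pdf
    # With multiple_tables=True, tabula-py returns a list of pandas DataFrames,
    # one per table it detects on the requested pages.
    tables = reader(pdf_path, pages=pages, multiple_tables=True)
    # Drop tables that came back with no rows.
    return [table for table in tables if len(table) > 0]
```

Calling extract_tables on a PDF of your own would give a list of DataFrames that can be inspected in a Jupyter Notebook or saved with to_csv. Scanned PDFs would still need OCR first.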