Thursday, February 27, 2020

Week 5: Parsing the PDF

In our most recent data management work, we ran into trouble parsing a PDF containing user-filled forms. Hours of experimenting with different PDF-to-text classes produced some text output, but the entered form data was missing. As a band-aid, we filled out the form and then used the print-to-PDF function found on most devices, which flattens the document into a single layer of text. That workaround introduced its own problems, including blank or unreadable output.

After more experimenting, I found a parser that pulls just the form data and nothing else. Using a doc parsing class, I was able to extract the form data without saving to a new file or printing to PDF.

To get the form data ready for upload to the database, the doc class must arrange it so that each value can be assigned to the appropriate cell. Browsing through the class structure and understanding how it collects text from documents showed me how to modify it to insert a delimiter between each parsed form field.

You can think of a delimiter as a cell wall like those in a spreadsheet, except that the data is separated by a text character. Character delimiters let scripts recognize where each data point starts and ends, so they can easily isolate what comes before, after, or between the delimiters.
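To make the before/after/between idea concrete, here is a minimal Python sketch. The sample record and the "|" delimiter are made up for illustration and are not taken from our actual forms.

```python
# Hypothetical delimited string; "|" is an assumed delimiter for illustration only.
record = "Jane Doe|42 Main St|555-0100"

# partition() cuts at the first delimiter: the text before it,
# the delimiter itself, and everything after it.
before, sep, after = record.partition("|")

# split() cuts at every delimiter, so each data point lands in its own slot.
fields = record.split("|")

# The value sitting between the first and second delimiters.
between = fields[1]

print(before)   # text before the first delimiter
print(after)    # text after the first delimiter
print(between)  # text between the two delimiters
```

The same two calls work with any single-character or multi-character delimiter, which is why the choice of delimiter can be changed later without rewriting the script.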

In its default state, the doc class uses the space character as a delimiter, which makes organizing data difficult. The standard comma delimiter can be useful here, but what if your data points must include commas? That either causes a problem or requires limiting the characters accepted in a form field. Using an uncommon character as a delimiter means almost any character can appear in the data points, giving the user much more freedom.
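The comma problem is easy to see in a short Python sketch. The field values are invented, and the ASCII unit separator ("\x1f") is only one possible uncommon character, chosen here as an assumption; join_fields is a hypothetical helper, not part of the doc class.

```python
# Assumed delimiter: the ASCII unit separator, which users cannot normally type.
DELIMITER = "\x1f"

def join_fields(values, delimiter=DELIMITER):
    """Join form values into one string, refusing values that contain the delimiter."""
    for value in values:
        if delimiter in value:
            raise ValueError(f"field contains the delimiter: {value!r}")
    return delimiter.join(values)

# Hypothetical form values, two of which legitimately contain commas.
values = ["Doe, Jane", "Portland, OR", "Notes: none"]

# With a comma delimiter the three fields come back as five pieces.
comma_pieces = ",".join(values).split(",")

# With the uncommon delimiter the original three fields survive the round trip.
safe_pieces = join_fields(values).split(DELIMITER)

print(len(comma_pieces), len(safe_pieces))
```

The check inside join_fields covers the one remaining restriction: whatever character is chosen as the delimiter still cannot appear inside a field.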

The resulting combination of code now outputs a delimited string that can be easily converted into an array, which can then be sent to the database.
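As a rough sketch of that last step, the Python below splits a delimited string into an array and inserts it as one row. Everything specific here is an assumption: the delimiter, the column names, the submissions table, and the use of an in-memory SQLite database stand in for whatever the real project uses.

```python
import sqlite3

DELIMITER = "\x1f"                      # assumed delimiter
COLUMNS = ("name", "address", "phone")  # hypothetical column names

def parse_record(delimited, delimiter=DELIMITER):
    """Split a delimited string into a list with one entry per column."""
    fields = delimited.split(delimiter)
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return [field.strip() for field in fields]

def insert_record(conn, row):
    """Insert one parsed row, using placeholders so values are never spliced into SQL."""
    conn.execute(
        "INSERT INTO submissions (name, address, phone) VALUES (?, ?, ?)", row
    )
    conn.commit()

# In-memory database used only so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE submissions (name TEXT, address TEXT, phone TEXT)")

# Hypothetical parser output; note the comma inside the address field.
delimited = DELIMITER.join(["Jane Doe", "42 Main St, Apt 3", "555-0100"])
row = parse_record(delimited)
insert_record(conn, row)

stored = conn.execute("SELECT name, address, phone FROM submissions").fetchone()
print(stored)
```

Checking the field count before inserting catches forms where a field was skipped or the delimiter leaked into the data, before anything reaches the database.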
