Saturday, 27 June 2015

Extracting data from PDF files

I have a number of PDF files that I'm trying to extract data from. My final goal is to build a website similar to PCPartPicker.com, but using data from the pricelists of local stores in my country, which don't have online stores I can grab data from.

An example of a pricelist: http://ift.tt/1CAEnHJ

What I want to do is:

  1. Search for a specific product and extract its price.
  2. Do so across the pricelists of all the different shops.
  3. Return the lowest price.
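
The three steps above could be sketched roughly as follows. This is only a minimal sketch that works on text already extracted from each PDF. The shop names, product names, the PRICELISTS hash and the helper names find_price and lowest_price are all hypothetical, and it assumes each product sits on one line with its price as the last integer on that line, which will not hold for every store.

```ruby
# Hypothetical sample data: shop name => text already extracted from its PDF.
# The "product name ... price" line layout is an assumption for illustration.
PRICELISTS = {
  "Shop A" => "Intel Core i5-4460    185\nKingston SSD 120GB    60\n",
  "Shop B" => "Intel Core i5-4460 ..... 179\n",
  "Shop C" => "AMD FX-6300    110\n"
}

# Step 1: return the price on the first line mentioning the product, or nil.
# Assumes the price is the last integer on that line.
def find_price(text, product)
  line = text.each_line.find { |l| l.downcase.include?(product.downcase) }
  return nil unless line
  match = line.match(/(\d+)\s*\z/)
  match && match[1].to_i
end

# Steps 2 and 3: check every shop, drop shops that don't list the product,
# and return [shop, price] for the cheapest offer (nil if nobody lists it).
def lowest_price(pricelists, product)
  offers = pricelists.map { |shop, text| [shop, find_price(text, product)] }
  offers.reject { |_, price| price.nil? }.min_by { |_, price| price }
end

shop, price = lowest_price(PRICELISTS, "Core i5-4460")
puts "#{shop}: #{price}"  # prints "Shop B: 179"
```

Keeping the per-shop lookup (find_price) separate from the comparison (lowest_price) means a store with a different layout would only need its own lookup function.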

I'm currently using the pdf-reader gem (http://ift.tt/VZKyCf) to convert the pricelists into text, and then trying to extract each product's price from that text.

An example of a conversion is shown here: [image not available]

Unfortunately, not all PDFs are created equal. The biggest problem with this method is that I can't write an algorithm that can accurately capture all the prices.

My biggest problems:

  1. All stores have differently formatted PDF files.
  2. For items like RAM, how do I accurately get the price of a 4GB or 8GB module? For other products I can simply read the next integer on the following line, but how do I do it for RAM, flash cards and the like, where the product description itself contains numbers?
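
For the RAM problem, one possible approach is to ignore any number that is glued to letters or followed by a unit, and treat the last remaining standalone integer as the price. The sketch below is only an illustration: the unit list, the sample lines and the helper name price_from_line are assumptions, and it expects the price to be a plain integer (no decimals or thousands separators) on the same line as the product.

```ruby
# Units that mark a number as a specification rather than a price.
# This list is an assumption; extend it for the pricelists at hand.
UNITS = "GB|MB|TB|MHz|GHz"

# An integer that is not glued to other word characters (as in "DDR3" or
# "i5-4460"), not followed by a unit (as in "4GB" or "8 GB"), and not part
# of a decimal or comma-separated number.
STANDALONE_INT = /(?<![\w.,\-])\d+(?!\s*(?:#{UNITS})\b)(?![\w.,])/i

# Returns the last standalone integer on the line, or nil if there is none.
def price_from_line(line)
  numbers = line.scan(STANDALONE_INT)
  numbers.empty? ? nil : numbers.last.to_i
end

puts price_from_line("Kingston HyperX DDR3 4GB 1600MHz    35")  # prints 35
puts price_from_line("Corsair Vengeance 8 GB DDR3   68")        # prints 68
p price_from_line("SanDisk Ultra microSD 32GB")                 # prints nil
```

Returning nil when no standalone integer is found makes it possible to flag lines for manual review instead of silently recording a capacity as a price.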

It has also occurred to me that a PDF-to-text conversion may not be the right approach for my goal. My original idea was that extracting prices would be easier after converting to text. Is there any other way to properly parse PDFs for their product prices?

This is my first ever web development venture, during my freshman summer holidays, so I don't really have a prof/tutor to ask. I apologize if I'm not familiar with some of the key concepts when it comes to programming and web development. Thanks in advance :)



