I have a number of PDF files that I'm trying to extract data from. My final goal is to build a website similar to PCPartPicker.com, but using data from the pricelists of local stores in my country, since they don't have online stores I can grab data from.
An example of a pricelist: http://ift.tt/1CAEnHJ
What I want to do is:
- Search for a specific product and extract its price.
- Do so for every shop's pricelist.
- Return the lowest price.
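A minimal sketch of these three steps, working on text that has already been converted from PDF. It assumes a layout where the price is the first integer on the line after the product name; real pricelists may well differ. The helper names `find_price` and `lowest_price` and the sample data are hypothetical.

```ruby
# Hypothetical helper: returns the price of the first line matching `query`,
# assuming the price is the first integer on the following line.
# Returns nil when the product or its price cannot be found.
def find_price(text, query)
  lines = text.lines.map(&:strip)
  index = lines.index { |line| line.downcase.include?(query.downcase) }
  return nil if index.nil?

  next_line = lines[index + 1]
  return nil if next_line.nil?

  match = next_line.match(/\d+/)
  match && match[0].to_i
end

# Hypothetical helper: `pricelists` maps a shop name to its converted text.
# Returns [shop, price] for the cheapest match, or nil if no shop lists it.
def lowest_price(pricelists, query)
  pricelists
    .map { |shop, text| [shop, find_price(text, query)] }
    .reject { |_shop, price| price.nil? }
    .min_by { |_shop, price| price }
end

# Made-up sample data standing in for two converted pricelists.
pricelists = {
  "Shop A" => "Intel Core i5-4460\n185\nKingston HyperX 8GB\n70\n",
  "Shop B" => "INTEL CORE I5-4460\n179\n"
}

p lowest_price(pricelists, "i5-4460") # => ["Shop B", 179]
```

Note that this simple pattern would misread prices written with thousands separators or decimals, so the regular expression would need adjusting per shop.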
I'm currently using the pdf-reader gem (http://ift.tt/VZKyCf) to convert the pricelists into text, and then trying to extract each product's price from that text.
An example of a conversion is shown here:
Unfortunately, not all PDFs are created equal. The biggest problem with this approach is that I can't write a single algorithm that accurately captures all the prices.
My biggest problems:
- All stores have differently formatted PDF files.
- For items like RAM, how do I accurately get the price of each variant (4GB/8GB)? For other products I can simply read the next integer on the following line, but how do I do it for RAM, flash cards and the like?
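One possible way to handle multi-variant items, sketched under a strong assumption: the product line lists capacities such as `4GB/8GB` and the following line lists the prices in the same order, such as `25/45`. The helper name `variant_prices` and the sample strings are hypothetical, and a shop that lays out its RAM prices differently would need its own rule.

```ruby
# Hypothetical helper: pairs each capacity found in `name_line` with the
# price at the same position in `price_line`.
# Returns an empty hash when the counts differ, so the caller can flag the
# item for manual review instead of guessing.
def variant_prices(name_line, price_line)
  capacities = name_line.scan(/\d+\s*[GT]B/i).map { |c| c.gsub(/\s+/, "").upcase }
  prices = price_line.scan(/\d+/).map(&:to_i)
  return {} unless capacities.size == prices.size

  capacities.zip(prices).to_h
end

p variant_prices("Kingston DDR3 1600 4GB/8GB", "25/45")
# => {"4GB"=>25, "8GB"=>45}
```

Matching only numbers followed by `GB` or `TB` keeps figures like the `1600` clock speed from being mistaken for a capacity.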
It has also occurred to me that perhaps PDF-to-text conversion isn't the right approach for my goal. My original idea was that the data would be easier to work with after converting it to text. Is there any other way that I can properly parse PDFs for their product prices?
This is my first ever web development venture, during my freshman summer holidays, so I don't really have a professor or tutor to ask. I apologize if I'm not familiar with some of the key concepts when it comes to programming and web development. Thanks in advance :)