I have a web scraper written in Python that extracts data from web pages. It filters out unwanted document types, such as PDF, by checking the file extension in the URL, and it skips those pages during scraping. However, some sites serve large PDF documents from URLs that have no .pdf extension. For example, this one is problematic: https://tel.archives-ouvertes.fr/tel-01801803/document I need to skip such URLs. Is there a way to recognize that a URL serves a PDF document even when there is no .pdf extension?
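
One possible approach, sketched here rather than taken from the scraper itself, is to look at what the server actually sends instead of the URL: the `Content-Type` response header (`application/pdf` for PDFs) and, as a fallback, the `%PDF-` signature at the start of the body. The function names `looks_like_pdf` and `is_pdf_url`, the 1024-byte read size, the timeout, and the `User-Agent` value are all assumptions made for illustration. Only the standard library is used.

```python
from urllib.request import Request, urlopen

PDF_MAGIC = b"%PDF-"  # signature that PDF files start with


def looks_like_pdf(content_type, first_bytes):
    """Return True if the Content-Type header or the leading bytes indicate a PDF."""
    # Drop parameters such as "; charset=..." and normalise case.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "application/pdf":
        return True
    # Fallback for servers that send a generic or wrong header:
    # check the first bytes for the signature, ignoring leading whitespace.
    return first_bytes[:1024].lstrip().startswith(PDF_MAGIC)


def is_pdf_url(url, timeout=10):
    """Fetch only the headers and the first 1024 bytes of url (hypothetical helper)."""
    # Some servers reject requests without a User-Agent; this value is an assumption.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type")
        first_bytes = response.read(1024)  # avoids downloading the whole document
    return looks_like_pdf(content_type, first_bytes)
```

In the scraper, a call such as `if is_pdf_url(url): continue` before parsing would skip these pages. Because only the first 1024 bytes are read before the connection is closed, a large PDF is not downloaded in full. Servers can mislabel content, so neither check is guaranteed to catch every case.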