web: Python: how to recognize that a web page is a PDF document when it has no .pdf extension?

Thursday, April 1, 2021

I have a web scraper written in Python that extracts data from web pages. It filters out unwanted document types, such as PDFs, by checking the file extension in the URL, and skips those pages during scraping. However, some sites serve large PDF documents from URLs that give no indication of a PDF in the extension. For example, https://tel.archives-ouvertes.fr/tel-01801803/document is problematic. I need to skip such pages. Is there a way to recognize that a web page is a PDF document even when the URL has no .pdf extension?
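One possible approach, offered as a minimal sketch rather than a definitive answer: ignore the URL and look at what the server actually sends. The sketch below assumes the server either reports a Content-Type of application/pdf or returns a body that starts with the PDF signature "%PDF-". The helper names looks_like_pdf and is_pdf_url, the User-Agent value, and the 1024-byte read size are illustrative choices, not part of the original scraper.

```python
import urllib.request

# PDF files normally start with this signature.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(content_type, first_bytes):
    """Return True if the Content-Type or the leading bytes indicate a PDF.

    Hypothetical helper: `content_type` is the raw header value (or None),
    `first_bytes` is the beginning of the response body.
    """
    # The header may carry parameters, e.g. "application/pdf; charset=binary".
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "application/pdf":
        return True
    # Fall back to the file signature in case the header is missing or generic.
    return first_bytes.startswith(PDF_MAGIC)


def is_pdf_url(url, timeout=10):
    """Fetch the headers and only the first bytes of `url`, then classify it.

    Hypothetical helper; the User-Agent value is an assumption, since some
    servers reject requests that do not send one.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type", "")
        first_bytes = response.read(1024)  # avoid downloading the whole file
    return looks_like_pdf(content_type, first_bytes)
```

In the scraper loop this could be used as `if is_pdf_url(url): continue`, ideally wrapped in a try/except for network errors. Reading only the first bytes keeps the check cheap even for large documents. A HEAD request would be cheaper still, but not every server supports it or returns a reliable Content-Type for it.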
