I am writing a web scraper that parses an eCommerce site. There are approximately 1,500 pages of listings, all of which are easy to parse. My original approach was to read the last page number from the site and then use a counter in a loop to generate the URL of each successive page, but I know this is quite slow (I am using Python). Is it worthwhile to break the task into chunks and fetch the pages in parallel? I've sketched below roughly what I have in mind.

Additionally, the site has a view setting that shows more items per page: instead of viewing 10 at a time you can view 100, which cuts the number of pages by a factor of 10. That setting isn't part of the URL, though, so it isn't something I can set in the request itself. In this case, would it be wise to interact with the page to select "view 100 per page" and then run the job? I've added a second sketch of that below as well.
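Here is roughly the parallel version I have in mind, a minimal sketch assuming a `requests`/`BeautifulSoup` scraper; the URL pattern, the `.product` selector, and the worker count are all placeholders for whatever the real site uses:

```python
import concurrent.futures

import requests
from bs4 import BeautifulSoup

# Placeholder URL pattern and page count; the real values come from the site.
BASE_URL = "https://example.com/products?page={}"
LAST_PAGE = 1500

def fetch_page(page_number):
    """Fetch one listing page and return its parsed items."""
    response = requests.get(BASE_URL.format(page_number), timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # ".product" is a placeholder selector for whatever wraps each item.
    return [item.get_text(strip=True) for item in soup.select(".product")]

# Fetching is I/O-bound, so a thread pool overlaps the network waits;
# max_workers caps how many requests hit the server at once.
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    pages = pool.map(fetch_page, range(1, LAST_PAGE + 1))
    all_items = [item for page in pages for item in page]

print(f"Scraped {len(all_items)} items")
```

My understanding is that threads are enough here because the work is dominated by waiting on the network rather than by CPU, so the `max_workers` value is mostly a politeness knob toward the server.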
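And for the second question, driving the page with Selenium to flip the per-page setting would look something like this; every URL and locator here is a guess, since I'd need to inspect the real page's markup to find the actual control:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

# Placeholder URL and element ID; the real ones have to be found by
# inspecting the page in the browser's dev tools.
driver = webdriver.Chrome()
driver.get("https://example.com/products")

# If the per-page control is a <select>, this picks the "100" option.
page_size = Select(driver.find_element(By.ID, "page-size"))
page_size.select_by_visible_text("100")

# The page now lists 100 items, so ~150 loads instead of ~1,500.
html = driver.page_source
# ... feed `html` to the existing parser, then step through the pages.

driver.quit()
```

The trade-off I'm unsure about is that Selenium runs a full browser, so each page load is heavier; 150 heavier loads may or may not beat 1,500 light ones.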