Monday, July 27, 2020

How to identify which URLs point to duplicate content?

I have a list of around 20,000 URLs, but I know that some of them point to duplicate content. In my data, this happens most often because multiple URLs resolve to the same location without redirecting to a canonical URL. It also happens across different hosts (e.g., staging servers serving the same pages).

I am looking for a "good enough" way to derive, from my original list, a list of URLs that each point to unique content. The list is small enough that sending GET requests (following redirects) and fetching the page content for every URL is feasible. What would be a good approach for this?
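
To make the question concrete, here is the kind of naive first pass I have in mind: fetch each URL, hash the response body, and keep one URL per distinct hash. This is only a rough sketch, assuming requests is installed; the timeout and the choice of SHA-256 over the raw body are placeholders, and an exact hash obviously only catches byte-identical pages.

    import hashlib
    import requests

    def content_fingerprint(url, session):
        """Fetch a URL (following redirects) and return a hash of its body."""
        response = session.get(url, timeout=10, allow_redirects=True)
        response.raise_for_status()
        # Hashing the raw body only catches exact duplicates; normalizing
        # whitespace or stripping boilerplate would be needed for near-duplicates.
        return hashlib.sha256(response.content).hexdigest()

    def unique_urls(urls):
        """Keep the first URL seen for each distinct content fingerprint."""
        seen = {}
        with requests.Session() as session:
            for url in urls:
                try:
                    fingerprint = content_fingerprint(url, session)
                except requests.RequestException:
                    continue  # skip URLs that fail to load
                seen.setdefault(fingerprint, url)
        return list(seen.values())

Since an exact hash misses pages that differ only in timestamps, ads, or other boilerplate, I suspect a real solution needs some normalization or near-duplicate detection, which is exactly the part I am hoping existing tools handle.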

This seems to be a common problem for anyone writing a web crawler. Are there existing tools that already do the heavy lifting?

I found this relevant question that points to a lot of general approaches for solving this, but I am hoping someone can point me to a more specific solution.

Python preferred, but not required.



