Sunday, September 27, 2020

Script that scrapes website infinitely without scraping duplicates doesn't work

I wrote a simple script, and while the scraping itself works, the part that is supposed to skip duplicates doesn't. I think the logic to avoid scraping duplicates would be:

  1. adding all the links to a list
  2. getting new links and comparing to the 1st list
  3. if the new links aren't in the 1st list, append them to the 1st list
import requests
import time
from bs4 import BeautifulSoup

seen = set()  # hrefs already written, so we can skip duplicates

with open("links.txt", "a") as f:
    while True:
        try:
            URL = 'WEBSITEURL.COM'
            page = requests.get(URL)
            time.sleep(1)

            soup = BeautifulSoup(page.text, 'html.parser')

            data = soup.find_all('div', attrs={'class': 'card-content'})
            for div in data:
                for a in div.find_all('a'):
                    href = a['href']
                    # compare the href string, not the Tag object,
                    # and record it so later passes skip it
                    if href not in seen:
                        seen.add(href)
                        f.write(href + '\n')
                        f.flush()
                        print(href)
        except Exception as e:
            print('something went wrong:', e)
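The three steps above boil down to keeping a running collection of links already seen and only keeping hrefs that aren't in it yet. A minimal sketch of that dedup logic, separated from the scraping so it can be tested without a network (the function name `filter_new_links` is illustrative, not from the original script):

```python
def filter_new_links(hrefs, seen):
    """Return the hrefs not yet in `seen`, adding them to `seen` as a side effect."""
    new = []
    for href in hrefs:
        if href not in seen:
            seen.add(href)
            new.append(href)
    return new

seen = set()
first = filter_new_links(["/a", "/b", "/a"], seen)   # "/a" repeated in one pass
second = filter_new_links(["/b", "/c"], seen)        # "/b" already seen earlier
print(first)   # ['/a', '/b']
print(second)  # ['/c']
```

A set is used instead of a list because `href not in seen` is a constant-time lookup on a set, while on a list it rescans every stored link on each check, which matters in an infinite scraping loop.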


