dimanche 27 septembre 2020

Script that scrapes website infinitely without scraping duplicates doesn't work

I wrote a simple script, and while it works when scraping the website, when trying to make it not scrape duplicates, it doesn't work. I think the logic to make it not scrape duplicates would be:

  1. adding all the links to a list
  2. getting new links and comparing to the 1st list
  3. if new links in second list aren't in 1st list then amend to the 1st list?
import requests
import time
from bs4 import BeautifulSoup
import sys

f = open("links.txt", "a")
list_=[]

while True:
    try:
        URL = f'WEBSITEURL.COM'
        page = requests.get(URL)
        time.sleep(1)

        soup = BeautifulSoup(page.text, 'html.parser')

        data = soup.findAll('div',attrs={'class':'card-content'})
        for div in data:
            links = div.findAll('a')
            for a in links:
                if a not in list_:
                    f.write(a['href'])
                    f.write('\n')
                    print (a['href'])
                elif:
                    continue
    except Exception as e:
        print('something went wrong')
        #continue



Aucun commentaire:

Enregistrer un commentaire