I wrote a simple script, and while it works when scraping the website, when trying to make it not scrape duplicates, it doesn't work. I think the logic to make it not scrape duplicates would be:
- adding all the links to a list
- getting new links and comparing to the 1st list
- if new links in second list aren't in 1st list then amend to the 1st list?
import requests
import time
from bs4 import BeautifulSoup
import sys
f = open("links.txt", "a")
list_=[]
while True:
try:
URL = f'WEBSITEURL.COM'
page = requests.get(URL)
time.sleep(1)
soup = BeautifulSoup(page.text, 'html.parser')
data = soup.findAll('div',attrs={'class':'card-content'})
for div in data:
links = div.findAll('a')
for a in links:
if a not in list_:
f.write(a['href'])
f.write('\n')
print (a['href'])
elif:
continue
except Exception as e:
print('something went wrong')
#continue
Aucun commentaire:
Enregistrer un commentaire