Hello, I've been learning Python for only three months and I have a homework project that I haven't been able to finish. I need some help: I wrote two different scripts and I plan to combine them, but I keep running into errors at this point, so I'd appreciate any suggestions. (At the end of the post I've also sketched roughly how I'm thinking of combining them.)
MY ASSIGNMENT TEXT
Python Projects
Newspaper webpage crawling: It should crawl the websites specified. It should offer the option of storing either the raw HTML or only the news text; the second option should include the news topic and the text. Output files should be .html for raw HTML and .txt for news text. The names of the files can be arbitrary numbers. For each website a folder should be created.
Inputs: websites, crawling depth, storing option (raw HTML or news text), root folder
Output: folders with website names, files containing website data
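Before my code, this is the overall skeleton I think the finished program needs, based on the inputs and outputs above. The crawl_site function, the folder and file names, and the example inputs are just my own placeholders (not part of the assignment), the crawling depth is not handled yet, and the news-text extraction is left as a TODO:

import os
import requests

def crawl_site(url, depth, store_raw, root_folder):
    # NOTE: the crawling depth is not used in this skeleton yet
    # One folder per website, named after the site, inside the root folder
    site_name = url.split("//")[-1].strip("/").replace("/", "_")
    site_folder = os.path.join(root_folder, site_name)
    os.makedirs(site_folder, exist_ok=True)

    # File names are arbitrary numbers: .html for raw HTML, .txt for news text
    extension = ".html" if store_raw else ".txt"
    file_path = os.path.join(site_folder, "1" + extension)

    r = requests.get(url)
    with open(file_path, "w", encoding="utf-8") as f:
        if store_raw:
            f.write(r.text)   # store the raw HTML
        else:
            f.write("")       # TODO: store only the news topic and the text

# Inputs: websites, crawling depth, storing option (raw HTML or news text), root folder
websites = ["https://www.dailystar.co.uk/"]
for site in websites:
    crawl_site(site, depth=1, store_raw=True, root_folder="output")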
My first code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.dailystar.co.uk/'
r = requests.get(url)
source = BeautifulSoup(r.content, "lxml")
title_link = source.find_all("h3", attrs={"class": "title"})

# Write the news titles to a text file
file_text = open("news_title.txt", "w")
file_text.write("-")
for link in title_link:
    print(link.text)
    file_text.write(str(link.text))
    file_text.write("\n-")
file_text.close()

print("*******************************")

# Write the raw HTML of the page to an .html file
file_html = open("html_address.html", "w")
file_html.write(str(source))
file_html.close()

print("*******************************")
My second code (Read News Application):

import requests
from bs4 import BeautifulSoup

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

print("Please wait... fetching the news...\n")
url = 'https://www.dailystar.co.uk/'
istek = requests.get(url, headers=headers)
icerik = istek.content
soup = BeautifulSoup(icerik, "html.parser")

print("THE LINKS AND HEADLINES ARE AS FOLLOWS:\n ------------------------")

haberler = soup.find_all("h3", {"class": "title"})  # news headlines
linkler = soup.find_all("a", {"class": "story"})    # story links

# Print the numbered headlines
sayi = 1
for i in haberler:
    print(sayi, "-)", i.text)
    sayi += 1

# Print each link, fetch the article, and print its text
sayi = 1
for i in linkler:
    print(sayi, "-)", i.get("href"))
    sayi += 1
    istek2 = requests.get(i.get("href"), headers=headers)
    istek_soup = BeautifulSoup(istek2.content, "lxml")
    print(istek2.status_code, "request status")
    metin = istek_soup.find_all("div", {"class": "news-content"})
    for j in metin:
        print(j.text)
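Finally, this is roughly how I'm thinking of combining the two scripts into one depth-limited crawl. The class names ("title", "story", "news-content") are copied from my code above; everything else (the crawl function, the folder names, how I number the files) is just a placeholder I made up, and I haven't tested it properly, so please correct me if this is the wrong approach:

import os
import requests
from bs4 import BeautifulSoup

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

def crawl(url, depth, store_raw, folder):
    # Stop when the requested crawling depth is used up
    if depth < 0:
        return
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")

    # File name is just a running number: .html for raw HTML, .txt for news text
    number = len(os.listdir(folder)) + 1
    if store_raw:
        with open(os.path.join(folder, str(number) + ".html"), "w", encoding="utf-8") as f:
            f.write(str(soup))
    else:
        with open(os.path.join(folder, str(number) + ".txt"), "w", encoding="utf-8") as f:
            for topic in soup.find_all("h3", {"class": "title"}):
                f.write(topic.text.strip() + "\n")
            for block in soup.find_all("div", {"class": "news-content"}):
                f.write(block.text.strip() + "\n")

    # Follow the story links one level deeper
    for a in soup.find_all("a", {"class": "story"}):
        link = a.get("href")
        if link and link.startswith("http"):
            crawl(link, depth - 1, store_raw, folder)

# Inputs from the assignment: website, crawling depth, storing option, root folder
root_folder = "output"
site_folder = os.path.join(root_folder, "dailystar")
os.makedirs(site_folder, exist_ok=True)
crawl("https://www.dailystar.co.uk/", depth=1, store_raw=False, folder=site_folder)

I know it doesn't avoid downloading the same link twice yet, and it only handles one website; I would loop over the website list and create one folder per site, like in the skeleton near the top of this post.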