Monday, September 28, 2020

How can I get my web scraper to skip over information I can't parse so it doesn't break?

I've created this web scraper, and when it runs it breaks because I don't have a rule that keeps it running when it can't parse certain information. I am collecting the names and phone numbers of real estate agents, but not everyone has their number on the website. When I run into a realtor without a number, the script stops working and returns an error. I am a beginner at this and can't find the proper way to get it to keep looping between pages if the required information isn't found; it just stops when it can't find any more information. I am aware this is a noob question, but for the life of me I cannot get it to keep running past the missing information.

import requests
from requests import get
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup 
import numpy as np
from numpy import arange

from time import sleep
from random import randint

headers = {"Accept-Language": "en-US,en;q=0.5"}

my_url = 'https://www.realtor.com/realestateagents/phoenix_az/pg-2'

#opening up connection, grabbing the page
uClient = uReq(my_url)
#read page 
page_html = uClient.read()
#close page
uClient.close()

pages = np.arange(1, 30, 1)

for page in pages:

    page = requests.get("https://www.realtor.com/realestateagents/phoenix_az/pg-2" + str(page) + "&ref_=adv_nxt", headers=headers)


#html parsing
page_soup = soup(page_html, "html.parser")

sleep(randint(2,10))

#finds all realtors on page 
containers = page_soup.findAll("div",{"class":"agent-list-card clearfix"})

for container in containers:
    name = container.find('div', class_='agent-name text-bold')
    agent_name = name.text.strip()

    number = container.find('div', class_='agent-phone hidden-xs hidden-xxs')
    agent_number = number.text.strip()


print("name: " + agent_name)
print("number: " + agent_number)
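A side note on the page loop above: appending `str(page)` to a URL that already ends in `pg-2` produces `pg-21`, `pg-22`, and so on, and because the parsing code sits outside the loop, only `page_html` from the initial request is ever parsed. A minimal sketch of building one URL per page, assuming the `pg-N` suffix seen in `my_url` is how the site paginates (the `page_urls` helper and `BASE_URL` constant are hypothetical names used for illustration):

```python
# Assumption: pages follow the ".../pg-N" pattern visible in my_url.
BASE_URL = "https://www.realtor.com/realestateagents/phoenix_az"


def page_urls(first, last):
    """Return the listing URLs for pages first..last, inclusive."""
    return [f"{BASE_URL}/pg-{n}" for n in range(first, last + 1)]


# Show the first few URLs that the scraper would request.
for url in page_urls(1, 3):
    print(url)
```

Each URL would then be fetched and parsed inside the same loop body, so that every page gets its own soup rather than reusing the first one.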

Below is the rule I wrote to try to solve this issue; however, I am not great at this and not entirely sure where I am going wrong yet.

nv = container.find_all('div', attrs={'number': 'nv'})
number = nv[1].text if len(nv) > 1 else '-'
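A likely cause of the crash is that `container.find(...)` returns `None` when no matching element exists, so `number.text` raises an `AttributeError`. The rule above searches for a `number="nv"` attribute, which does not appear among the class names used earlier, so it probably matches nothing. Here is a minimal sketch of guarding each lookup, run against a small inline HTML sample rather than the live site (the sample markup, the made-up agents, and the `text_or_default` and `parse_agents` helpers are all assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Assumed sample markup that mimics the class names used in the question;
# the second agent deliberately has no phone element.
SAMPLE_HTML = """
<div class="agent-list-card clearfix">
  <div class="agent-name text-bold"> Alice Example </div>
  <div class="agent-phone hidden-xs hidden-xxs"> (555) 010-0000 </div>
</div>
<div class="agent-list-card clearfix">
  <div class="agent-name text-bold"> Bob Example </div>
</div>
"""


def text_or_default(tag, default="-"):
    """Return the stripped text of a tag, or a default if the tag is None."""
    return tag.get_text(strip=True) if tag is not None else default


def parse_agents(html):
    """Return a list of (name, number) tuples, using '-' for missing fields."""
    page_soup = BeautifulSoup(html, "html.parser")
    agents = []
    for container in page_soup.find_all("div", class_="agent-list-card"):
        name = container.find("div", class_="agent-name")
        number = container.find("div", class_="agent-phone")  # may be None
        agents.append((text_or_default(name), text_or_default(number)))
    return agents


# Print inside the loop so every agent is reported, not only the last one.
for agent_name, agent_number in parse_agents(SAMPLE_HTML):
    print("name: " + agent_name)
    print("number: " + agent_number)
```

Checking for `None` before reading the text keeps the loop going for agents without a listed number. Wrapping the lookup in `try`/`except AttributeError` would be another option, though the explicit check makes the missing-field case easier to see.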


