I'm trying to scrape multiple addresses from a site that has an A-to-Z index of links.
First I collect the A-to-Z links with:
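from urllib.request import urlopen  # Python 3; on Python 2 this would be urllib2
from bs4 import BeautifulSoup
URL = "http://www.example.com"
BASE_URL = URL  # assumption: BASE_URL is not defined in the post; hrefs are taken to be site-relative
html = urlopen(URL).read()
soup = BeautifulSoup(html, "lxml")
content = soup.find("div", "view-content")
links = [BASE_URL + li.a["href"] for li in content.findAll("li")]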
This works well: links now holds the URL of each letter's page, and each of those pages can contain several addresses.
To get the addresses I need, I used:
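As a side note on building the links, here is a minimal sketch of joining hrefs to the site root with urljoin instead of string concatenation. The href values are made up for illustration, and BASE_URL is assumed to be the site root, since the post does not show either.

```python
from urllib.parse import urljoin

# Assumed site root; the post uses BASE_URL without showing its value.
BASE_URL = "http://www.example.com"

# Hypothetical hrefs, standing in for the li.a["href"] values on the index page.
hrefs = ["/listing/a", "http://www.example.com/listing/b"]

# urljoin resolves site-relative hrefs against the base and leaves
# already-absolute URLs unchanged, which plain concatenation would not.
links = [urljoin(BASE_URL, href) for href in hrefs]
print(links)
```

This only matters if some hrefs on the page are already absolute; if they are all site-relative, the concatenation above gives the same result.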
for item in links[0:5]:
    try:
        htmlss = urlopen(item).read()
        soup = BeautifulSoup(htmlss, "lxml")
        # find() returns only the first matching tag on the page
        titl = soup.find('div', 'views-field-title').a.contents
        add = soup.find('div', 'views-field-address').span.contents
        zipp = soup.find('div', 'views-field-city-state-zip').span.contents
    except AttributeError:
        continue
The code above visits each link and gets the first address on the A page, the first address on the B page, and so on.
My problem is that some pages contain several addresses, and this code only retrieves the first one on each page.
I've tried using soup.findAll, but it doesn't work with .a.contents or .span.contents.
Basically, I need to extract the address lines from HTML pages where the tags are not unique. If I use soup.findAll, I get all the content for, say, ('div', 'views-field-title'), which includes a lot of content I don't need.
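To make the problem concrete, here is a minimal sketch of one possible approach, not a confirmed fix for the real site. findAll (find_all in current bs4) returns a list of tags, so .a or .span has to be read from each element rather than from the list itself. The sample markup, the views-row wrapper class, and the helper names first_text and extract_addresses are all assumptions for illustration; the real page structure is not shown here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: assumes each address sits in its own "views-row" div.
SAMPLE_HTML = """
<div class="view-content">
  <div class="views-row">
    <div class="views-field-title"><a href="/a/1">Acme Ltd</a></div>
    <div class="views-field-address"><span>1 Main St</span></div>
    <div class="views-field-city-state-zip"><span>Springfield, IL 62701</span></div>
  </div>
  <div class="views-row">
    <div class="views-field-title"><a href="/a/2">Apex Inc</a></div>
    <div class="views-field-address"><span>22 Oak Ave</span></div>
    <div class="views-field-city-state-zip"><span>Albany, NY 12207</span></div>
  </div>
  <div class="views-row">
    <div class="views-field-title"><a href="/a/3">No Address Co</a></div>
  </div>
</div>
"""


def first_text(row, css_class, child):
    """Return the text of the first <child> inside div.css_class, or None."""
    field = row.find("div", css_class)
    node = field.find(child) if field is not None else None
    return node.get_text(strip=True) if node is not None else None


def extract_addresses(html):
    """Collect one (title, street, city_state_zip) tuple per complete row."""
    # "html.parser" keeps the sketch dependency-free; "lxml" would also work.
    soup = BeautifulSoup(html, "html.parser")
    records = []
    # find_all returns every matching row; search inside each row separately.
    for row in soup.find_all("div", "views-row"):
        record = (
            first_text(row, "views-field-title", "a"),
            first_text(row, "views-field-address", "span"),
            first_text(row, "views-field-city-state-zip", "span"),
        )
        if None in record:
            continue  # skip rows missing any of the three fields
        records.append(record)
    return records


addresses = extract_addresses(SAMPLE_HTML)
for title, street, city in addresses:
    print(title, "|", street, "|", city)
```

Searching within each row keeps a title paired with its own address, which calling findAll separately for each field on the whole page would not guarantee when some entries lack a field.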