Wednesday, 15 July 2015

Web scraping with Beautiful Soup: multiple tags

I'm trying to scrape multiple addresses from a site that has an A-to-Z index of links.

First I get A to Z links with:

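from urllib.request import urlopen  # Python 3; the original imports were not shown
from bs4 import BeautifulSoup

URL = "http://www.example.com"
BASE_URL = URL  # assumption: the hrefs in the index are relative to the site root
html = urlopen(URL).read()
soup = BeautifulSoup(html, "lxml")
content = soup.find("div", "view-content")
links = [BASE_URL + li.a["href"] for li in content.findAll("li")]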

This works well: links now holds one URL per letter page, and each of those pages lists several addresses.

To get the addresses, I used:

for item in links[0:5]:
    try:
        htmlss = urlopen(item).read()
        soup = BeautifulSoup(htmlss, "lxml")
        # find() returns only the first matching <div> on the page
        titl = soup.find('div', 'views-field-title').a.contents
        add = soup.find('div', 'views-field-address').span.contents
        zipp = soup.find('div', 'views-field-city-state-zip').span.contents
    except AttributeError:
        # a field is missing on this page, so skip the page
        continue

The code above visits each link and gets the first address on the page of A's, the first address on the page of B's, and so on.

My problem is that some pages contain multiple addresses, and the code above retrieves only the first one on each page.

I've tried using soup.findAll, but it doesn't work with .a.contents or .span.contents, because findAll returns a list of tags rather than a single tag.
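To illustrate the findAll issue, here is a minimal sketch on made-up markup. The SAMPLE string and the names in it are hypothetical and only mimic the class names used above; the real page may be structured differently. The idea is that the .a lookup has to be applied to each element of the result, not to the result as a whole.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page structure may differ.
SAMPLE = """
<div class="views-field-title"><a href="/a1">Alpha Ltd</a></div>
<div class="views-field-title"><a href="/a2">Acme Inc</a></div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# find_all (the newer spelling of findAll) returns a list-like ResultSet,
# so .a has to be read from each matching <div> individually.
divs = soup.find_all("div", "views-field-title")
titles = [div.a.get_text(strip=True) for div in divs if div.a is not None]

print(titles)  # ['Alpha Ltd', 'Acme Inc']
```

The `if div.a is not None` guard plays the same role as the `except AttributeError` in my loop: it skips a block that has no link inside it.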

Basically, I need to extract the address lines from HTML pages where the tags are not unique. If I use soup.findAll, I get all the content for, say, ('div', 'views-field-title'), which includes a lot of content I don't need.
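One possible way to narrow this down, sketched under assumptions: if each address sits inside its own container element, the fields can be looked up per container instead of per page. The class name views-row, the SAMPLE markup, and the helper names text_of and extract_addresses are all hypothetical; I have not confirmed that the real pages are laid out this way.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one "views-row" container per address (an assumption).
SAMPLE = """
<div class="view-content">
  <div class="views-row">
    <div class="views-field-title"><a href="/a1">Alpha Ltd</a></div>
    <div class="views-field-address"><span>1 Main St</span></div>
    <div class="views-field-city-state-zip"><span>Springfield, IL 62701</span></div>
  </div>
  <div class="views-row">
    <div class="views-field-title"><a href="/a2">Acme Inc</a></div>
    <div class="views-field-address"><span>22 Oak Ave</span></div>
  </div>
</div>
"""


def text_of(row, selector):
    """Return the stripped text of the first match inside row, or None."""
    node = row.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


def extract_addresses(html):
    """Collect one dict per container, keeping each record's fields together."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.select("div.views-row"):
        records.append({
            "title": text_of(row, "div.views-field-title a"),
            "address": text_of(row, "div.views-field-address span"),
            "zip": text_of(row, "div.views-field-city-state-zip span"),
        })
    return records


records = extract_addresses(SAMPLE)
for record in records:
    print(record)
```

Searching inside each container keeps a title paired with its own address, and a missing field comes back as None instead of raising AttributeError and skipping the whole page.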



