Monday, 29 May 2017

Parsing a certain webpage in Python

I'm trying to split on every instance of "href" between two specific tags. To be specific, here's what I'm working with:

import re
import urllib2

req = urllib2.Request('http://tv1.alarab.com/')
response = urllib2.urlopen(req)
link = response.read()
target = re.findall(r'<div id="nav">(.*?)</div>', link, re.DOTALL)
for items in target:
    mypath = items.split(' href="/')[1].split('/')[0]
    print mypath

Here's what it prints out:

view-5553

It only prints the first instance. On another website, I use exactly the same approach and it prints every instance, one for each "href" it meets.
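
For illustration, here is a minimal sketch of what the `split(...)[1]` indexing does when a single block contains several links. The `nav_html` string is made-up sample markup (an assumption, not the real page source), and the code is written so it runs on Python 2 and 3 alike.

```python
# Hypothetical nav markup; the real page's HTML may differ.
nav_html = ('<a href="/view-5553/a">A</a> '
            '<a href="/view-5554/b">B</a> '
            '<a href="/view-5555/c">C</a>')

pieces = nav_html.split(' href="/')
# pieces[0] is the text before the first href;
# every later piece starts with one link's path.

# Indexing with [1] keeps only the piece after the first href.
first_only = pieces[1].split('/')[0]

# Iterating over pieces[1:] visits the piece after every href.
all_paths = [piece.split('/')[0] for piece in pieces[1:]]

print(first_only)
print(all_paths)
```

If the real nav block looks anything like this sample, `[1]` would explain why a single value comes back per block, while looping over `pieces[1:]` would yield one value per link.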

Here's what I have for another website:

req = urllib2.Request('http://ift.tt/2rhdmx1')
response = urllib2.urlopen(req)
link = response.read()
target = re.findall(r'<ul class="hidden-xs">(.*?)</ul>', link, re.DOTALL)
for items in target:
    mypath = items.split('href="')[1].split('">')[0]
    print mypath

Here's what this one prints, which is basically what I want the first piece of code to print:

/Album-1104708-1/
/Cat-134-1
/Cat-100-1
/Album-1104855-1/
/Cat-121-1

I tried running the debugger, and it seems the for loop executes only once, for the first match. I'm not sure why or what's going on. Any help would be appreciated.
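
One possible explanation, not verified against the live pages: `re.findall` returns one list element per matched block, so the loop body runs once per `<div id="nav">` or `<ul class="hidden-xs">` found, not once per link. The sketch below uses made-up fragments (`one_div` and `two_uls` are assumptions, not the sites' actual markup) to show the loop counts, plus a second `re.findall` inside each block as one way to collect every href.

```python
import re

# Hypothetical page fragments; the real sites' markup may differ.
one_div = ('<div id="nav"><a href="/view-1/x">X</a>'
           '<a href="/view-2/y">Y</a></div>')
two_uls = ('<ul class="hidden-xs"><li><a href="/Cat-1-1">A</a></li></ul>'
           '<ul class="hidden-xs"><li><a href="/Cat-2-1">B</a></li></ul>')

div_blocks = re.findall(r'<div id="nav">(.*?)</div>', one_div, re.DOTALL)
ul_blocks = re.findall(r'<ul class="hidden-xs">(.*?)</ul>', two_uls, re.DOTALL)

# One list element per matched block, so these are the loop iteration counts.
print(len(div_blocks))
print(len(ul_blocks))

# A second findall inside each block collects every href value it contains.
hrefs = [href
         for block in div_blocks
         for href in re.findall(r'href="([^"]*)"', block)]
print(hrefs)
```

Under these assumptions, the first site would loop once (one nav block holding many links) and the second would loop once per `<ul>`, which matches the behaviour described above. An HTML parser would be a more robust choice than regular expressions for real pages.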



