Tuesday, 18 August 2015

Web crawler if statement error

I am trying to make a web crawler that gives me all the links to images at a given URL, but many of the images that I found in the page source (searching it with "Ctrl+F") were not printed in the output.

My code is:

import requests
from bs4 import BeautifulSoup
import urllib
import os

print ("Which website would you like to crawl?")
website_url = raw_input("--> ")

i = 0
while i < 1:
    source_code = requests.get(website_url)  # The response holds the page source (<html>...</html>)
    plain_text = source_code.text  # Gets only the text from the response
    soup = BeautifulSoup(plain_text, "html5lib")
    for link in soup.findAll('img'):  # Loop over all the images in the page
        src = link.get('src')  # The image URL is stored under 'src' in the HTML
        if 'http://' not in src and 'https://' not in src:
            if src[0] != '/':
                src = '/' + src
            src = website_url + src
        print src
    i += 1  

How should I make my code print EVERY image that is in the HTML page source?

For example, the website has this HTML code:

<img src="http://ift.tt/1K4fQSA" *something* >

But the script didn't print its src.

P.S the script is printing the src in

How should I improve my code to find ALL the images?

Thanks to all the helpers :)
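
For reference, here is a minimal sketch of a more defensive version of the extraction step. It is not a confirmed fix, since the cause of the missing images is not established here. It swaps in the standard-library `html.parser` and `urllib.parse.urljoin` (Python 3) in place of BeautifulSoup and the manual string concatenation, so that it runs without third-party packages. The names `ImageSrcCollector` and `extract_image_urls`, the `example.com` URLs, and the `data-src` fallback are assumptions made for illustration. The sketch guards against two things that are certain about the code above: `link.get('src')` returns `None` for an `<img>` without a `src`, which makes the `in` check raise a `TypeError`, and `website_url + src` builds a wrong address for relative paths when the page URL itself contains a path.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcCollector(HTMLParser):
    """Collects image URLs from <img> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        # Assumption: some pages keep the real URL in data-src (lazy loading).
        src = attrs.get("src") or attrs.get("data-src")
        if not src:
            return  # <img> without a usable URL: skip it instead of crashing
        # urljoin handles absolute, root-relative, relative and //host URLs.
        self.urls.append(urljoin(self.base_url, src.strip()))


def extract_image_urls(html, base_url):
    """Return every image URL found in the given HTML text."""
    parser = ImageSrcCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls


if __name__ == "__main__":
    sample = '<img src="http://ift.tt/1K4fQSA" alt="x"><img src="pics/a.png">'
    for url in extract_image_urls(sample, "http://example.com/blog/post.html"):
        print(url)
```

Note that images inserted into the page by JavaScript after it loads are not part of the text returned by `requests.get`, so no parser-side change will find those.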
