I am trying to write a web crawler that prints the links to all the images at a given URL. However, many of the images that I can find in the page source (by searching with Ctrl+F) are not printed in the output.
My code is:
import requests
from bs4 import BeautifulSoup
import urllib
import os

print ("Which website would you like to crawl?")
website_url = raw_input("--> ")
i = 0
while i < 1:
    source_code = requests.get(website_url)  # The response holds the page source (<html>...</html>)
    plain_text = source_code.text  # Gets only the text from the response
    soup = BeautifulSoup(plain_text, "html5lib")
    for link in soup.findAll('img'):  # Loop over all the images in the page
        src = link.get('src')  # The image URL is located under 'src' in the HTML
        if 'http://' not in src and 'https://' not in src:
            if src[0] != '/':
                src = '/' + src
            src = website_url + src
        print src
    i += 1
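For comparison, here is a minimal sketch of the same "collect every img src and make it absolute" step. It is not a diagnosis of the missing images. It assumes Python 3 (the code above is Python 2) and swaps BeautifulSoup for the standard library's HTMLParser so that it runs without extra packages. The names ImgSrcCollector and image_urls, the sample HTML, and the example.com base URL are hypothetical. Two things it handles differently from the code above: an img tag with no src is skipped (with the code above, link.get('src') returns None for such a tag and the 'in' check raises a TypeError), and relative paths are resolved with urljoin instead of string concatenation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it is fed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:  # skip <img> tags that have no src or an empty one
            # urljoin leaves absolute URLs alone and resolves relative ones
            self.sources.append(urljoin(self.base_url, src))


def image_urls(html, base_url):
    """Return the absolute URL of every <img src=...> found in html."""
    collector = ImgSrcCollector(base_url)
    collector.feed(html)
    return collector.sources


# Hypothetical sample input; in the crawler this would be the fetched page text.
sample = '<img src="http://ift.tt/1K4fQSA"><img src="pics/a.png"><img alt="no src">'
for url in image_urls(sample, "http://example.com/blog/post.html"):
    print(url)
```

With BeautifulSoup the same idea would be to skip tags where link.get('src') is None and pass the result through urljoin(website_url, src).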
How can I make my code print EVERY image that is in the HTML page source?
For example, the website contains this HTML:
<img src="http://ift.tt/1K4fQSA" *something* >
But the script did not print its src.
P.S the script is printing the src in
How should I improve my code to find ALL the images?
Thanks to all the helpers :)