Thursday, 30 June 2016

Can't get all the source code from a web page

I'm trying to build a simple web crawler in Python that saves all my previous Facebook profile photos.

As part of my early attempts, I'm trying to get all the source code from the URL of my Profile Pictures page, and then filter it to keep only the anchor elements that have the class "uiMediaThumb _6i9 uiMediaThumbMedium" (I checked, and the links to all the photos I want have this class).

I'm doing this according to what I learned from Bucky (https://www.youtube.com/watch?v=XjNm9bazxn8).

import requests
from bs4 import BeautifulSoup


def put_source_in_file(text):
    # Save the fetched HTML so it can be inspected by hand.
    with open('temp_source.txt', 'w', encoding='utf-8') as fw:
        fw.write(text)


def trade_spider():
    url = r'http://ift.tt/29spa4E'  # url of my profile photos
    source_code = requests.get(url)
    plain_text = source_code.text
    # Write the decoded text; .content is bytes and cannot be written to a text-mode file.
    put_source_in_file(plain_text)
    soup = BeautifulSoup(plain_text, "html.parser")

    # Print the href of every anchor whose class attribute matches.
    for link in soup.find_all('a', {'class': 'uiMediaThumb _6i9 uiMediaThumbMedium'}):
        print(link.get('href'))


trade_spider()

The problem is that although these anchor elements appear in the original page source, they don't exist in the Response object returned by requests.get. I even copied all the source code to a file and double-checked it: still not there.
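
To make the double check repeatable, here is a minimal sketch that looks for the anchors without BeautifulSoup, using only the standard library. The helper names (ThumbLinkCollector, find_thumb_links) and the two sample HTML strings are made up for illustration; they are stand-ins, not real Facebook markup.

```python
from html.parser import HTMLParser

# Class names taken from the question; all three must be present on the anchor.
WANTED_CLASSES = {"uiMediaThumb", "_6i9", "uiMediaThumbMedium"}


class ThumbLinkCollector(HTMLParser):
    """Collects the href of every <a> whose class list contains WANTED_CLASSES."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute may be missing or valueless, so fall back to "".
        classes = set((attr_map.get("class") or "").split())
        if WANTED_CLASSES <= classes:
            self.hrefs.append(attr_map.get("href"))


def find_thumb_links(html_text):
    """Return the hrefs of matching anchors found in html_text."""
    collector = ThumbLinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.hrefs


# Hypothetical stand-ins for the two versions of the page.
browser_html = '<div><a class="uiMediaThumb _6i9 uiMediaThumbMedium" href="/photo/1">x</a></div>'
fetched_html = '<html><body><form id="login_form"></form></body></html>'

print(find_thumb_links(browser_html))
print(find_thumb_links(fetched_html))
```

To run it against the saved file, read temp_source.txt with encoding='utf-8' and pass the text to find_thumb_links. An empty list there would mean the anchors are missing from what the server sent back, rather than being lost during parsing.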

Can anyone help?

Thanks =)



