Tuesday, 28 August 2018

Crawling Google search result URLs with Python

I'd like to scrape the result URLs from a Google search with Python.

Here's my code:

import requests
from bs4 import BeautifulSoup

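def search(keyword):
    # Request up to 100 results for the keyword
    html = requests.get('https://www.google.co.kr/search?q={}&num=100&sourceid=chrome&ie=UTF-8'.format(keyword)).text
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    # Look for the link inside each <h3 class="r"> element
    for i in soup.find_all('h3', {'class': 'r'}):
        # [7:] drops the first 7 characters of the href (e.g. '/url?q=')
        result.append(i.find('a', href=True)['href'][7:])
    return result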

search('computer')

This returns a list of results. The first URL in the list is the Wikipedia page, which comes back like this (shown together with the entry that follows it):

'https://en.wikipedia.org/wiki/Computer&sa=U&ved=0ahUKEwixyfu7q5HdAhWR3lQKHUfoDcsQFggTMAA&usg=AOvVaw2nvT-2sO4iJenW_fkyCS3i', '?q=computer&num=100&ie=UTF-8&prmd=ivnsbp&tbm=isch&tbo=u&source=univ&sa=X&ved=0ahUKEwixyfu7q5HdAhWR3lQKHUfoDcsQsAQIHg'

I want clean URLs: in this case 'https://en.wikipedia.org/wiki/Computer', and likewise for all the other search results.

How can I modify my code?
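
One possible direction, sketched here under the assumption that each result href has the form '/url?q=<target>&sa=...' (which is what the output above suggests), is to parse the href instead of slicing it. The helper name clean_google_href is hypothetical; urlparse and parse_qs come from the standard library.

```python
from urllib.parse import urlparse, parse_qs

def clean_google_href(href):
    """Return the target URL from a '/url?q=...' href, or None otherwise.

    Assumes the redirect format '/url?q=<target>&sa=...&ved=...'.
    """
    parsed = urlparse(href)
    # Skip anything that is not a redirect link (e.g. '/search?...' entries)
    if parsed.path != '/url':
        return None
    # parse_qs splits the query on '&' and percent-decodes the values
    targets = parse_qs(parsed.query).get('q')
    return targets[0] if targets else None

href = ('/url?q=https://en.wikipedia.org/wiki/Computer'
        '&sa=U&ved=0ahUKEwixyfu7q5HdAhWR3lQKHUfoDcsQFggTMAA')
print(clean_google_href(href))
```

Inside search(), this could stand in for the [7:] slice, with entries that come back as None (such as the '/search?q=computer...' image link) left out of the result list.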



