I am trying to gather some data from a website for a project and I'd like to automate this with Python. For example, let's say I want to find all the Wikipedia pages in the Northern Sami language. In a browser I would simply type the following into Google:
site:se.wikipedia.org/wiki/
and Google claims to find 13,000 pages whose URLs I would like to collect. I'm pretty new to programming and I've looked at a bunch of earlier questions but couldn't find an answer that worked for me.
At first I thought I would use Google's Custom Search Engine API. I got this to work, but apparently it only returns the first 100 results (in installments of 10), and there is no way to change this, not even by paying Google (you can increase the number of queries per day, but not the number of results per query).
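For reference, my paging loop against the Custom Search JSON API looked roughly like this. API_KEY and CX are placeholders for my real credentials; the API serves results in pages of 10 and rejects start values above 91, which is where the 100-result cap comes from:

import requests  # third-party; urllib would work the same way

API_KEY = 'MY_API_KEY'   # placeholder for my real API key
CX = 'MY_ENGINE_ID'      # placeholder for my search engine ID

urls = []
for start in range(1, 101, 10):  # pages of 10; start > 91 is rejected
    resp = requests.get(
        'https://www.googleapis.com/customsearch/v1',
        params={'key': API_KEY, 'cx': CX,
                'q': 'site:se.wikipedia.org/wiki/', 'start': start})
    items = resp.json().get('items', [])
    if not items:
        break
    urls.extend(item['link'] for item in items)

print(len(urls))  # never exceeds 100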
Then I thought I would just request the search results as a browser would and get the URLs from the HTML. I can indeed get the HTML with:

import urllib.request

# Search for site:se.wikipedia.org/wiki/ (URL-encoded); 'start' selects the results page
url = 'https://www.google.com/search?q=site%3Ase.wikipedia.org%2Fwiki%2F&start=0'
# Pretend to be a regular browser; without a User-Agent Google rejects the request
my_header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
req = urllib.request.Request(url, headers=my_header)
content = urllib.request.urlopen(req).read()
which I can iterate by changing the start parameter. However, these results only go up to page 31 (about 300 results), far short of the 13,000 results Google claims to find. The same thing happens in the browser. There are certainly more than 300 pages on se.wikipedia.org. Is there a way of doing a more thorough search?
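For completeness, here is the full pagination loop I'm running. The /url?q= unwrapping is my guess at how Google wraps result links in the HTML; that markup changes often, so treat the link extraction as an assumption rather than a stable API:

import time
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects hrefs pointing at se.wikipedia.org from one results page."""
    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href', '')
        # Google often wraps result links as /url?q=<target>&... (assumption)
        if href.startswith('/url?q='):
            href = urllib.parse.unquote(href[7:].split('&')[0])
        if href.startswith('https://se.wikipedia.org/wiki/'):
            self.links.add(href)

my_header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
all_links = set()
for start in range(0, 400, 10):  # stops yielding new results around start=300
    url = ('https://www.google.com/search?'
           'q=site%3Ase.wikipedia.org%2Fwiki%2F&start=' + str(start))
    req = urllib.request.Request(url, headers=my_header)
    html = urllib.request.urlopen(req).read().decode('utf-8', 'replace')
    parser = LinkCollector()
    parser.feed(html)
    if not parser.links - all_links:
        break  # no new results on this page, so stop
    all_links |= parser.links
    time.sleep(2)  # be gentle; rapid requests trigger a CAPTCHA

print(len(all_links))  # tops out around 300, not 13,000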