Tuesday, December 29, 2020

Web Scraping with Python by Ryan Mitchell Chapter 3

I am trying to teach myself web scraping with Python. I came across a line of code that I cannot fully understand:

for link in bs.find_all('a', href=re.compile('^(/|.*'+includeUrl+')')):

Here is the larger code snippet it comes from:

from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
random.seed(datetime.datetime.now())

#Retrieves a list of all Internal links found on a page
def getInternalLinks(bs, includeUrl):
    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme, urlparse(includeUrl).netloc)
    internalLinks = []
    #Finds all links that begin with a "/"
    for link in bs.find_all('a', href=re.compile('^(/|.*'+includeUrl+')')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                if(link.attrs['href'].startswith('/')):
                    internalLinks.append(includeUrl+link.attrs['href'])
                else:
                    internalLinks.append(link.attrs['href'])
    return internalLinks

I understand that the function first rewrites includeUrl so that it contains only the scheme and the netloc, producing a well-formed base URL. As an example, if we use the following URL, we would get this result:

'https://stackoverflow.com' is the url,

'https' is the scheme,

'stackoverflow.com' is the netloc.
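To check my understanding of that step, here is a quick sketch using only the standard library (the URL is my own example, not one from the book):

```python
from urllib.parse import urlparse

url = 'https://stackoverflow.com/questions'  # hypothetical example URL
parsed = urlparse(url)

print(parsed.scheme)  # 'https'
print(parsed.netloc)  # 'stackoverflow.com'

# This is the same reconstruction the book's function performs:
base = '{}://{}'.format(parsed.scheme, parsed.netloc)
print(base)  # 'https://stackoverflow.com'
```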

So what exactly does this expression do if the href is already a well-formed link like https://www.facebook.com? Does it only match incomplete (relative) links, and can someone give me an example of how to interpret it correctly?
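To make the question concrete, here is how I would expect the pattern to behave on a few sample hrefs (my own sketch; the base URL and hrefs are made up):

```python
import re

includeUrl = 'https://stackoverflow.com'  # hypothetical base URL
# Same pattern as in the book: href starts with '/', or contains includeUrl
pattern = re.compile('^(/|.*' + includeUrl + ')')

# A relative link: starts with '/', so it matches
print(bool(pattern.search('/questions/ask')))                  # True
# An absolute link on the same site: contains the base URL, so it matches
print(bool(pattern.search('https://stackoverflow.com/tags')))  # True
# An external absolute link: matches neither alternative
print(bool(pattern.search('https://www.facebook.com')))        # False
```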

Thank you.
