Tuesday, December 29, 2020

Web Scraping with Python by Ryan Mitchell Chapter 3

I am trying to teach myself web scraping with Python. I came across a line of code that I cannot fully understand:

for link in bs.find_all('a', href=re.compile('^(/|.*'+includeUrl+')')):

Here is the larger code snippet it comes from:

from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
random.seed(datetime.datetime.now())

#Retrieves a list of all Internal links found on a page
def getInternalLinks(bs, includeUrl):
    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme, urlparse(includeUrl).netloc)
    internalLinks = []
    #Finds all links that begin with a "/"
    for link in bs.find_all('a', href=re.compile('^(/|.*'+includeUrl+')')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                if(link.attrs['href'].startswith('/')):
                    internalLinks.append(includeUrl+link.attrs['href'])
                else:
                    internalLinks.append(link.attrs['href'])
    return internalLinks

I understand that the function first rewrites includeUrl so that it contains only the scheme and the netloc, producing a well-formed base URL. As an example, if we use the following URL, we would get this result:

'https://stackoverflow.com' is the url,

'https' is the scheme,

'stackoverflow.com' is the netloc.
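To check my understanding of that step, here is a quick sketch using only the standard library (the URL is my own example, not one from the book):

```python
from urllib.parse import urlparse

url = 'https://stackoverflow.com/questions'  # hypothetical example URL
parsed = urlparse(url)

print(parsed.scheme)  # 'https'
print(parsed.netloc)  # 'stackoverflow.com'

# This is the same reconstruction the book's function performs:
base = '{}://{}'.format(parsed.scheme, parsed.netloc)
print(base)  # 'https://stackoverflow.com'
```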

So what exactly does this expression do if the href is already a well-formed link like https://www.facebook.com? Does it only match incomplete (relative) links, and can someone give me an example of how to interpret it correctly?
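To make the question concrete, here is how I would expect the pattern to behave on a few sample hrefs (my own sketch; the base URL and hrefs are made up):

```python
import re

includeUrl = 'https://stackoverflow.com'  # hypothetical base URL
# Same pattern as in the book: href starts with '/', or contains includeUrl
pattern = re.compile('^(/|.*' + includeUrl + ')')

# A relative link: starts with '/', so it matches
print(bool(pattern.search('/questions/ask')))                  # True
# An absolute link on the same site: contains the base URL, so it matches
print(bool(pattern.search('https://stackoverflow.com/tags')))  # True
# An external absolute link: matches neither alternative
print(bool(pattern.search('https://www.facebook.com')))        # False
```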

Thank you.
