I am trying to teach myself Python web scraping, and I came across a line of code that I cannot fully understand. The line in question is:
for link in bs.find_all('a', href=re.compile('^(/|.*'+includeUrl+')')):
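Here is my own attempt at spelling the pattern out (assuming includeUrl has already been rebuilt to a base like 'https://example.com', which is made up for illustration):

import re

# With includeUrl = 'https://example.com', the string handed to
# re.compile() is:  ^(/|.*https://example.com)
# As I read it, an href matches if it either starts with "/"
# (a relative link) or contains the site's own base URL
# (an absolute internal link).
includeUrl = 'https://example.com'
pattern = re.compile('^(/|.*' + includeUrl + ')')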
The larger code snippet it comes from is here:
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
random.seed(datetime.datetime.now())

# Retrieves a list of all internal links found on a page
def getInternalLinks(bs, includeUrl):
    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme,
                                  urlparse(includeUrl).netloc)
    internalLinks = []
    # Finds all links that begin with a "/"
    for link in bs.find_all('a', href=re.compile('^(/|.*'+includeUrl+')')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                if link.attrs['href'].startswith('/'):
                    internalLinks.append(includeUrl+link.attrs['href'])
                else:
                    internalLinks.append(link.attrs['href'])
    return internalLinks
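For reference, here is a small check I put together that runs against the function above (the HTML and the example.com URLs are made up for illustration):

html = '''
<a href="/about">relative internal</a>
<a href="https://example.com/contact">absolute internal</a>
<a href="https://other-site.com/page">external</a>
'''
bs = BeautifulSoup(html, 'html.parser')
print(getInternalLinks(bs, 'https://example.com/some/page'))
# prints: ['https://example.com/about', 'https://example.com/contact']
# The relative "/about" gets the base prepended, the absolute internal
# link is kept as-is, and the external link is filtered out.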
I understand that the first line of the function already rebuilds the includeUrl parameter from the scheme and the netloc to create a full-fledged base link. As an example, if we use the following URL, we would get this result:
'https://stackoverflow.com' is the url,
'https' is the scheme,
'stackoverflow.com' is the netloc.
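That matches what I see when I try urlparse directly (my own quick check, with a made-up path for illustration):

from urllib.parse import urlparse

parsed = urlparse('https://stackoverflow.com/questions/12345')
print(parsed.scheme)   # https
print(parsed.netloc)   # stackoverflow.com
print('{}://{}'.format(parsed.scheme, parsed.netloc))   # https://stackoverflow.com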
So what exactly does this line do if the href is already a well-formed link like https://www.facebook.com? Does it only matter for incomplete (relative) links, and can someone give me an example of how to interpret it correctly?
Thank you.