Thursday, February 27, 2020

Slow Web Scraping For Loop - Python

I am trying to scrape the SEC's website to grab company headquarters data. The function is below, and it just takes way too long to execute. Is there a better way of running it to make it faster and more efficient? Thanks so much!

import numpy as np
import requests
from lxml import html

def get_hq(Companies):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}
    base_url = 'https://sec.report/CIK/{}'
    for i, row in Companies.iterrows():
        # One HTTP request per company -- this sequential waiting is where the time goes
        response = requests.get(base_url.format(row['CIK']), headers=headers)
        raw_html = html.fromstring(response.text)
        # The cell next to the "Business Address" label holds the headquarters
        hq = raw_html.xpath('//tr[./td[contains(text(),"Business Address")]]/td[2]/text()')
        # .loc avoids chained-indexing warnings; fall back to NaN when no address is found
        Companies.loc[i, 'Headquarters'] = hq[0] if hq else np.nan
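For context, almost all of the time here is spent waiting on one HTTP response after another, so reusing a connection and issuing the requests concurrently is a common way to speed this kind of loop up. Below is a minimal sketch using requests.Session and concurrent.futures.ThreadPoolExecutor; the fetch_hq helper and the max_workers value are illustrative choices, not part of the original code.

import numpy as np
import requests
from lxml import html
from concurrent.futures import ThreadPoolExecutor

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}
BASE_URL = 'https://sec.report/CIK/{}'

session = requests.Session()  # reuse one connection instead of opening a new one per request

def fetch_hq(cik):
    # Fetch one company page and pull the cell after the "Business Address" label
    response = session.get(BASE_URL.format(cik), headers=HEADERS)
    hq = html.fromstring(response.text).xpath(
        '//tr[./td[contains(text(),"Business Address")]]/td[2]/text()')
    return hq[0] if hq else np.nan

# Run the requests in a small thread pool; map() returns results in input order
with ThreadPoolExecutor(max_workers=10) as pool:
    Companies['Headquarters'] = list(pool.map(fetch_hq, Companies['CIK']))

The thread pool overlaps the network waits, which is usually the bottleneck, while the parsing itself stays cheap. If sec.report throttles aggressive clients, a smaller max_workers or a short delay between requests may be needed.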



