Tuesday, December 29, 2020

Scraping a lazy-loading website using Selenium

I am building a simple web-scraping project around Goibibo.com and am trying to work on this page specifically.

Generally, my approach to scraping a dynamic website is to scroll down until enough data has loaded, then feed the HTML into BeautifulSoup. But this page uses a kind of lazy loading (list virtualization): only the elements currently visible to the user are rendered in the DOM, and the ones the user scrolls past are removed again. So when I parse the page source, I find only about 6-10 hotels in the soup.

I want to find a way to extract all of the hotels on this page. My first thought was to scroll each listing's div into view one at a time, but I am not sure how to achieve that.
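For reference, a specific element can be scrolled into view by passing it to execute_script along with a small JavaScript snippet. Below is a minimal sketch of that idea; the XPath for the hotel cards is a placeholder (it assumes schema.org microdata, which would need to be verified against the live page in DevTools):

# Minimal sketch: scroll each hotel card into the viewport one by one.
# The XPath is a placeholder; copy the real card locator from DevTools.
cards = driver.find_elements_by_xpath("//div[@itemtype='http://schema.org/Hotel']")
for card in cards:
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", card)
    sleep(0.5)  # give the virtualized list time to render the card
    print(card.text.split("\n")[0])  # first line of the card text, often the name

One caveat: because the list is virtualized, cards that scroll out of view are dropped from the DOM, so WebElements collected up front can go stale mid-loop. The sketch at the end of the post sidesteps this by re-parsing the page on every scroll step.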

Here's what I have done so far:

from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver

# chromedriver.exe is the driver executable, not the Chrome browser binary,
# so it belongs in executable_path, not in ChromeOptions.binary_location
driver = webdriver.Chrome(executable_path="chromedriver.exe")

link = "https://www.goibibo.com/hotels/hotels-in-delhi-ct/"
driver.get(link)
driver.maximize_window()
sleep(2)

driver.execute_script("window.scrollTo(0, document.body.scrollHeight * 0.75)")  # 75% because the footer is tall
sleep(2)

# Click the "show more" button at the bottom of the initial list
showMore = driver.find_elements_by_xpath("//*[@id='root']/div[2]/div/section[2]/button")
driver.execute_script("window.scrollTo(0, document.body.scrollHeight * 0.75)")
showMore[0].click()

for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    sleep(1)
    # Hacky: loading stalled for some reason until I scrolled again,
    # so scroll back up 2000px and wait before the next pass
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight - 2000)")
    sleep(2)

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
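Because the virtualized list drops cards that scroll out of view, parsing page_source once at the very end can only ever capture roughly the last screenful. A workable pattern is to parse inside the scroll loop and deduplicate as you go. A rough sketch, assuming the hotel names are exposed via an itemprop='name' attribute (a guess about the markup; inspect the live page and substitute the real tags or class names):

# Rough sketch: harvest hotel names on every scroll step and deduplicate,
# so cards removed by the virtualized list are not lost.
seen = set()
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    sleep(1)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight - 2000)")
    sleep(2)

    soup = BeautifulSoup(driver.page_source, "html.parser")
    for tag in soup.select("[itemprop='name']"):  # hypothetical selector
        name = tag.get_text(strip=True)
        if name:
            seen.add(name)

print(len(seen), "unique hotels collected")

The loop structure is the same as above; the only change is that the soup is rebuilt on every iteration instead of once at the end, so nothing is lost when the list recycles its DOM nodes.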


