Thursday, October 21, 2021

Webscraping not loading full page

I am trying to get the first hundred results from a web page, but I am getting only the first 20: https://www.usnews.com/education/best-high-schools/search?national-rank-range-min=1&national-rank-range-max=100

I used the following code:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

url = "https://www.usnews.com/education/best-high-schools/search?national-rank-range-min=1&national-rank-range-max=100"
driver = webdriver.Chrome()
driver.get(url)
time.sleep(10)  
scroll_pause_time = 1 
screen_height = driver.execute_script("return window.screen.height;")  
i = 1
print(screen_height)

while True:
    driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))  
    i += 1
    time.sleep(scroll_pause_time)
    # update scroll height each time after scrolled, as the scroll height can change after we scrolled the page
    scroll_height = driver.execute_script("return document.body.scrollHeight;")  
    # Break the loop when the height we need to scroll to is larger than the total scroll height
    if (screen_height) * i > scroll_height:
        break
    
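# Click "Load More" until the button can no longer be found or clicked.
while True:
    try:
        # The string is a class name, not an id, so match it with a CSS selector.
        loadmore = driver.find_element("css selector", "[class*='pager__ButtonContentContainer']")
        loadmore.click()
        time.sleep(scroll_pause_time)  # give the next batch of results time to load
    except Exception:
        print("Reached bottom of page")
        break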


html_source = driver.page_source  


soup = BeautifulSoup(html_source,'html.parser') 

...

I tried different approaches, but none of them loads the full page through automation. Even the page source shows only the first 20 results. I want to get the first 100 results.
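
To make the goal concrete, here is a minimal sketch of the behavior I am after: keep clicking "Load More" until at least 100 results are present, and stop if a click adds nothing new. The names load_until and selenium_callbacks are hypothetical helpers, and item_css and button_css are placeholder CSS selectors that would have to be read from the live page; none of them come from the site or from Selenium itself.

```python
import time


def load_until(count_items, click_load_more, target=100,
               max_clicks=20, pause=1.0, timeout=10.0):
    """Call click_load_more() until count_items() reaches target.

    Stops early if the click fails, if max_clicks is reached, or if a
    click adds no new items within `timeout` seconds.
    Returns the final item count.
    """
    clicks = 0
    while count_items() < target and clicks < max_clicks:
        before = count_items()
        if not click_load_more():
            break  # button missing or not clickable
        clicks += 1
        # Poll until the item count grows or the timeout expires.
        deadline = time.monotonic() + timeout
        while count_items() <= before and time.monotonic() < deadline:
            time.sleep(pause)
        if count_items() <= before:
            break  # the click loaded nothing new
    return count_items()


def selenium_callbacks(driver, item_css, button_css):
    """Build the two callbacks for a Selenium driver.

    item_css and button_css are assumptions: placeholder selectors for
    one result row and for the "Load More" button.
    """
    # Imported here so load_until can be used without Selenium installed.
    from selenium.common.exceptions import WebDriverException

    def count_items():
        return len(driver.find_elements("css selector", item_css))

    def click_load_more():
        try:
            button = driver.find_element("css selector", button_css)
            # A JavaScript click avoids failures when an overlay covers the button.
            driver.execute_script("arguments[0].click();", button)
            return True
        except WebDriverException:
            return False

    return count_items, click_load_more
```

With a driver that has already opened the URL, this would be used as count, click = selenium_callbacks(driver, item_css, button_css) followed by load_until(count, click, target=100), and only then would driver.page_source be handed to BeautifulSoup.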



