Wednesday, 30 September 2020

Python requests: get the entire HTML page instead of only the initially loaded content

I am trying to collect review data that is publicly available on the PlayStore. Because the provided API only returns reviews for one's own apps, I am trying to scrape the reviews from the web instead.

I am using the requests package to get the HTML page of a given app on the PlayStore, and BeautifulSoup to parse it and save it to a file, so that I can then extract the relevant content (the rating and comment of each user).

My issue is that requests.get(URL) does not retrieve the entire content of the page. Following the "Read All Reviews" link for an app on the PlayStore leads to a page with all reviews for that app. However, only a limited set of reviews is loaded when the page first opens; the rest load only after scrolling down to the bottom. Calling requests.get(URL) therefore returns only that initial set of reviews instead of all of them.

To see this, navigate to https://play.google.com/store/apps/details?id=com.bendingspoons.thirtydayfitness&hl=en&showAllReviews=true and notice that older reviews load only when you scroll to the bottom of the page.

Is there a way to access the entire page, trigger the loading of more reviews, or simulate the scrolling?

Below is my code:

import requests
from bs4 import BeautifulSoup

# get reviews for Thirty Days of Fitness app
URL = "https://play.google.com/store/apps/details?id=com.bendingspoons.thirtydayfitness&hl=en&showAllReviews=true"

# make request (named `response` so it is not confused with the `requests` module)
response = requests.get(URL)
# extract HTML text
raw_text = response.text

# parse HTML and prettify
soup = BeautifulSoup(raw_text, 'html.parser')
text = soup.prettify()

# write to file
save_path = './thirtydayfitness_html.txt'
with open(save_path, 'w', encoding=response.encoding) as f:
    f.write(text)
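
For reference, one possible direction for the "simulate the scrolling" idea is to drive a real browser instead of calling requests directly. The following is only a minimal sketch under assumptions that are not part of my setup above: it uses Selenium (a separate package), assumes Chrome and a matching driver are available, and assumes that scrolling the window to the bottom is what triggers loading more reviews. The helper names scroll_to_bottom and fetch_rendered_html are hypothetical, and the page may require other interaction (such as clicking a button) that this sketch does not handle.

```python
import time


def scroll_to_bottom(driver, pause_seconds=2.0, max_rounds=50):
    """Scroll until the page height stops growing.

    `driver` is anything with a Selenium-style execute_script() method.
    Returns the number of scrolls that caused the page to grow.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the document.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page time to fetch and append the next batch of content.
        time.sleep(pause_seconds)
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was appended, assume we reached the end
        last_height = new_height
        rounds += 1
    return rounds


def fetch_rendered_html(url):
    """Open `url` in Chrome, scroll to the end, and return the rendered HTML."""
    # Imported lazily so scroll_to_bottom can be used without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are installed
    try:
        driver.get(url)
        scroll_to_bottom(driver)
        return driver.page_source
    finally:
        driver.quit()
```

With something like this, the string returned by fetch_rendered_html(URL) could be passed to BeautifulSoup in place of the text returned by requests. The max_rounds cap is there so the loop cannot run forever on a page that keeps growing.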


