I've been playing around with scraping webpages using BeautifulSoup for a few weeks now. An issue I recently ran into, and hadn't seen before, is that the content displayed on a webpage differs from both what's shown as the page's source code and what's returned in the response to a URL request.
For example, let's look at Yelp. This (http://ift.tt/1HlF5Kd) will bring up all 63k businesses in the Pittsburgh, PA area. If we look at the page's source, we see that it matches the content (if you search for the word "Showing", you find the code below):
<span class="pagination-results-window">
Showing 1-10 of 63936
</span>
Now, let's look only at restaurants in the Pittsburgh, PA area. This reduces the number of returned results from 63k to 5k. However, if we look at the page's source, the same code shown above appears. Moreover, the first returned result in the page source matches the 63k page, not the 5k page. At first, I thought this might be due to Mozilla caching webpage content, but I quickly nixed this idea by scraping the link for the 5k restaurants (http://ift.tt/1HlF5Kf). The scraped HTML was the markup for the page with 63k businesses, not the 5k restaurants that I was expecting.
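A minimal sketch of this kind of check, assuming the HTML has already been fetched into a string: it pulls the total out of the `pagination-results-window` span so the counts from two responses can be compared. The names `ResultsWindowParser` and `total_results` are hypothetical, and the sketch uses only the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra installs (in BeautifulSoup the equivalent lookup would be `soup.find("span", class_="pagination-results-window")`).

```python
import re
from html.parser import HTMLParser


class ResultsWindowParser(HTMLParser):
    """Collects the text inside <span class="pagination-results-window">.

    Hypothetical helper for illustration; it assumes the span has no
    nested <span> elements, as in the snippet from the page source.
    """

    def __init__(self):
        super().__init__()
        self._inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "pagination-results-window" in classes:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.text += data


def total_results(html):
    """Return N from 'Showing 1-10 of N', or None if the span is missing."""
    parser = ResultsWindowParser()
    parser.feed(html)
    match = re.search(r"of\s+(\d+)", parser.text)
    return int(match.group(1)) if match else None


# The snippet from the page source, used here as sample input.
sample = """
<span class="pagination-results-window">
Showing 1-10 of 63936
</span>
"""
print(total_results(sample))
```

Running `total_results` on the HTML returned for each of the two links and comparing the numbers shows whether the server actually sent different markup for the filtered search.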
My question is: what is causing this? Is it done intentionally by Yelp, or is it caused by something external? I've tried looking this up on my own, but I'm unable to find anything that explains this using the verbiage in this question's title.
Let me know if you need more details; I'm happy to provide the few lines of code that I left out.
Thanks!