Friday, February 28, 2020

Crawling problem with page_source in Python

I am trying to crawl the product title and other information from a website. The site has categories, each category has pages, and each page shows a product list. I click one of the products in the list, crawl its information (title, etc.), then go back to the list and click the next product, repeating until I reach a given page and category.

This is my logic.

The problem is inside the loop 'for i in range(0, 20):'. After

    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')

'html' and 'soup' hold the page source of the product list, not the information about the product I clicked. I tried getting the current URL and loading it again, but the source kept holding the previous page's information.

I would appreciate any help with this.

import time

from bs4 import BeautifulSoup
from selenium import webdriver


def get_search_page_url(category, page):
    # note: category and page end up after the '#', i.e. in the URL fragment
    return 'https://www.missycoupons.com/zero/board.php#id=hotdeals&category={}&page={}'.format(category, page)
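An aside that may be relevant here (my observation, not stated in the original post): everything after the '#' in the URL built above is a URL fragment. Browsers do not send the fragment to the server, and navigating to a URL that differs only in its fragment does not trigger a new page load, which could explain why re-getting the current URL returns the old source:

```python
from urllib.parse import urlparse

# The URL produced by get_search_page_url(1, 1); category and page
# sit after the '#', i.e. inside the fragment.
url = 'https://www.missycoupons.com/zero/board.php#id=hotdeals&category=1&page=1'
parts = urlparse(url)
print(parts.path)      # → /zero/board.php
print(parts.fragment)  # → id=hotdeals&category=1&page=1
```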

def get_prod_items(prod_items):
    prod_data = []

    for prod_item in prod_items:
        try:
            title = prod_item.select('div.rp-list-table-row.normal.post')[0].text.strip()
        except IndexError:  # selector matched nothing on this item
            title = ''
        prod_data.append([title])

    return prod_data
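To sanity-check the parsing step in isolation, get_prod_items can be run against a small static HTML snippet. The snippet below is invented to match the selectors used in the question; the real page markup may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML shaped the way the selectors expect:
# a div#mc_view_title wrapping a row div with the title text.
sample_html = """
<div id="mc_view_title">
  <div class="rp-list-table-row normal post"> Product A </div>
</div>
"""

def get_prod_items(prod_items):
    prod_data = []
    for prod_item in prod_items:
        try:
            title = prod_item.select('div.rp-list-table-row.normal.post')[0].text.strip()
        except IndexError:
            title = ''
        prod_data.append([title])
    return prod_data

soup = BeautifulSoup(sample_html, 'html.parser')
result = get_prod_items(soup.select('div#mc_view_title'))
print(result)  # → [['Product A']]
```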

#####
driver = webdriver.Chrome('C:/chromedriver.exe')

driver.implicitly_wait(10)

prod_data_total = []
for category in range(1, 2):
    for page in range(1, 2): 

        url = get_search_page_url(category, page)
        driver.get(url)

        time.sleep(15)

        for i in range(0, 20):
            driver.find_elements_by_css_selector("div.rp-list-table-cell.board-list.mc-l-subject>a")[i].click()
            url = driver.current_url
            driver.get(url)
            # problem: html/soup still hold the product list page here,
            # not the product page that was clicked
            html = driver.page_source
            soup = BeautifulSoup(html, 'html.parser')

            prod_items = soup.select('div#mc_view_title')
            prod_item_list = get_prod_items(prod_items)

            prod_data_total = prod_data_total + prod_item_list

            driver.back()
            time.sleep(5)
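One possible rework of the inner loop, assuming the culprit is the driver.get(driver.current_url) round-trip: if the list URL differs from the product URL only in its fragment, that call never navigates, so page_source stays on the previous page. The sketch below simply drops the re-get (after click() the product page is already loaded), waits explicitly for the title element instead of sleeping, and re-finds the list links after each driver.back() because old element references go stale. parse_product_title is a hypothetical helper, split out so the parsing can be tested on static HTML; the By/WebDriverWait calls use the Selenium 4 API rather than find_elements_by_css_selector:

```python
from bs4 import BeautifulSoup

def parse_product_title(html):
    """Pure parsing step, separated out so it is testable without a browser."""
    soup = BeautifulSoup(html, 'html.parser')
    node = soup.select_one('div#mc_view_title')
    return node.text.strip() if node else ''

def crawl_page(driver, n_products=20, wait_seconds=10):
    # Hypothetical replacement for the inner loop; requires a live driver.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    link_css = 'div.rp-list-table-cell.board-list.mc-l-subject>a'
    titles = []
    for i in range(n_products):
        # re-find the links each iteration: driver.back() invalidates
        # the element references found on the previous visit
        links = driver.find_elements(By.CSS_SELECTOR, link_css)
        links[i].click()
        # wait for the product title to exist instead of re-getting the URL
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.ID, 'mc_view_title')))
        titles.append([parse_product_title(driver.page_source)])
        driver.back()
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, link_css)))
    return titles
```

This keeps the overall click/parse/back structure of the question intact and only changes how the product page's source is obtained.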


