Monday, March 6, 2017

Using Selenium to scrape a table across multiple pages when the URL doesn't change

I have been trying to write a program that scrapes the statistics from www.whoscored.com into a pandas DataFrame.

The main issue is that the table is spread over 30 pages and the URL doesn't change. I can see geckodriver clicking through to the next pages of the site, but when the script reads the table, it always gets the table from the first page.

Ideally, I would like to read all 30 pages into one pandas DataFrame. Here is the code:

from pandas.io.html import read_html
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://ift.tt/2aTrTDj')

table = driver.find_element_by_xpath('//*[@id="statistics-table-summary"]')
table_html = table.get_attribute('innerHTML')

while True:
    page_number = driver.find_element_by_xpath('//*[@id="currentPage"]').get_attribute('value')
    try:
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "statistics-table-summary"))
        )
    except:
        print('-')
    print('Page ' + page_number)
    df = read_html(table_html)[0]
    print(df)

    next_link = driver.find_element_by_xpath('//*[@id="statistics-paging-summary"]/div/dl[2]/dd[3]')
    if page_number == 30:
        break
    next_link.click()

driver.close()



