I have been trying to write a program that scrapes player statistics from www.whoscored.com and builds a pandas DataFrame.
The main issue is that the table is spread over 30 pages while the URL stays the same. I can see geckodriver clicking through to the next pages of the website, but every time the script reads the table, it gets the table from the first page again.
Ideally, I would like to read all 30 pages into a single pandas DataFrame. Here is the code:
from pandas.io.html import read_html
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://ift.tt/2aTrTDj')

table = driver.find_element_by_xpath('//*[@id="statistics-table-summary"]')
table_html = table.get_attribute('innerHTML')

while True:
    page_number = driver.find_element_by_xpath('//*[@id="currentPage"]').get_attribute('value')
    try:
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "statistics-table-summary"))
        )
    except:
        print('-')
    print('Page ' + page_number)
    df = read_html(table_html)[0]
    print(df)
    next_link = driver.find_element_by_xpath('//*[@id="statistics-paging-summary"]/div/dl[2]/dd[3]')
    if page_number == 30:
        break
    next_link.click()
driver.close()
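For what it's worth, the behaviour described above looks consistent with `table_html` being captured once before the loop, so every `read_html` call parses the same page-1 snapshot; re-reading `innerHTML` after each click, collecting one frame per page, and concatenating at the end would give a single DataFrame. (Note also that `get_attribute('value')` returns a string, so `page_number == 30` never matches without `int(page_number)`.) Below is a self-contained sketch of just the collect-and-concat step; the inline HTML strings are hypothetical stand-ins for what the live Selenium loop would pull from each page.

```python
from io import StringIO

import pandas as pd

# Stand-ins for the innerHTML snapshots the Selenium loop would
# capture: one table fragment per paginated view (hypothetical data).
pages = [
    '<table><tr><th>Player</th><th>Goals</th></tr>'
    '<tr><td>Player A</td><td>3</td></tr></table>',
    '<table><tr><th>Player</th><th>Goals</th></tr>'
    '<tr><td>Player B</td><td>5</td></tr></table>',
]

# Parse each page's table into its own DataFrame...
frames = [pd.read_html(StringIO(html))[0] for html in pages]

# ...then stack them into one DataFrame with a fresh index.
df = pd.concat(frames, ignore_index=True)
print(df)
```

In the real script, `pages` would be filled inside the `while` loop by calling `table.get_attribute('innerHTML')` after each successful `next_link.click()`, with the concat done once after the loop exits.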