What I'm Using
- Python
- Requests
- lxml
What I'm Trying to Achieve
I want to create a web scraper that visits an Olark chat transcript page and scrapes the chat from it. The transcripts are behind a login, so the scraper needs to log in, keep the session, and then fetch the transcript page.
What I've Done
I looked at some online guides for writing a Python web scraper and used one of them as the starting point for the following code:
import requests
from lxml import html

USERNAME = "MyUser"
PASSWORD = "MyPWD"
LOGIN_URL = "https://www.olark.com/login"
URL = "TRANSCRIPTURL"


def main():
    session_requests = requests.session()

    # Get login csrf token
    result = session_requests.get(LOGIN_URL)
    tree = html.fromstring(result.text)
    authenticity_token = list(set(tree.xpath("//input[@name='authenticity_token']/@value")))[0]

    # Create payload
    payload = {
        "username": USERNAME,
        "password": PASSWORD,
        "authenticity_token": authenticity_token
    }

    # Perform login
    result = session_requests.post(LOGIN_URL, data=payload, headers=dict(referer=LOGIN_URL))

    # Scrape url
    result = session_requests.get(URL, headers=dict(referer=URL))
    tree = html.fromstring(result.content)
    conversation = tree.xpath("//body/div[@class='main-container']/div[@class='o2-main-container']")
    print(conversation)


if __name__ == '__main__':
    main()
Where I'm Stuck
When I fill in the information and run the script, I get an empty list. I believe this is because the script cannot locate the correct class, since the data is pulled in by a React app.
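One way to test that suspicion is to list the class names that are actually present in the HTML the scraper received, before any JavaScript runs. The following is only a minimal sketch using the standard library; the `SERVER_SHELL` markup is a made-up stand-in for `result.text` (it is not the real Olark page), and `classes_in` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every class name that appears on any start tag."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # A class attribute may hold several space-separated names.
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in(markup):
    """Return the set of class names found in a string of HTML."""
    collector = ClassCollector()
    collector.feed(markup)
    collector.close()
    return collector.classes


# Hypothetical example of what a JavaScript-driven page might send:
# the outer containers exist, but the app has not rendered into them yet.
SERVER_SHELL = (
    '<body><div class="main-container">'
    '<div class="o2-main-container"></div></div>'
    '<script src="app.js"></script></body>'
)

found = classes_in(SERVER_SHELL)
print(sorted(found))
print("transcripts-app" in found)
```

Running `classes_in(result.text)` on the real response would show whether `transcripts-app` is in the served markup at all. If it is missing, no XPath will find it and the content is most likely rendered client-side; if it is present, the problem is in the XPath expression itself.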
Here is the link to the login page: https://www.olark.com/login
I have gotten the authentication to work. Now I am struggling to write the correct XPath for the element I am looking for.
When I use this:
conversation = tree.xpath("//body/div[@class='main-container']/div[@class='o2-main-container']")
It prints: [<Element div at 0x10550b5d0>]
So I know there is an element there, but when I try this:
conversation = tree.xpath("//body/div[@class='main-container']/div[@class='o2-main-container']/div[@class='']/div[@class='transcripts-app']")
I get an empty list. When I view the page source, I can see that class under that parent, so I'm not sure why I can't locate it, other than that the page could be a React app, in which case my scraper does not render the JavaScript and never receives the content I need.
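Separately from the rendering question, the XPath itself is very strict: `@class=''` only matches an element whose `class` attribute exists and is literally empty, and `@class='transcripts-app'` fails as soon as the element carries any additional class. The sketch below illustrates this on a made-up HTML fragment that merely reuses the class names from the question; the real page structure is an assumption.

```python
from lxml import html

# Hypothetical markup: the wrapper div has no class attribute at all,
# and the target div carries an extra class ("loaded").
SAMPLE = """
<html><body>
  <div class="main-container">
    <div class="o2-main-container">
      <div>
        <div class="transcripts-app loaded">chat goes here</div>
      </div>
    </div>
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Exact-match path from the question: every step must match exactly.
strict = tree.xpath(
    "//body/div[@class='main-container']/div[@class='o2-main-container']"
    "/div[@class='']/div[@class='transcripts-app']"
)

# Looser query: any div, at any depth, whose class list contains the token.
loose = tree.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' transcripts-app ')]"
)

print(strict)
print([el.text_content() for el in loose])
```

If the looser query also returns nothing on the real response, the element is not in the HTML that `requests` received, which points back to client-side rendering rather than to the XPath.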
Is there a workaround for this? Could I snapshot the page as static HTML and pull the data from that?
Maybe I could use a headless browser to make the request, but then I'm not sure how that would work with the authentication. Or can I create the session the same way, then use a headless browser to scrape the data once the session exists? Please help!
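The last idea can be sketched roughly as follows: log in with `requests` as before, copy the session cookies into a headless browser, and let the browser render the page. This is an untested sketch under several assumptions: Selenium 4 and a recent Chrome are installed, the site's login state lives entirely in cookies (a token kept in local storage would not carry over), and `cookies_for_browser` and `fetch_rendered_html` are hypothetical helper names.

```python
def cookies_for_browser(cookie_jar):
    """Convert cookie objects (such as those yielded by iterating
    session.cookies) into the dicts that driver.add_cookie() expects."""
    converted = []
    for cookie in cookie_jar:
        entry = {"name": cookie.name, "value": cookie.value, "path": cookie.path or "/"}
        if cookie.domain:
            entry["domain"] = cookie.domain
        if cookie.secure:
            entry["secure"] = True
        converted.append(entry)
    return converted


def fetch_rendered_html(session, landing_url, target_url, wait_css, timeout=15):
    """Open target_url in headless Chrome with the session's cookies and
    return the page source once an element matching wait_css exists."""
    # Imported here so cookies_for_browser works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumes a recent Chrome version
    driver = webdriver.Chrome(options=options)
    try:
        # The browser must be on the cookies' domain before add_cookie() is called.
        driver.get(landing_url)
        for entry in cookies_for_browser(session.cookies):
            driver.add_cookie(entry)

        driver.get(target_url)
        # Wait for the JavaScript app to render the element of interest.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()
```

With the script above, this might be called as `fetch_rendered_html(session_requests, LOGIN_URL, URL, ".transcripts-app")` after the login POST, and the returned string passed to `html.fromstring` as before. An alternative that avoids a browser entirely is to look in the browser's network tab for the request the page uses to load the transcript data and call that URL with the same `requests` session; whether such an endpoint exists here is not known from the question.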