Wednesday, January 24, 2018

Web Scraping a Web App (React, Angular, etc) with Python

What I'm Using

  • Python
  • Requests
  • lxml

What I'm Trying to Achieve

I want to create a web scraper that visits an Olark chat transcript page and scrapes the chat from it. The chat transcripts are behind a login, so the scraper needs to log in (create a session) and then fetch the information.

What I've Done

I've looked at some guides online for writing my own Python web scraper, and used one of them as the starting point for the scraper, with the following code:

import requests
from lxml import html

USERNAME = "MyUser"
PASSWORD = "MyPWD"

LOGIN_URL = "https://www.olark.com/login"
URL = "TRANSCRIPTURL"

def main():
    session_requests = requests.session()

    # Get login csrf token
    result = session_requests.get(LOGIN_URL)
    tree = html.fromstring(result.text)
    authenticity_token = list(set(tree.xpath("//input[@name='authenticity_token']/@value")))[0]

    # Create payload
    payload = {
        "username": USERNAME,
        "password": PASSWORD,
        "authenticity_token": authenticity_token
    }

    # Perform login
    result = session_requests.post(LOGIN_URL, data=payload, headers={"referer": LOGIN_URL})

    # Scrape url
    result = session_requests.get(URL, headers={"referer": URL})
    tree = html.fromstring(result.content)
    conversation = tree.xpath("//body/div[@class='main-container']/div[@class='o2-main-container']")

    print(conversation)

if __name__ == '__main__':
    main()

Where I'm Stuck

When I fill in the information and run the script, I get an empty array. I believe this is because the script cannot locate the correct class, since the data is pulled in by a React app.

Here is the link to the login page: https://www.olark.com/login

I have gotten the authentication to work. Now I am struggling to write the correct XPath to the element I am looking for.

When I use this:

conversation = tree.xpath("//body/div[@class='main-container']/div[@class='o2-main-container']")

It prints: [<Element div at 0x10550b5d0>]

So I know there is an element there, but when I try this:

conversation = tree.xpath("//body/div[@class='main-container']/div[@class='o2-main-container']/div[@class='']/div[@class='transcripts-app']")

I get an empty array. When I view the page source, I can see that class under that parent, so I'm not sure why I cannot locate it, other than that the page may be a React app: my scraper does not render the JavaScript, so it never receives the content I need.
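One quick way to test that suspicion is to check whether the class name occurs at all in the raw text the server returned, before any XPath is involved. The sketch below is only an illustration: missing_markers is a hypothetical helper, and sample_shell is made-up markup standing in for the real response body.

```python
def missing_markers(raw_html, markers):
    """Return the markers that do not occur anywhere in the raw HTML text."""
    return [marker for marker in markers if marker not in raw_html]


# Hypothetical response body: an empty mount point that JavaScript would fill in later
sample_shell = '<div class="main-container"><div class="o2-main-container"></div></div>'

print(missing_markers(sample_shell, ["o2-main-container", "transcripts-app"]))
# prints ['transcripts-app']
```

In the script above, the same check would be missing_markers(result.text, ["transcripts-app"]). If the class is reported missing, the element is created in the browser after page load and no XPath against result.content can match it. If it is present, the XPath itself deserves a closer look.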

I wonder if there is a workaround for this. Could I snapshot the page as static HTML and pull the data from that?
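One reading of the snapshot idea is to save the page from a browser after the JavaScript has run and then parse that saved markup offline. The sketch below is a rough illustration under that assumption. It uses the standard library's html.parser instead of lxml so that it runs without dependencies, and the ClassTextExtractor class, the SNAPSHOT markup, and the chat lines are all invented for the example, not taken from Olark's real page.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside the first element that carries a given CSS class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # nesting depth inside the target element (0 = outside)
        self.done = False   # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.done:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


# Made-up snapshot of a rendered page; a real one would be read from a saved file
SNAPSHOT = (
    '<html><body><div class="main-container"><div class="o2-main-container">'
    '<div><div class="transcripts-app">'
    '<p>Visitor: Hi</p><p>Agent: Hello<br>How can I help?</p>'
    '</div></div></div></div><p>footer</p></body></html>'
)

extractor = ClassTextExtractor("transcripts-app")
extractor.feed(SNAPSHOT)
extractor.close()
print(extractor.chunks)
# prints ['Visitor: Hi', 'Agent: Hello', 'How can I help?']
```

The extractor matches on the class name alone, so it does not depend on the exact chain of wrapper divs above the transcript container.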

Maybe I could use a headless browser to make the request, but then I'm not sure how that would work with the authentication. Or can I create the session the same way, and then use a headless browser to scrape the data once the session exists? Please help!
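A possible shape for that last idea is to log in with requests as before and then copy the session cookies into a headless browser. This is a sketch under assumptions, not a tested solution for Olark: to_selenium_cookies and open_with_session are hypothetical helper names, Selenium and a Chrome driver are assumed to be installed, and it is assumed that the login state lives entirely in cookies.

```python
def to_selenium_cookies(cookie_jar):
    """Convert cookie objects (name, value, domain, path) to dicts for add_cookie()."""
    return [
        {"name": c.name, "value": c.value, "domain": c.domain, "path": c.path}
        for c in cookie_jar
    ]


def open_with_session(session, start_url, target_url):
    """Open target_url in headless Chrome, reusing the cookies of a requests session."""
    # Imported lazily so the helper above can be used without Selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)

    # Selenium only accepts cookies for the domain that is currently loaded
    driver.get(start_url)
    for cookie in to_selenium_cookies(session.cookies):
        driver.add_cookie(cookie)

    # Load the protected page with the copied cookies in place
    driver.get(target_url)
    return driver
```

With the script above this would be called as driver = open_with_session(session_requests, LOGIN_URL, URL), after which driver.page_source holds the rendered markup and can be passed to html.fromstring. The app may need a moment to finish rendering before the transcript appears, so an explicit wait could be necessary.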



