Wednesday, 24 January 2018

Python - Web Scraper with authentication

What I'm Using

  • Python
  • Requests
  • lxml

What I'm Trying to Achieve

I want to create a web scraper that visits an Olark chat transcript page and scrapes the chat from it. The transcripts are behind a login, so the scraper needs to log in (create a session) before it can fetch the page.

What I've Done

I looked at some online guides for writing a Python web scraper and used one of them as the starting point for the following code:

import requests
from lxml import html

USERNAME = "<USERNAME>"
PASSWORD = "<PASSWORD>"

LOGIN_URL = "https://www.olark.com/login"
URL = "URL HERE"

def main():
    session_requests = requests.session()

    # Get login csrf token
    result = session_requests.get(LOGIN_URL)
    tree = html.fromstring(result.text)
    authenticity_token = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]

    # Create payload
    payload = {
        "username": USERNAME, 
        "password": PASSWORD, 
        "csrfmiddlewaretoken": authenticity_token
    }

    # Perform login
    result = session_requests.post(LOGIN_URL, data = payload, headers = dict(referer = LOGIN_URL))

    # Scrape url
    result = session_requests.get(URL, headers = dict(referer = URL))
    tree = html.fromstring(result.content)
    bucket_names = tree.xpath("//div[@class='repo-list--repo']/a/text()")

    print(bucket_names)

if __name__ == '__main__':
    main()

Where I'm Stuck

When I fill in the information and run the script, I get an empty list. I believe this is because the script is not actually logging in, since I did not set up the authenticity token correctly. If someone could help correct the code I have supplied here and make it a functioning scraper, it would be amazing.
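One way to test the suspicion that the login is failing is to inspect the response returned by the login POST before scraping anything. The sketch below is only a heuristic: it assumes a rejected login leaves the session on the login page, which may not be how Olark behaves. The function name and the dashboard URL are hypothetical, and the stand-in objects only mimic the `ok` and `url` attributes of a `requests.Response`.

```python
from types import SimpleNamespace
from urllib.parse import urlsplit

LOGIN_URL = "https://www.olark.com/login"


def login_looks_successful(response, login_url=LOGIN_URL):
    """Heuristic, not a guarantee: report failure when the request errored
    or the final URL is still the login page."""
    if not response.ok:
        return False
    final_path = urlsplit(response.url).path.rstrip("/")
    login_path = urlsplit(login_url).path.rstrip("/")
    return final_path != login_path


# Stand-ins for requests.Response so the check can be tried offline.
# The dashboard URL is a made-up example, not a known Olark address.
rejected = SimpleNamespace(ok=True, url="https://www.olark.com/login")
accepted = SimpleNamespace(ok=True, url="https://www.olark.com/dashboard")

print(login_looks_successful(rejected))
print(login_looks_successful(accepted))
```

In the script above, the value to pass in would be the `result` returned by `session_requests.post(...)`.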

Here is the link to the login page: https://www.olark.com/login

According to the guide I was following, the login page is where the authenticity token should come from.
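To see whether the token is really present, the lookup can be separated from the rest of the script and made to report a missing field instead of raising an `IndexError`. This is a minimal sketch: the field name `csrfmiddlewaretoken` is carried over from the code above and is an assumption about Olark's form, and the sample form is invented purely so the helper can be tried offline.

```python
from lxml import html


def extract_csrf_token(page_text, field_name="csrfmiddlewaretoken"):
    """Return the value of the named hidden input, or None if it is absent."""
    tree = html.fromstring(page_text)
    # lxml lets XPath variables be passed as keyword arguments.
    values = tree.xpath("//input[@name=$name]/@value", name=field_name)
    return values[0] if values else None


# Hypothetical login form, used only to exercise the helper.
SAMPLE_FORM = """
<html><body>
<form method="post">
  <input type="hidden" name="csrfmiddlewaretoken" value="abc123">
  <input type="text" name="username">
</form>
</body></html>
"""

token = extract_csrf_token(SAMPLE_FORM)
missing = extract_csrf_token("<html><body><form></form></body></html>")
print(token, missing)
```

If the helper returns `None` for the real login page, the form probably uses a different field name, or builds the token with JavaScript.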

Once on the transcript page, this is the markup you see:

<div class="transcript-conversation">
  <div class="conversation-item-info">
    <div class="conversation-item-info-message">
      <div class="conversation-item-info-text">
        <div class="conversation-item-info-title">Lorem ipsum</div>
        <div class="conversation-item-info-description">
          <!-- react-text: 45 -->Lorem ipsum
          <!-- /react-text -->
        </div>
      </div>
      <div class="conversation-item-info-date">Lorem ipsum</div>
    </div>
  </div>
  <div class="conversation-item-info">
    <div class="conversation-item-info-message">
      <div class="conversation-item-info-text">
        <div class="conversation-item-info-title">Lorem ipsum</div>
        <div class="conversation-item-info-description"><a class="transcript-link" href="" title=""></a></div>
      </div>
      <div class="conversation-item-info-date"></div>
    </div>
  </div>
  <div class="conversation-item-info">
    <div class="conversation-item-info-message">
      <div class="conversation-item-info-text">
        <div class="conversation-item-info-title">Lorem ipsum</div>
        <div class="conversation-item-info-description">
          <!-- react-text: 59 -->- Lorem ipsum
          <!-- /react-text -->
        </div>
        <div class="conversation-item-info-description">
          <!-- react-text: 61 -->- Lorem ipsum
          <!-- /react-text --><a class="transcript-link" href="" title="">Lorem ipsum</a></div>
        <div class="conversation-item-info-description">
          <!-- react-text: 64 -->- Lorem ipsum
          <!-- /react-text -->
        </div>
      </div>
      <div class="conversation-item-info-date"></div>
    </div>
  </div>
  <div class="conversation-item-visitor">
    <div class="conversation-item-visitor-name">Lorem ipsum</div>
    <div class="conversation-item-visitor-message">
      <div class="conversation-item-visitor-text">
        <!-- react-text: 70 -->Lorem ipsum
        <!-- /react-text -->
      </div>
      <div class="conversation-item-visitor-date">11:47:15</div>
    </div>
  </div>
  <div class="conversation-item-visitor">
    <div class="conversation-item-visitor-name">Lorem ipsum</div>
    <div class="conversation-item-visitor-message">
      <div class="conversation-item-visitor-text">
        <!-- react-text: 76 -->Lorem ipsum
        <!-- /react-text -->
      </div>
      <div class="conversation-item-visitor-date">11:47:28</div>
    </div>
  </div>
  <div class="conversation-item-operator">
    <div class="conversation-item-operator-name">Lorem ipsum</div>
    <div class="conversation-item-operator-message">
      <div class="conversation-item-operator-text">
        <!-- react-text: 82 -->Lorem ipsum.
        <!-- /react-text -->
      </div>
      <div class="conversation-item-operator-date">11:48:20</div>
    </div>
  </div>
  <div class="conversation-item-operator">
    <div class="conversation-item-operator-name">Lorem ipsum</div>
    <div class="conversation-item-operator-message">
      <div class="conversation-item-operator-text">
        <!-- react-text: 88 -->Lorem ipsum.
        <!-- /react-text -->
      </div>
      <div class="conversation-item-operator-date">11:48:29</div>
    </div>
  </div>
  <div class="conversation-item-operator">
    <div class="conversation-item-operator-name">Rebecca</div>
    <div class="conversation-item-operator-message">
      <div class="conversation-item-operator-text">
        <!-- react-text: 94 -->Lorem ipsum.
        <!-- /react-text -->
      </div>
      <div class="conversation-item-operator-date">11:48:35</div>
    </div>
  </div>
  <div class="conversation-item-visitor">
    <div class="conversation-item-visitor-name">Lorem ipsum</div>
    <div class="conversation-item-visitor-message">
      <div class="conversation-item-visitor-text">
        <!-- react-text: 100 -->Lorem ipsum
        <!-- /react-text -->
      </div>
      <div class="conversation-item-visitor-date">11:48:49</div>
    </div>
  </div>
  <div class="conversation-item-operator">
    <div class="conversation-item-operator-name">Lorem ipsum</div>
    <div class="conversation-item-operator-message">
      <div class="conversation-item-operator-text">
        <!-- react-text: 106 -->Lorem ipsum
        <!-- /react-text -->
      </div>
      <div class="conversation-item-operator-date">11:48:55</div>
    </div>
  </div>
  <div class="conversation-item-visitor">
    <div class="conversation-item-visitor-name">Lorem ipsum</div>
    <div class="conversation-item-visitor-message">
      <div class="conversation-item-visitor-text">
        <!-- react-text: 112 -->Lorem ipsum
        <!-- /react-text -->
      </div>
      <div class="conversation-item-visitor-date">11:49:05</div>
    </div>
  </div>
  <div class="conversation-item-operator">
    <div class="conversation-item-operator-name">Lorem ipsum</div>
    <div class="conversation-item-operator-message">
      <div class="conversation-item-operator-text">
        <!-- react-text: 118 -->Lorem ipsum
        <!-- /react-text -->
      </div>
      <div class="conversation-item-operator-date">11:49:44</div>
    </div>
  </div>
  <div class="conversation-item-operator">
    <div class="conversation-item-operator-name">Lorem ipsum</div>
    <div class="conversation-item-operator-message">
      <div class="conversation-item-operator-text">
        <!-- react-text: 124 -->Lorem ipsum.
        <!-- /react-text -->
      </div>
      <div class="conversation-item-operator-date">11:49:55</div>
    </div>
  </div>
  <div class="conversation-item-visitor">
    <div class="conversation-item-visitor-name">Lorem ipsum</div>
    <div class="conversation-item-visitor-message">
      <div class="conversation-item-visitor-text">
        <!-- react-text: 130 -->Lorem ipsum
        <!-- /react-text -->
      </div>
      <div class="conversation-item-visitor-date">11:50:14</div>
    </div>
  </div>
  <div class="conversation-item-operator">
    <div class="conversation-item-operator-name">Lorem ipsum</div>
    <div class="conversation-item-operator-message">
      <div class="conversation-item-operator-text">
        <!-- react-text: 136 -->Lorem ipsum.
        <!-- /react-text -->
      </div>
      <div class="conversation-item-operator-date">11:50:32</div>
    </div>
  </div>
  <div class="conversation-item-operator">
    <div class="conversation-item-operator-name">Lorem ipsum</div>
    <div class="conversation-item-operator-message">
      <div class="conversation-item-operator-text">
        <!-- react-text: 142 -->Lorem ipsum
        <!-- /react-text -->
      </div>
      <div class="conversation-item-operator-date">11:50:39</div>
    </div>
  </div>
  <div class="conversation-item-visitor">
    <div class="conversation-item-visitor-name">Lorem ipsum</div>
    <div class="conversation-item-visitor-message">
      <div class="conversation-item-visitor-text">
        <!-- react-text: 148 -->Lorem ipsum!
        <!-- /react-text -->
      </div>
      <div class="conversation-item-visitor-date">11:50:49</div>
    </div>
  </div>
  <div class="conversation-item-operator">
    <div class="conversation-item-operator-name">Lorem ipsum</div>
    <div class="conversation-item-operator-message">
      <div class="conversation-item-operator-text">
        <!-- react-text: 154 -->Lorem ipsum!
        <!-- /react-text -->
      </div>
      <div class="conversation-item-operator-date">11:50:56</div>
    </div>
  </div>
  <div class="conversation-item-operator">
    <div class="conversation-item-operator-name">Lorem ipsum</div>
    <div class="conversation-item-operator-message">
      <div class="conversation-item-operator-text">
        <!-- react-text: 160 -->Lorem ipsum!
        <!-- /react-text -->
      </div>
      <div class="conversation-item-operator-date">11:50:58</div>
    </div>
  </div>
  <div class="conversation-item-command">
    <div class="conversation-item-command-message">
      <div class="conversation-item-command-text"><span class="conversation-item-command-title">Used command: </span><span class="conversation-item-command-description"><!-- react-text: 167 -->!end<!-- /react-text --></span></div>
      <div class="conversation-item-command-date">11:50:59</div>
    </div>
  </div>
  <div class="conversation-item-info">
    <div class="conversation-item-info-message">
      <div class="conversation-item-info-text">
        <div class="conversation-item-info-title">Visitor left the page or the chat session was ended.</div>
      </div>
      <div class="conversation-item-info-date"></div>
    </div>
  </div>
</div>

I am hoping to scrape the entire transcript, i.e. everything inside the transcript-conversation div.
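Assuming the markup above is actually present in the HTML that `requests` receives, the visitor and operator messages could be pulled out with lxml roughly as follows. The `react-text` comments suggest the page is rendered client-side by React; if so, the raw response may not contain these elements at all. The class names come from the markup above, while the function names and the shortened sample are illustrative only.

```python
from lxml import html


def _text_of(item, class_name):
    """Whitespace-normalised text of the first matching div, or '' if none."""
    nodes = item.xpath(".//div[@class=$cls]", cls=class_name)
    # text_content() skips comment nodes such as <!-- react-text: 70 -->.
    return " ".join(nodes[0].text_content().split()) if nodes else ""


def parse_transcript(page_html):
    """Return visitor/operator messages as a list of dicts, in page order."""
    tree = html.fromstring(page_html)
    items = tree.xpath(
        "//div[@class='transcript-conversation']"
        "/div[@class='conversation-item-visitor'"
        " or @class='conversation-item-operator']"
    )
    messages = []
    for item in items:
        role = item.get("class").rsplit("-", 1)[-1]  # 'visitor' or 'operator'
        prefix = "conversation-item-" + role
        messages.append({
            "role": role,
            "name": _text_of(item, prefix + "-name"),
            "text": _text_of(item, prefix + "-text"),
            "time": _text_of(item, prefix + "-date"),
        })
    return messages


# Shortened, made-up sample in the same shape as the transcript markup.
SAMPLE = """
<div class="transcript-conversation">
  <div class="conversation-item-visitor">
    <div class="conversation-item-visitor-name">Visitor</div>
    <div class="conversation-item-visitor-message">
      <div class="conversation-item-visitor-text">
        <!-- react-text: 70 -->Hello
        <!-- /react-text -->
      </div>
      <div class="conversation-item-visitor-date">11:47:15</div>
    </div>
  </div>
  <div class="conversation-item-operator">
    <div class="conversation-item-operator-name">Agent</div>
    <div class="conversation-item-operator-message">
      <div class="conversation-item-operator-text">
        <!-- react-text: 82 -->Hi there.
        <!-- /react-text -->
      </div>
      <div class="conversation-item-operator-date">11:48:20</div>
    </div>
  </div>
</div>
"""

messages = parse_transcript(SAMPLE)
for message in messages:
    print(message)
```

The info and command items use different child class names (for example `conversation-item-info-title`), so this sketch skips them; they would need their own handling.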



