What I'm Using
- Python
- Requests
- lxml
What I'm Trying to Achieve
I am hoping to create a web scraper that visits an Olark chat transcript page and scrapes the chat from it. The chat transcripts are behind a login, so the scraper needs to log in and create a session before it fetches the page.
What I've Done
I've looked at some guides online for writing my own Python web scraper and used one of them as the starting point for the following code:
import requests
from lxml import html

USERNAME = "<USERNAME>"
PASSWORD = "<PASSWORD>"
LOGIN_URL = "https://www.olark.com/login"
URL = "URL HERE"

def main():
    session_requests = requests.session()

    # Get login csrf token
    result = session_requests.get(LOGIN_URL)
    tree = html.fromstring(result.text)
    authenticity_token = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]

    # Create payload
    payload = {
        "username": USERNAME,
        "password": PASSWORD,
        "csrfmiddlewaretoken": authenticity_token
    }

    # Perform login
    result = session_requests.post(LOGIN_URL, data=payload, headers=dict(referer=LOGIN_URL))

    # Scrape url
    result = session_requests.get(URL, headers=dict(referer=URL))
    tree = html.fromstring(result.content)
    bucket_names = tree.xpath("//div[@class='repo-list--repo']/a/text()")
    print(bucket_names)

if __name__ == '__main__':
    main()
Where I'm Stuck
When I fill in the information and run the script, I get an empty list. I believe this is because the script is not actually logging in, since I did not set up the authenticity token correctly. If someone could help correct the code based on what I have supplied here and make it a functioning scraper, it would be amazing.
Here is the link to the login page: https://www.olark.com/login
According to the guide I was following, this is where I would get the necessary information from for the authenticity token.
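One way to test that assumption is to check the token lookup in isolation. The sketch below is only an illustration: extract_csrf_token and the sample markup are hypothetical, and the field name csrfmiddlewaretoken is taken from the script above, not confirmed against the real Olark login page. Unlike list(set(...))[0], which raises an IndexError when nothing matches, the helper returns None so a missing token is easy to spot.

```python
from lxml import html


def extract_csrf_token(page_html, field_name="csrfmiddlewaretoken"):
    """Return the value of the input named field_name, or None if it is absent."""
    tree = html.fromstring(page_html)
    # XPath variables ($name) are passed to lxml as keyword arguments.
    values = tree.xpath("//input[@name=$name]/@value", name=field_name)
    return values[0] if values else None


# Hypothetical login markup for illustration only; the real page may differ.
sample_login_page = """
<html><body><form method="post">
  <input type="hidden" name="csrfmiddlewaretoken" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form></body></html>
"""

token = extract_csrf_token(sample_login_page)
missing = extract_csrf_token("<html><body><form></form></body></html>")
print("token found:", token)
print("token when the field is absent:", missing)
```

Running the same helper on result.text from the GET request to LOGIN_URL would show whether the real page exposes such a field at all.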
Once on the transcript page, this is the markup you see:
<div class="transcript-conversation">
<div class="conversation-item-info">
<div class="conversation-item-info-message">
<div class="conversation-item-info-text">
<div class="conversation-item-info-title">Lorem ipsum</div>
<div class="conversation-item-info-description">
<!-- react-text: 45 -->Lorem ipsum
<!-- /react-text -->
</div>
</div>
<div class="conversation-item-info-date">Lorem ipsum</div>
</div>
</div>
<div class="conversation-item-info">
<div class="conversation-item-info-message">
<div class="conversation-item-info-text">
<div class="conversation-item-info-title">Lorem ipsum</div>
<div class="conversation-item-info-description"><a class="transcript-link" href="" title=""></a></div>
</div>
<div class="conversation-item-info-date"></div>
</div>
</div>
<div class="conversation-item-info">
<div class="conversation-item-info-message">
<div class="conversation-item-info-text">
<div class="conversation-item-info-title">Lorem ipsum</div>
<div class="conversation-item-info-description">
<!-- react-text: 59 -->- Lorem ipsum
<!-- /react-text -->
</div>
<div class="conversation-item-info-description">
<!-- react-text: 61 -->- Lorem ipsum
<!-- /react-text --><a class="transcript-link" href="" title="">Lorem ipsum</a></div>
<div class="conversation-item-info-description">
<!-- react-text: 64 -->- Lorem ipsum
<!-- /react-text -->
</div>
</div>
<div class="conversation-item-info-date"></div>
</div>
</div>
<div class="conversation-item-visitor">
<div class="conversation-item-visitor-name">Lorem ipsum</div>
<div class="conversation-item-visitor-message">
<div class="conversation-item-visitor-text">
<!-- react-text: 70 -->Lorem ipsum
<!-- /react-text -->
</div>
<div class="conversation-item-visitor-date">11:47:15</div>
</div>
</div>
<div class="conversation-item-visitor">
<div class="conversation-item-visitor-name">Lorem ipsum</div>
<div class="conversation-item-visitor-message">
<div class="conversation-item-visitor-text">
<!-- react-text: 76 -->Lorem ipsum
<!-- /react-text -->
</div>
<div class="conversation-item-visitor-date">11:47:28</div>
</div>
</div>
<div class="conversation-item-operator">
<div class="conversation-item-operator-name">Lorem ipsum</div>
<div class="conversation-item-operator-message">
<div class="conversation-item-operator-text">
<!-- react-text: 82 -->Lorem ipsum.
<!-- /react-text -->
</div>
<div class="conversation-item-operator-date">11:48:20</div>
</div>
</div>
<div class="conversation-item-operator">
<div class="conversation-item-operator-name">Lorem ipsum</div>
<div class="conversation-item-operator-message">
<div class="conversation-item-operator-text">
<!-- react-text: 88 -->Lorem ipsum.
<!-- /react-text -->
</div>
<div class="conversation-item-operator-date">11:48:29</div>
</div>
</div>
<div class="conversation-item-operator">
<div class="conversation-item-operator-name">Rebecca</div>
<div class="conversation-item-operator-message">
<div class="conversation-item-operator-text">
<!-- react-text: 94 -->Lorem ipsum.
<!-- /react-text -->
</div>
<div class="conversation-item-operator-date">11:48:35</div>
</div>
</div>
<div class="conversation-item-visitor">
<div class="conversation-item-visitor-name">Lorem ipsum</div>
<div class="conversation-item-visitor-message">
<div class="conversation-item-visitor-text">
<!-- react-text: 100 -->Lorem ipsum
<!-- /react-text -->
</div>
<div class="conversation-item-visitor-date">11:48:49</div>
</div>
</div>
<div class="conversation-item-operator">
<div class="conversation-item-operator-name">Lorem ipsum</div>
<div class="conversation-item-operator-message">
<div class="conversation-item-operator-text">
<!-- react-text: 106 -->Lorem ipsum
<!-- /react-text -->
</div>
<div class="conversation-item-operator-date">11:48:55</div>
</div>
</div>
<div class="conversation-item-visitor">
<div class="conversation-item-visitor-name">Lorem ipsum</div>
<div class="conversation-item-visitor-message">
<div class="conversation-item-visitor-text">
<!-- react-text: 112 -->Lorem ipsum
<!-- /react-text -->
</div>
<div class="conversation-item-visitor-date">11:49:05</div>
</div>
</div>
<div class="conversation-item-operator">
<div class="conversation-item-operator-name">Lorem ipsum</div>
<div class="conversation-item-operator-message">
<div class="conversation-item-operator-text">
<!-- react-text: 118 -->Lorem ipsum
<!-- /react-text -->
</div>
<div class="conversation-item-operator-date">11:49:44</div>
</div>
</div>
<div class="conversation-item-operator">
<div class="conversation-item-operator-name">Lorem ipsum</div>
<div class="conversation-item-operator-message">
<div class="conversation-item-operator-text">
<!-- react-text: 124 -->Lorem ipsum.
<!-- /react-text -->
</div>
<div class="conversation-item-operator-date">11:49:55</div>
</div>
</div>
<div class="conversation-item-visitor">
<div class="conversation-item-visitor-name">Lorem ipsum</div>
<div class="conversation-item-visitor-message">
<div class="conversation-item-visitor-text">
<!-- react-text: 130 -->Lorem ipsum
<!-- /react-text -->
</div>
<div class="conversation-item-visitor-date">11:50:14</div>
</div>
</div>
<div class="conversation-item-operator">
<div class="conversation-item-operator-name">Lorem ipsum</div>
<div class="conversation-item-operator-message">
<div class="conversation-item-operator-text">
<!-- react-text: 136 -->Lorem ipsum.
<!-- /react-text -->
</div>
<div class="conversation-item-operator-date">11:50:32</div>
</div>
</div>
<div class="conversation-item-operator">
<div class="conversation-item-operator-name">Lorem ipsum</div>
<div class="conversation-item-operator-message">
<div class="conversation-item-operator-text">
<!-- react-text: 142 -->Lorem ipsum
<!-- /react-text -->
</div>
<div class="conversation-item-operator-date">11:50:39</div>
</div>
</div>
<div class="conversation-item-visitor">
<div class="conversation-item-visitor-name">Lorem ipsum</div>
<div class="conversation-item-visitor-message">
<div class="conversation-item-visitor-text">
<!-- react-text: 148 -->Lorem ipsum!
<!-- /react-text -->
</div>
<div class="conversation-item-visitor-date">11:50:49</div>
</div>
</div>
<div class="conversation-item-operator">
<div class="conversation-item-operator-name">Lorem ipsum</div>
<div class="conversation-item-operator-message">
<div class="conversation-item-operator-text">
<!-- react-text: 154 -->Lorem ipsum!
<!-- /react-text -->
</div>
<div class="conversation-item-operator-date">11:50:56</div>
</div>
</div>
<div class="conversation-item-operator">
<div class="conversation-item-operator-name">Lorem ipsum</div>
<div class="conversation-item-operator-message">
<div class="conversation-item-operator-text">
<!-- react-text: 160 -->Lorem ipsum!
<!-- /react-text -->
</div>
<div class="conversation-item-operator-date">11:50:58</div>
</div>
</div>
<div class="conversation-item-command">
<div class="conversation-item-command-message">
<div class="conversation-item-command-text"><span class="conversation-item-command-title">Used command: </span><span class="conversation-item-command-description"><!-- react-text: 167 -->!end<!-- /react-text --></span></div>
<div class="conversation-item-command-date">11:50:59</div>
</div>
</div>
<div class="conversation-item-info">
<div class="conversation-item-info-message">
<div class="conversation-item-info-text">
<div class="conversation-item-info-title">Visitor left the page or the chat session was ended.</div>
</div>
<div class="conversation-item-info-date"></div>
</div>
</div>
</div>
I am hoping to scrape the entire transcript, i.e. everything inside the transcript-conversation div.
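Assuming the response really contains the markup above, a minimal sketch of pulling the visitor and operator messages out with lxml could look like this. parse_transcript and the shortened sample page are made up for illustration; the class names come from the HTML shown. The react-text comments suggest the page is rendered by React, so the HTML returned to requests may not include these elements, in which case this XPath would also return nothing.

```python
from lxml import html


def parse_transcript(page_html):
    """Return (role, name, text, timestamp) tuples for visitor/operator messages."""
    tree = html.fromstring(page_html)
    items = tree.xpath(
        "//div[@class='transcript-conversation']"
        "/div[@class='conversation-item-visitor'"
        " or @class='conversation-item-operator']"
    )
    messages = []
    for item in items:
        # "conversation-item-visitor" -> "visitor"
        role = item.get("class").rsplit("-", 1)[-1]
        prefix = "conversation-item-" + role
        # string() concatenates text nodes only, so the react-text comments are skipped.
        name = item.xpath("string(.//div[@class=$c])", c=prefix + "-name").strip()
        text = item.xpath("string(.//div[@class=$c])", c=prefix + "-text").strip()
        stamp = item.xpath("string(.//div[@class=$c])", c=prefix + "-date").strip()
        messages.append((role, name, text, stamp))
    return messages


# Shortened, hypothetical sample that follows the structure shown above.
sample_page = """
<html><body>
<div class="transcript-conversation">
  <div class="conversation-item-info">
    <div class="conversation-item-info-message">
      <div class="conversation-item-info-text">
        <div class="conversation-item-info-title">Lorem ipsum</div>
      </div>
    </div>
  </div>
  <div class="conversation-item-visitor">
    <div class="conversation-item-visitor-name">Visitor</div>
    <div class="conversation-item-visitor-message">
      <div class="conversation-item-visitor-text">
        <!-- react-text: 70 -->Hello
        <!-- /react-text -->
      </div>
      <div class="conversation-item-visitor-date">11:47:15</div>
    </div>
  </div>
  <div class="conversation-item-operator">
    <div class="conversation-item-operator-name">Rebecca</div>
    <div class="conversation-item-operator-message">
      <div class="conversation-item-operator-text">
        <!-- react-text: 94 -->Hi there.
        <!-- /react-text -->
      </div>
      <div class="conversation-item-operator-date">11:48:35</div>
    </div>
  </div>
</div>
</body></html>
"""

transcript = parse_transcript(sample_page)
for role, name, text, stamp in transcript:
    print(f"[{stamp}] {name} ({role}): {text}")
```

The sketch deliberately skips the info and command rows; they use the same naming pattern, so the XPath condition could be widened if those rows are needed as well.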