I am trying to scrape full book reviews from the New York Times so that I can run sentiment analysis on them. I am aware of the NY Times API and use it to get book review URLs, but it only returns a snippet, so I need a scraper to get the full article text. I believe nytimes.com has protection against bots scraping the site, but I know there are ways around it.
I found this Python scraper, which works and can pull the full text from nytimes.com, but I would prefer to implement my solution in Go. Should I just port it to Go, or is that solution unnecessarily complex? I have already tried changing the User-Agent header, but everything I do in Go ends in an infinite redirect loop error.
Example:
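res, err := http.Get("http://ift.tt/1R0GwnS")
if err != nil {
	panic(err)
}
defer res.Body.Close()
// res.Body is a stream (io.ReadCloser); read it to get the page text.
body, err := io.ReadAll(res.Body)
if err != nil {
	panic(err)
}
fmt.Println(string(body))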
Results in:
http: panic serving [::1]:56464: Get http://ift.tt/2gcI58v techno-by-anthony-marra.html?_r=4: stopped after 10 redirects
Any help is appreciated! Thank you!