Monday, December 5, 2016

How can I circumvent bot protection when scraping full NYTimes articles?

I am trying to scrape full book reviews from the New York Times in order to perform sentiment analysis on them. I am aware of the NY Times API and am using it to get book review URLs, but I need a scraper to get the full article text, since the API returns only a snippet. I believe nytimes.com has protection that keeps bots from scraping the site, but I know there are ways around it.

I found this Python scraper, which works and can pull full text from nytimes.com, but I would prefer to implement my solution in Go. Should I just port it to Go, or is that solution unnecessarily complex? I have already tried changing the User-Agent header, but everything I do in Go ends in an infinite redirect loop error.

Example:

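// Requires imports: "fmt", "io", "net/http"
res, err := http.Get("http://ift.tt/1R0GwnS")
if err != nil {
    panic(err)
}
defer res.Body.Close()

// Read the body before printing; res.Body itself is only a reader.
body, err := io.ReadAll(res.Body)
if err != nil {
    panic(err)
}
fmt.Println(string(body))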

Results in:

http: panic serving [::1]:56464: Get http://ift.tt/2gcI58v techno-by-anthony-marra.html?_r=4: stopped after 10 redirects
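
One possible cause of the loop, stated as an assumption rather than a confirmed diagnosis: Go's default http.Client has no cookie jar, so if the site sets a cookie during a redirect and expects it back on the next hop, every hop looks like a first visit and the client gives up after 10 redirects. Below is a minimal sketch of a client that keeps cookies across redirects and sends a custom User-Agent. The helper names newClient and fetch are hypothetical, the User-Agent string is a placeholder, and whether this is enough for nytimes.com has not been verified here.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"os"
)

// newClient (hypothetical helper) returns an http.Client with an in-memory
// cookie jar, so cookies set by one response, including an intermediate
// redirect, are sent on the following requests.
func newClient() (*http.Client, error) {
	jar, err := cookiejar.New(nil) // nil options: no public suffix list
	if err != nil {
		return nil, err
	}
	return &http.Client{Jar: jar}, nil
}

// fetch (hypothetical helper) GETs url with the given User-Agent and
// returns the response body as a string.
func fetch(client *http.Client, url, userAgent string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", userAgent)

	res, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer res.Body.Close()

	if res.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", res.Status)
	}
	body, err := io.ReadAll(res.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: fetch <url>")
		os.Exit(1)
	}
	client, err := newClient()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Placeholder User-Agent; an assumption, not a value known to work.
	text, err := fetch(client, os.Args[1], "Mozilla/5.0 (compatible; example-fetcher)")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(text)
}
```

The article URL is passed as the first command-line argument. Returning errors instead of calling panic also keeps a failed fetch from taking down the HTTP handler it runs in.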

Any help is appreciated! Thank you!



