Wednesday, October 25, 2017

Web Scraping Business Area on Google in R

I am trying to solve the following problem in R:

  1. I have a long list of business names, addresses, post codes, etc.
  2. For each of these businesses I want to find out its business area. In this example, the key phrase I am looking to extract is: Piano store in Munich, Germany

With the help of other posts I am able to extract the top search links, but I cannot adapt the HTML nodes and attributes to find the "business area" term that Google Search returns.

    library(XML)
    library(bitops)
    library(RCurl)

    # Build the Google search URL; optionally wrap the term in quotes
    getGoogleURL <- function(search.term, domain = '.co.uk', quotes = TRUE) {
      search.term <- gsub(' ', '%20', search.term)
      if (quotes) search.term <- paste('%22', search.term, '%22', sep = '')
      paste('http://www.google', domain, '/search?q=', search.term, sep = '')
    }

    # Fetch the results page and return the href of every result link
    getGoogleLinks <- function(google.url) {
      doc   <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
      html  <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...) {})
      nodes <- getNodeSet(html, "//h3[@class='r']//a")
      sapply(nodes, function(x) xmlAttrs(x)[["href"]])
    }

    search.term <- "Klavierhaus Vogel e. K."
    quotes <- TRUE
    search.url <- getGoogleURL(search.term = search.term, quotes = quotes)

    links <- getGoogleLinks(search.url)

As far as I understand, the part I have to adapt is this:

    nodes <- getNodeSet(html, "//h3[@class='r']//a")
    sapply(nodes, function(x) xmlAttrs(x)[["href"]])

But I cannot find the appropriate node and attribute for the "business area" keyword on Google. When I look the business name up manually, Google shows the "Piano store in Munich, Germany" description.
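For illustration, I imagine the adapted helper would look roughly like the sketch below. The XPath (and the 'business-category' class inside it) is only a placeholder, since I have not yet identified the real node Google uses for that text; it would have to be replaced with whatever class or tag appears when inspecting the results page with the browser's developer tools.

    # Sketch of a helper that tries to pull the business-area text from the
    # results page. The class name 'business-category' is a placeholder and
    # must be replaced with the actual class found by inspecting Google's HTML.
    getGoogleBusinessArea <- function(google.url) {
      doc   <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
      html  <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...) {})
      nodes <- getNodeSet(html, "//span[contains(@class, 'business-category')]")
      if (length(nodes) == 0) return(NA_character_)   # nothing found on the page
      xmlValue(nodes[[1]])
    }

    area <- getGoogleBusinessArea(search.url)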

Any help is highly appreciated.



