Wednesday, June 13, 2018

Scraping URLs from Wikipedia in R yields only half the URL

I am trying to extract the URLs from the Wikipedia page that lists chief executive officers; the code then opens each URL and copies the page text into .txt files for me to use. The trouble is that the "allurls" object only contains the latter half of each URL. For example, allurls[1] gives "/wiki/Pierre_Nanterme". Thus, when I run this code

library("xml2")
library("rvest")

url <- "https://en.wikipedia.org/wiki/List_of_chief_executive_officers"

# collect the link targets from the second table column
allurls <- url %>%
  read_html() %>%
  html_nodes("td:nth-child(2) a") %>%
  html_attr("href") %>%
  .[!duplicated(.)] %>%  # drop duplicate links
  # open each link and keep the <body> node
  lapply(function(x) read_html(x) %>% html_nodes("body")) %>%
  # write each body to a temporary .txt file
  Map(function(x, y) write_html(x, tempfile(y, fileext = ".txt"), options = "format"),
      ., paste("tmp", 1:length(.)))

allurls[1]

I get the following error: "Error: '/wiki/Pierre_Nanterme' does not exist."
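I suspect the hrefs in the table are relative paths (everything after the domain), so read_html() tries to open them as local files. Below is a minimal sketch of the kind of fix I have in mind, assuming the relative links are the only problem: the xml2::url_absolute() step is my addition, and it resolves each href against the page URL before anything tries to read it.

library("xml2")
library("rvest")

url <- "https://en.wikipedia.org/wiki/List_of_chief_executive_officers"

# Resolve the relative "/wiki/..." hrefs against the page URL so that
# read_html() later receives full https:// addresses.
allurls <- url %>%
  read_html() %>%
  html_nodes("td:nth-child(2) a") %>%
  html_attr("href") %>%
  url_absolute(url) %>%  # "/wiki/Pierre_Nanterme" -> "https://en.wikipedia.org/wiki/Pierre_Nanterme"
  .[!duplicated(.)]

allurls[1]

If that returns the full address, the rest of the pipeline (the lapply/Map steps) should presumably work unchanged.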



