Tuesday, March 30, 2021

Webscraping all text in an article in R

I am building a web scraper that gathers the full text of an article. So far I have not been able to grab the HTML needed for the full text of the article. The text should later be written to a CSV file, with all of the text in one row.
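To make the "all in one row" goal concrete, here is a minimal sketch of one way to do it. It assumes a hypothetical page whose body sits in a `div.article-body` containing `<p>` elements; the inline HTML, the selector, and the names `paragraphs`, `oneRow`, and `outFile` are illustrative only and are not taken from the real article page.

```r
library(rvest)

# Hypothetical stand-in for an article page; the real markup may differ.
page <- read_html(
  "<html><body><div class='article-body'><p>First paragraph.</p><p>Second paragraph.</p></div></body></html>"
)

# html_text() returns one string per matched node.
paragraphs <- page %>%
  html_nodes("div.article-body p") %>%
  html_text()

# Collapse to a single string so the data frame has exactly one row.
oneRow <- data.frame(
  Full.Text = paste(paragraphs, collapse = " "),
  stringsAsFactors = FALSE
)

# A temporary file keeps the sketch from touching the working directory.
outFile <- tempfile(fileext = ".csv")
write.csv(oneRow, outFile, row.names = FALSE)
```

Without the `paste(..., collapse = " ")` step, each matched node would become its own row in the CSV.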

My output is currently blank.

My program is below:

library(rvest)
library(RCurl)
library(XML)
library(stringr)
# pdftools is used to read full text from PDF files
# install.packages("pdftools")
library(pdftools)

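# Extract the article body text and collapse it into a single string.
# Note: "a.article-body" only matches <a> elements with that class; if it
# matches nothing, the result is an empty string and the CSV cell is blank.
fullText <- function(parsedDocument){
  bodyText <- parsedDocument %>%
    html_nodes("a.article-body") %>%
    html_text()
  # Piping into return(fullText) was a bug; return the collapsed text instead
  paste(bodyText, collapse = " ")
}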

# main function; takes the DOI URL of the article as its input parameter
testFullText <- function(DOIurl){
  parsedDocument <- read_html(DOIurl)
  DNAresearch <- data.frame()
  allData <- data.frame("Full Text" = fullText(parsedDocument), stringsAsFactors = FALSE)
  DNAresearch <-  rbind(DNAresearch, allData)
  write.csv(DNAresearch, "DNAresearch.csv", row.names = FALSE)
}
testFullText("https://doi.org/10.1093/dnares/dsm026")
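One possible cause of blank output is a selector that matches no nodes. The sketch below shows a way to check this by counting matches for a few candidate selectors. The inline HTML is a hypothetical stand-in, assuming the body lives in a `<div>` and not an `<a>`; the actual structure of the target page is not known here and should be inspected in the browser.

```r
library(rvest)

# Hypothetical markup: the body sits in a <div>, not an <a>.
page <- read_html(
  "<html><body><div class='article-body'><p>Some text.</p></div></body></html>"
)

# Count the nodes each candidate selector matches.
selectors <- c("a.article-body", "div.article-body", ".article-body")
matches <- sapply(selectors, function(sel) length(html_nodes(page, sel)))
print(matches)
```

A count of zero means `html_text()` has nothing to extract for that selector, so the selector needs to change before any text can reach the CSV.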


