Tuesday, 24 February 2015

Extract URL and title in R

I'm having difficulty extracting a specific piece of text from the source code of a web page. I can extract the entire list, but I only need the entry for one country, Argentina in this case.


The source code is:



<div class="article-content">
<div class="RichTextElement">
<div><h3 style="background-color: transparent; color: rgb(51, 51, 51);"><span style="font-weight: normal; font-family: Verdana;">Afghanistan - </span><span style="background-color: transparent; font-weight: normal; font-family: Verdana;"><a title="Tax Authority in Afganistan" href="http://mof.gov.af/en" style="background-color: transparent; color: rgb(51, 51, 51);">Ministry of Finance</a><br />Argentina - <a title="Tax Authority in Argentina" href="http://ift.tt/K9SM1t" style="background-color: transparent; color: rgb(51, 51, 51);">Federal Administration of Public Revenues</a><br />


I only need the link text "Federal Administration of Public Revenues" and its URL "http://ift.tt/K9SM1t".


So far I have:



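argurl <- readLines("http://ift.tt/1DRjJ9i")

# Line numbers of every line that contains a <br /> tag
# (fixed = TRUE matches the tag literally; the page uses "<br />", not "<br//>")
strong <- grep("<br />", argurl, fixed = TRUE)
# Line number where the Argentina entry starts
strong1starts <- grep("<br />Argentina", argurl, fixed = TRUE)
# Position of that line within the vector of <br /> line numbers
rowst1st <- which(strong == strong1starts)
# Stop one line before the next line that contains a <br /> tag
strong1ends <- strong[rowst1st + 1] - 1
data1 <- as.matrix(argurl[strong1starts:strong1ends])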
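
Since only the link text and the URL are needed, one possible approach is to match the anchor that follows the country name with a regular expression and capture both pieces directly. The sketch below uses base R only and works on a shortened copy of the markup quoted in the question rather than the live page. `extract_country_link` is a hypothetical helper name, and the pattern assumes each entry looks like `Country - <a ... href="...">text</a>` with no other tags between the country name and the link.

```r
# Shortened copy of the markup from the question; the live page is
# assumed (not verified) to follow the same pattern.
html <- paste0(
  '<span>Afghanistan - </span><span>',
  '<a title="Tax Authority in Afganistan" href="http://mof.gov.af/en">',
  'Ministry of Finance</a>',
  '<br />Argentina - <a title="Tax Authority in Argentina" ',
  'href="http://ift.tt/K9SM1t" style="color: rgb(51, 51, 51);">',
  'Federal Administration of Public Revenues</a><br />'
)

# Hypothetical helper, not part of any package.
extract_country_link <- function(html, country) {
  # Group 1 captures the href value, group 2 captures the link text.
  pattern <- paste0(country, ' - <a [^>]*href="([^"]*)"[^>]*>([^<]*)</a>')
  m <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(m) == 0) return(NULL)  # no entry found for this country
  list(url = m[2], title = m[3])
}

result <- extract_country_link(html, "Argentina")
print(result$title)  # "Federal Administration of Public Revenues"
print(result$url)    # "http://ift.tt/K9SM1t"
```

To run this against the real page, the lines returned by `readLines()` would first need to be collapsed into a single string, for example with `paste(argurl, collapse = "\n")`. The pattern would not match the Afghanistan entry as quoted, because `</span><span ...>` tags sit between the country name and its link there. A proper HTML parser is likely to be more robust than a regular expression if the markup varies.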



