mercredi 28 septembre 2016

Excel VBA: Extract Image Src attribute from HTML as string

I am trying to scrape my employers website to extract images from their Blog post en mass. I have started creating a scraping tool in Excel using VBA.

(We don't have access to the SQL database)

I have setup a work sheet that contains a list of post identifiers in column A and the URL of the post in column B.

My VBA script so far runs through the list of URL's in column B extracts the HTML from a Tag on the page by ID, using getElementById and pastes the resulting output as a string into column C.

I am now at the point where I am trying to figure out how to extract the src attribute from every image in the resulting output and paste it into the relevant columns. I can't for the life of me come up with an easy solution. I am not very familiar with RegEx and am struggling with Excel's built in string functions.

The end game is to get the macro to run through each image URL and save the image to disk with a filename format like "{Event No.}-{Image Number}".jpg

Any help would be much appreciated.

Worksheet setup

Sub Get_Image_SRC()

Dim sht As Worksheet
Dim LastRow As Long
Dim i As Integer
Dim url As String
Dim IE As Object
Dim objElement As Object
Dim objCollection As Object
Dim Elements As IHTMLElementCollection
Dim Element As IHTMLElement


Set sht = ThisWorkbook.Worksheets("Sheet1")
'Ctrl + Shift + End
LastRow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
For i = 2 To LastRow
    url = Cells(i, "C").Value
    MsgBox (url)
    IE.navigate url
    Application.StatusBar = url & " is loading..."
    Do While IE.readyState = 4: DoEvents: Loop
    Do Until IE.readyState = 4: DoEvents: Loop
    Application.StatusBar = url & " Loaded"
    If Cells(i, "B").Value = "WEBNEWS" Then
        Cells(i, "D").Value = IE.document.getElementById("NewsDetail").outerHTML
       Else
        Cells(i, "D").Value = IE.document.getElementById("ReviewContainer").outerHTML
    End If



Next i

Set IE = Nothing
Set objElement = Nothing
Set objCollection = Nothing

End Sub

Example resulting HTML:

<div id=""NewsDetail""><div class=""NewsDetailTitle"">Video: Race Face Behind the Scenes Tour</div><div class=""NewsDetailImage""><img alt=""HeadlinesThumbnail.jpg"" src=""/ImageHandler/6190/515/1000/0/""></div>    <div class=""NewsDetailBody"">Pinkbike posted this video a while ago, if you missed it, its' definitely worth a watch. 

Ken from Camp of Champions took a look at their New Westminster factory last year which gives a look at the production, people and culture of Race Face. The staff at Race Face are truly their greatest asset they had, best wishes to everyone!

<p><center><object width=""500"" height=""281""><param name=""allowFullScreen"" value=""true""><param name=""AllowScriptAccess"" value=""always""><param name=""movie"" value=""http://ift.tt/2dr45JB""><embed width=""500"" height=""281"" src=""http://ift.tt/2dr45JB"" type=""application/x-shockwave-flash"" allowscriptaccess=""always"" allowfullscreen=""true""></object></center><p></p>


</div><div class=""NewsDate"">Published Friday, 25 November 2011</div></div>"

My current references




Aucun commentaire:

Enregistrer un commentaire