Extract URLs from HTML/text when the URL is only partial, like "/secondpage.html"?

Saturday, October 3, 2015

I'm trying to extract a URL from an HTML snippet in string format.

I've been using a regex to retrieve the part between href=" and ". However, I noticed that in some cases the href is a relative URL: it links to a page within the same website and omits the root URL. For example, a snippet can look like this:

<div class="textcontent" id="desc">
<br>
<a rel="nofollow" href="/confirm/url/aHR0cHLy9yYZy50bw%3D%3D/"  class="ajaxLink">link</a><br>

Instead of the more usual:

<a href="http://google.com">Google</a>

In that case I can just use this regex to narrow down my results:

/href\s*=\s*".*?"/
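For illustration, here is a minimal sketch of how a regex like this could be applied to a string to pull out only the href values. The helper name extractHrefs is hypothetical, and the sketch assumes double-quoted attributes and uses \s* for the whitespace around "=":

```javascript
// Hypothetical helper: collect the raw href values from an HTML string.
// Assumes every href attribute is double-quoted.
function extractHrefs(html) {
  // \s* tolerates spaces or line breaks around "=";
  // the capture group keeps only the text between the quotes.
  const pattern = /href\s*=\s*"(.*?)"/g;
  return Array.from(html.matchAll(pattern), (match) => match[1]);
}

const snippet =
  '<a rel="nofollow" href="/confirm/url/aHR0cHLy9yYZy50bw%3D%3D/"  class="ajaxLink">link</a><br>' +
  '<a href="/secondpage.html">next</a>';

console.log(extractHrefs(snippet));
// → ["/confirm/url/aHR0cHLy9yYZy50bw%3D%3D/", "/secondpage.html"]
```

The values come back exactly as written in the markup, so relative links are still relative at this point.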

I looked around Stack Overflow and found a few posts about extracting URLs from HTML/text, including a mention of using an external library like JSoup. This is for a Chrome Extension, so I'm hoping to keep it lightweight (in case an extra library would be an issue).

Are there any good solutions for this "partial URL" problem? Would it be best to just check each URL and prepend the root if it is missing, or would an external library like JSoup be the better choice?
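As a rough sketch of the "check and append" option, the standard URL constructor available in Chrome can resolve a partial href against a base address without any extra library. The helper name toAbsoluteUrl and the base address https://example.com/firstpage.html are placeholders, and the sketch assumes the address of the page the snippet came from is known:

```javascript
// Hypothetical helper: turn an href (relative or absolute) into an absolute URL.
// baseUrl is assumed to be the address of the page the snippet was taken from.
function toAbsoluteUrl(href, baseUrl) {
  try {
    // Absolute hrefs are returned unchanged; relative ones are resolved against baseUrl.
    return new URL(href, baseUrl).href;
  } catch (error) {
    return null; // href could not be parsed as a URL
  }
}

const base = "https://example.com/firstpage.html"; // placeholder base URL

console.log(toAbsoluteUrl("/secondpage.html", base));
// → "https://example.com/secondpage.html"
console.log(toAbsoluteUrl("https://www.google.com/", base));
// → "https://www.google.com/"
```

One thing to keep in mind with this approach: an href with no scheme, such as "google.com", is treated as a relative path and resolved against the base, which matches how a browser would follow the link.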
