Sunday, 21 August 2016

Fetch markup between sets of tags

I am trying to write a small application to extract content from Wikipedia pages. When I first thought of it, I assumed I could just target the divs containing content with XPath, but after looking into how Wikipedia builds its articles, I quickly discovered that it wouldn't be so easy. The best way to separate the content once I have the page is to select what's between two sets of h2 tags.

Example: <h2>Title</h2> <div>Some Content</div> <h2>Title</h2>

Here I would want to get the div between the two headers. I tried doing this with XPath, but with no luck at all. I am going to look further into XPath because I think that's what I need to achieve what I want, but before I go too deep, I would like to hear what you think. Is XPath the right way to go, or do I have other, easier options? I am writing the application in C#, if that makes any difference.
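To make the goal concrete, here is a minimal sketch of one way the "between two h2 tags" selection could be expressed with XPath in C#. It assumes the markup is well-formed XML wrapped in a single root element (the `body` wrapper and the variable names are hypothetical, not taken from Wikipedia); real Wikipedia HTML would probably need an HTML parser first, so treat this as an illustration of the query rather than a finished solution.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

class Program
{
    static void Main()
    {
        // Hypothetical, well-formed stand-in for the fetched page content.
        var root = XElement.Parse(
            "<body><h2>Title</h2><div>Some Content</div><h2>Title</h2></body>");

        // Starting from the first h2, take the following siblings that are
        // not h2 elements themselves and that have exactly one h2 before
        // them, i.e. everything up to (but excluding) the second h2.
        var section = root.XPathSelectElements(
            "h2[1]/following-sibling::*[not(self::h2)][count(preceding-sibling::h2) = 1]");

        foreach (var element in section)
        {
            Console.WriteLine(element);
        }
    }
}
```

Changing the `1` in `h2[1]` and in the `count(...)` test to `2` would select the content after the second header instead, and so on for later sections.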



