web: Extract all text under an h2 tag using scrapy

Thursday, 29 October 2020

I need to find an h2 tag with a certain value and extract all the text that follows it, up to the next h2 tag or the end of the page. So if the page is

<h1 id="DDPSupport-InternalResources"><span style="color: rgb(0,51,102);"><strong>Internal Resources</strong></span></h1>
<h2 id="DDPSupport-GeneralInformation">General Information</h2>
<ul><li><a href="/display/ladtechtme/DDP+overview">DDP overview</a></li>
<li><a href="/display/ladtechtme/DDP+Configuration+guide">DDP Config guide</a></li>
<li><a href="/pages/viewpage.action?pageId=1338281922">Custom DPR</a></li>
<li><a href="/display/ladtechtme/Build+custom+package">Build custom package</a></li>
<li><a href="/display/ladtechtme/Unit+testing">Unit testing</a></li>
<li><a href="/display/ladtechtme/FAQ">FAQ </a></li>
<li><a href="/display/ladtechtme/Misc+BKMs">Misc BKMs</a></li></ul>
<h2 id="DDPSupport-UseCases">Use Cases</h2>
<ul><li><a href="/pages/viewpage.action?pageId=1338281922">Custom DPR </a></li>...

then the expected output is

DDP overview
DDP Config guide
Custom DPR
Build custom package
Unit testing
FAQ
Misc BKMs

I am using the following code:

for head in response.xpath("//div[@class='wiki-content']/h2"):
    if sub == 'General Information':
        lines = head.xpath("//following-sibling::*[count(following-sibling::h2)=1]//text()").extract()
        print(str(lines))

I get some output, but not the desired one: it consists of the text of the next h2 tag. Any help would be appreciated.
