dimanche 18 août 2019

Web scraping using Python and Beautiful Soup for /post-sitemap.xml/

I am trying to scrape a page website/post-sitemap.xml which contains all url's posted for a wordpress website. In the first step, I need to make a list of all the url's present in post-sitemap. When I use requests.get and I check the output, it opens all of the internal urls as well, which is weird. My intention is to make a list of all url's first and then using a loop, I will scrape individual url's in the next function. Below is the code I have done so far. I would need all url's as a list as my final output if python gurus can help.

I have tried using requests.get and openurl but nothing seems to open only the base url for /post-sitemap.xml

import pandas as pd
import numpy as np
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re

class wordpress_ext_url_cleanup(object):
    def __init__(self,wp_url):
        self.wp_url_raw = wp_url
        self.wp_url = wp_url + '/post-sitemap.xml/'

    def identify_ext_url(self):
        html = requests.get(self.wp_url)
        print(self.wp_url)
        print(html.text)
        soup = BeautifulSoup(html.text,'lxml')
        #print(soup.get_text())
        raw_data = soup.find_all('tr')
        print (raw_data)
        #for link in raw_data:
            #print(link.get("href"))

def main():
    print ("Inside Main Function");
    url="http://punefirst dot com" #(knowingly removed the . so it doesnt look spammy)
    first_call = wordpress_ext_url_cleanup(url)
    first_call.identify_ext_url()


if __name__ == '__main__':
    main()

I would need all 548 url's present in the post sitemap as a list which I will use it for the next function for further scraping.




Aucun commentaire:

Enregistrer un commentaire