I am trying to scrape website/post-sitemap.xml, which lists every URL published on a WordPress site. As a first step, I need to build a list of all the URLs in the post sitemap. When I call requests.get and inspect the output, it appears to open all of the internal URLs as well, which I did not expect. My plan is to collect all the URLs into a list first and then loop over them, scraping each URL in a separate function. Below is the code I have so far. The final output I need is a list of all the URLs; any help from Python experts would be appreciated.
I have tried both requests.get and urlopen, but neither seems to open only the base URL for /post-sitemap.xml.
import pandas as pd
import numpy as np
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re


class wordpress_ext_url_cleanup(object):
    def __init__(self, wp_url):
        self.wp_url_raw = wp_url
        self.wp_url = wp_url + '/post-sitemap.xml/'

    def identify_ext_url(self):
        html = requests.get(self.wp_url)
        print(self.wp_url)
        print(html.text)
        soup = BeautifulSoup(html.text, 'lxml')
        #print(soup.get_text())
        raw_data = soup.find_all('tr')
        print(raw_data)
        #for link in raw_data:
            #print(link.get("href"))


def main():
    print("Inside Main Function")
    url = "http://punefirst dot com"  #(knowingly removed the . so it doesnt look spammy)
    first_call = wordpress_ext_url_cleanup(url)
    first_call.identify_ext_url()


if __name__ == '__main__':
    main()
I need all 548 URLs in the post sitemap as a list, which I will pass to the next function for further scraping.
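For reference, here is a minimal sketch of the output I am after. It assumes the sitemap follows the standard sitemaps.org XML format, where each post URL sits in a <loc> element inside a <url> entry. I have not confirmed that my site's sitemap is laid out this way. The function names extract_sitemap_urls and fetch_sitemap_urls are hypothetical, and it uses the standard library XML parser instead of BeautifulSoup.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Namespace used by standard sitemaps.org files (an assumption about this site).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(xml_text):
    """Return the text of every <loc> element in the sitemap namespace."""
    root = ET.fromstring(xml_text)
    # Matching on the sitemap namespace skips <loc> tags from other
    # namespaces, such as image entries some plugins add.
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def fetch_sitemap_urls(sitemap_url):
    """Download the sitemap once and return its URLs as a list."""
    with urlopen(sitemap_url, timeout=30) as response:
        return extract_sitemap_urls(response.read())
```

Calling fetch_sitemap_urls with the full sitemap address (no trailing slash after post-sitemap.xml) would make a single request and return a plain Python list that the next function can loop over.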