Thursday, November 18, 2021

Can't scrape an HTML website for details using Python because of an embedded API

I'm again having an issue with a KuCoin page. I've decided to use the main RSS feed to get the new crypto names and the URLs of their listing pages. I have successfully put the name and link for each into a nested list, and I have a loop that iterates through them and calls requests.get(URL). I am trying to parse each page for two key pieces of information: the issue date and the issue price. However, it seems that these pages also load their content through some sort of API.

import feedparser
import requests

rss_url = "https://www.kucoin.com/rss/news?lang=en"  # KuCoin RSS feed.
feed = feedparser.parse(rss_url)  # Parse the feed into entries.
Coins_To_Check = []  # List of [coin, link] pairs.
for post in feed.entries:  # Iterate through all feed entries.
    if ") Gets Listed on KuCoin" in post.title:  # Keep only listing announcements.
        x = post.title.index("(") + 1  # Position just after '(' in the title.
        y = post.title.index(")")  # Position of ')' in the title.
        coin = post.title[x:y]  # Coin name between the parentheses.
        link = post['link']  # URL of the listing announcement.
        Coins_To_Check.append([coin, link])  # Append [coin, link] to the main list.
for eachURL in Coins_To_Check:  # Iterate through the list of lists.
    CoinURL = eachURL[1]  # URL of the current coin's listing page.
    CoinURLresponse = requests.get(CoinURL, timeout=10)  # GET the listing page.

    print(CoinURLresponse)
    # Prints: <Response [200]>

    # print(json.dumps(CoinURLresponse, indent=2)) fails with:
    #     raise TypeError(f'Object of type {o.__class__.__name__} '
    # TypeError: Object of type Response is not JSON serializable
    # A Response is not a dict or list, so print the start of the body text instead.
    print(CoinURLresponse.text[:500])
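
The TypeError above arises because json.dumps expects plain Python data (dicts, lists, strings, numbers), not a requests.Response object. Below is a minimal sketch of one way to inspect a response body instead. It assumes you pass in the body text and content type yourself; with requests these would come from response.text and response.headers.get("Content-Type", ""). The helper name pretty_body and the sample bodies are hypothetical.

```python
import json


def pretty_body(body_text, content_type=""):
    """Return a printable view of an HTTP response body.

    Hypothetical helper: pretty-prints the body when it parses as JSON,
    otherwise returns the first 500 characters of the raw text.
    """
    looks_like_json = body_text.lstrip().startswith(("{", "["))
    if "json" in content_type.lower() or looks_like_json:
        try:
            return json.dumps(json.loads(body_text), indent=2)
        except json.JSONDecodeError:
            pass  # Not valid JSON after all; fall back to raw text.
    return body_text[:500]


# Made-up sample bodies, not real KuCoin responses.
print(pretty_body('{"ticker": "ABC"}', "application/json"))
print(pretty_body("<html><body>listing page</body></html>", "text/html"))
```

If the listing page comes back as HTML rather than JSON, the second branch applies, and the data of interest may not be in that HTML at all.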

Listings page: https://www.kucoin.com/news/categories/listing

Sample Link: https://www.kucoin.com/news/en-earthfund-1earth-gets-listed-on-kucoin

KuCoin listing page example, with the items I'm looking to save to a list and eventually add to each sublist within Coins_To_Check

Example of the F12 details, with a URL that changes for each coin

I have tried F12 to get the XPath, but it does not show up in the HTML of the site. I have also tried to get the API link from the Network > Headers section, but it seems to be different for each coin, so it cannot be hard-coded, and I don't know how to obtain it through code.

The XPath seems to be the same every time:

  List Date:    //*[@id="root"]/div/div/div[3]/div/div[2]/div[1]/div/div[2]/div/div/ul/li[2]/span/text()
                
  List Price:   //*[@id="root"]/div/div/div[3]/div/div[2]/div[1]/div/div[2]/div/div/table/tbody/tr[4]/td[3]/span
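
One way to check whether these XPath targets are rendered client-side is to see where a value shown in the browser appears in the HTML that requests actually receives. The sketch below uses only the standard library; the class TextCollector, the function locate, and the sample page are hypothetical, and the real KuCoin markup may differ.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, keeping <script> contents apart from visible text."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.visible = []
        self.script = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        (self.script if self.in_script else self.visible).append(data)


def locate(html, needle):
    """Report where `needle` appears in the server-delivered HTML."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    if needle in "".join(collector.visible):
        return "visible"  # Ordinary HTML parsing (XPath, CSS selectors) can reach it.
    if needle in "".join(collector.script):
        return "script"   # Embedded as data inside a <script> element.
    return "missing"      # Not in this HTML; likely fetched by a later request.


# Made-up page resembling a JavaScript-rendered app shell.
sample = (
    '<html><body><div id="root"></div>'
    '<script>window.data = {"price": "0.5"};</script></body></html>'
)
print(locate(sample, "0.5"))
print(locate(sample, "Listing Date"))
```

A result of "missing" for a value that is visible in the browser would suggest the page fetches it with a separate request after loading, which matches the API calls seen in the Network tab.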

I could just try brute-force parsing of the RSS content:

    for post in feed.entries:  # Iterate through all feed entries.
        if ") Gets Listed on KuCoin" in post.title:  # Keep only listing announcements.
            content = post['content'][0].value  # Full announcement body (a LOT of details).
            # Pick some smart marker to find the price, like '$'. This sometimes
            # does not work, though, when more than one '$' exists.
            price_pos = content.find("$")  # -1 if no '$' is present.
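
Where more than one '$' appears, a regular expression anchored on nearby wording may be more robust than searching for a single character. This is a minimal sketch, assuming the announcement text has the word "price" shortly before the amount; the pattern, the function find_price, and the sample string are hypothetical and not taken from an actual KuCoin announcement.

```python
import re

# Hypothetical pattern: the word "price", then up to 40 non-"$" characters,
# then a dollar amount such as $0.05 or $1,234.50.
PRICE_RE = re.compile(r"price[^$]{0,40}\$\s?(\d[\d,]*(?:\.\d+)?)", re.IGNORECASE)


def find_price(content):
    """Return the first dollar amount that follows the word 'price', or None."""
    match = PRICE_RE.search(content)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


# Made-up text with two '$' amounts; only the one after "price" is wanted.
sample = "Prize pool: $50,000 in rewards. Token sale price: $0.05 per token."
print(find_price(sample))  # 0.05
```

The 40-character window is an arbitrary choice; it would need tuning against real announcement text.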

Any help is much appreciated! I've been at this for hours 😓.



