Tuesday, January 7, 2020

Getting text from my website pages: Python script

I have a script that used to work, but it has been a year since I last used it. Now it raises an error that I'm not sure how to solve. I would also like to refine the code so that I no longer have to list every web page by hand and can instead process everything under the domain.

I have previously tried to install Beautiful Soup, but for some reason it doesn't work for me: I install it, but I can't get Spyder/Anaconda to recognize that the library exists.
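
A common cause of this symptom is that the package was installed into a different Python environment than the one Spyder runs. The sketch below, run from the Spyder console, reports which interpreter is in use and whether it can see a given package. The helper name describe_environment is made up for illustration; bs4 is the import name of Beautiful Soup.

```python
import importlib.util
import sys


def describe_environment(package="bs4"):
    """Report the running interpreter and whether it can import `package`."""
    spec = importlib.util.find_spec(package)
    return {
        "interpreter": sys.executable,          # the Python that Spyder is actually running
        "package": package,
        "found": spec is not None,              # False means this interpreter cannot import it
        "location": spec.origin if spec else None,
    }


print(describe_environment())
```

If "found" is False, the package probably needs to be installed with the interpreter shown in the output (for example by running that interpreter with "-m pip install beautifulsoup4", or with conda inside the same environment). This is an assumption about the setup, not something the traceback confirms.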

This is the error I am getting:

runfile('F:/CRM/CRM/translations/Python script for text from website pages.py', wdir='F:/CRM/CRM/translations')
Traceback (most recent call last):

  File "<ipython-input-13-2f567a94e1f6>", line 1, in <module>
    runfile('F:/CRM/CRM/translations/Python script for text from website pages.py', wdir='F:/CRM/CRM/translations')

  File "C:\Users\Gittel\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)

  File "C:\Users\Gittel\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "F:/CRM/CRM/translations/Python script for text from website pages.py", line 30, in <module>
    file.write(text)

  File "C:\Users\Gittel\AppData\Local\Continuum\anaconda3\lib\encodings\cp1255.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncode
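
The traceback ends inside the cp1255 codec, called from file.write(text). That suggests the file was opened with the Windows default encoding (cp1255 on this machine), which cannot represent every character found on the pages. The following is a minimal sketch of that failure and of one possible workaround, passing encoding="utf-8" to open(). The sample string and the temporary file are made up for illustration.

```python
import locale
import os
import tempfile

# Hypothetical sample text: an accented Latin letter and two CJK characters,
# none of which exist in the cp1255 (Windows Hebrew) code page.
sample = "caf\u00e9 \u4e2d\u6587"

try:
    sample.encode("cp1255")
    cp1255_ok = True
except UnicodeEncodeError as error:
    cp1255_ok = False
    print("cp1255 cannot encode this text:", error)

# When no encoding is given, open() in text mode uses this platform-dependent value.
print("default encoding for open():", locale.getpreferredencoding(False))

# Writing with an explicit UTF-8 encoding avoids depending on the platform default.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write(sample)
with open(path, encoding="utf-8") as handle:
    round_trip = handle.read()
```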

This is my script:

import urllib.request
from inscriptis import get_text
sitelist = ["https://grapaes.com",
"https://grapaes.com/events/past-events",
"https://grapaes.com/about-us-our-story",
"https://grapaes.com/about-us-our-story/story",
"https://grapaes.com/worldwide",
"https://grapaes.com/varieties/arra-branding",
"https://grapaes.com/press",
"https://grapaes.com/press/media",
"https://grapaes.com/press/newsletters",
"https://grapaes.com/about-us-our-story/team",
"https://grapaes.com/varieties",
"https://grapaes.com/events",
"https://grapaes.com/varieties/varieties-red-varieties",
"https://grapaes.com/varieties/varieties-black-varieties",
"https://grapaes.com/varieties/varieties-white-varieties",
"https://grapaes.com/partners",
]
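for url in sitelist:
    # Download the page and decode the bytes as UTF-8.
    html = urllib.request.urlopen(url).read().decode('utf-8')
    # Convert the HTML to plain text.
    text = get_text(html)
    # Build the output file name from the URL.
    name = url.replace("/", ".")
    name1 = name.replace("https:..grapaes.com.", "site - ")
    # Write as UTF-8 explicitly; the Windows default (cp1255 here)
    # cannot encode every character and raises the error shown above.
    with open(name1 + ".doc", "w", encoding="utf-8") as file:
        file.write(text)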
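
For the second goal, processing everything under the domain without a hand-written list, one possible approach is a small breadth-first crawler that follows links staying on the same host. This is only a sketch using the standard library: the names LinkParser, fetch and crawl, and the max_pages limit, are made up for illustration. It only finds pages that are reachable through ordinary <a href> links.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
import urllib.request


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def fetch(url):
    """Download a page and return its HTML as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=200):
    """Return same-host URLs reachable from start_url, in visit order."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to download
        pages.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links, drop "#fragment" parts and trailing slashes.
            link = urldefrag(urljoin(url, href)).url.rstrip("/")
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Under these assumptions, the hand-written list could be replaced with sitelist = crawl("https://grapaes.com"). The fetch_page parameter exists so that the function can be tried with canned HTML before pointing it at the live site.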