web: How to decode and encode a web page with Python?

Tuesday, January 27, 2015

I use BeautifulSoup and urllib2 to download web pages, but different pages use different encodings, such as utf-8, gb2312, and gbk. For example, I fetch Sohu's home page with urllib2. That page is encoded in gbk, and in my code I decode it this way:


self.html_doc = self.html_doc.decode('gb2312','ignore')

But how can I know which encoding a page uses before I hand it to BeautifulSoup to decode it to Unicode? On most Chinese websites, there is no Content-Type field in the HTTP header.
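
One possible approach, sketched here with the standard library only, is to look for a declared charset first in the Content-Type header value (when there is one), then in a `<meta>` tag near the top of the page, and finally to fall back to trial decoding. The helper names (`charset_from_header`, `charset_from_meta`, `decode_page`) are hypothetical, and the fallback order (utf-8, then gb18030) is an assumption suited to Chinese pages, not a general rule. gb2312 and gbk are mapped to gb18030 because it is designed as a superset of both, which avoids dropping characters the way `decode('gb2312', 'ignore')` can on a gbk page.

```python
import re

# Hypothetical helpers for illustration; standard library only.
# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET_RE = re.compile(
    br'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE)


def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    if not content_type:
        return None
    match = re.search(r'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)',
                      content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(raw_bytes):
    """Return the charset declared in a <meta> tag in the first 4 KB, or None."""
    match = META_CHARSET_RE.search(raw_bytes[:4096])
    return match.group(1).decode('ascii').lower() if match else None


def decode_page(raw_bytes, content_type=None):
    """Decode raw page bytes; return a (text, encoding_used) pair."""
    declared = charset_from_header(content_type) or charset_from_meta(raw_bytes)
    # Assumption: treat gb2312/gbk declarations as gb18030, their superset.
    if declared in ('gb2312', 'gbk'):
        declared = 'gb18030'
    candidates = [declared] if declared else []
    candidates += ['utf-8', 'gb18030']  # assumed fallback order
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    # Last resort: keep going, marking undecodable bytes.
    return raw_bytes.decode('utf-8', 'replace'), 'utf-8'
```

Here `raw_bytes` would be the undecoded body, for example what `urllib2.urlopen(url).read()` returns, and the resulting text can then be passed to BeautifulSoup. Trial decoding is only a heuristic: a page in some other legacy encoding may decode without error and still produce the wrong characters.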
