mardi 27 janvier 2015

how to decode and encode web page with python?

I use Beautifulsoup and urllib2 to download web pages, but different web page has a different encode method, such as utf-8,gb2312,gbk. I use urllib2 get sohu's home page, which is encoded with gbk, but in my code ,i also use this way to decode its web page:



self.html_doc = self.html_doc.decode('gb2312','ignore')


But how can I konw the encode method the pages use before I use BeautifulSoup to decode them to unicode? In most Chinese website, there is no content-type in http Header's field.





Aucun commentaire:

Enregistrer un commentaire