web: Scrapy (python) responses alternating between bytes and utf8

Tuesday, 13 August 2019

Scrapy (python) responses alternating between bytes and utf8

I am using the Scrapy web crawler. When scraping a site, the responses keep alternating between readable HTML and opaque bytes. The bytes are supposedly UTF-8 encoded, but I receive an error when I try to decode them.

I have tried several different request headers for the encoding, including accepting gzip, deflate, br, and text/html;charset=utf-8, but they all give me the same issue. These are the headers I am currently sending:

    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "text/html;charset=utf-8",

I want to receive HTML instead of undecodable bytes. Below is a snippet of the HTML I expect, followed by a snippet of the bytes I actually receive.

Expected response

b'<!DOCTYPE html><html lang="en" xmlns:og="http://opengraphprotocol.org/schema/"><head><link rel="appl'

Actual response

b'\x93b\x92\x12)\x1d@I\xc1y\x00\x00h\xeb\x9d\x875\xaa\xd7\xc0\xfc\xb0q\x00\x00\xf0\x15\x0f\xdbF\xb1\xf3\x0f
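
One thing worth checking is whether the unreadable body is simply compressed rather than mis-encoded. `Accept-Encoding` is meant to list content codings such as gzip, deflate, or br, not a media type like text/html;charset=utf-8, and a server may still reply with a compressed body. The sketch below is only a diagnostic aid under those assumptions: `decode_body` is a hypothetical helper (not part of Scrapy), it assumes the `Content-Encoding` response header is still available, and the `br` branch assumes the third-party `brotli` package is installed.

```python
import gzip
import zlib


def decode_body(body: bytes, content_encoding: str = "") -> str:
    """Hypothetical helper: undo the content coding, then decode as UTF-8."""
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        body = gzip.decompress(body)
    elif encoding == "deflate":
        try:
            # Most servers send zlib-wrapped deflate data.
            body = zlib.decompress(body)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header.
            body = zlib.decompress(body, -zlib.MAX_WBITS)
    elif encoding == "br":
        # Assumption: the third-party "brotli" package is installed.
        import brotli
        body = brotli.decompress(body)
    # An empty or unrecognized coding leaves the body untouched.
    return body.decode("utf-8")
```

In a spider callback this might be called as `decode_body(response.body, (response.headers.get("Content-Encoding") or b"").decode())`. Scrapy's `HttpCompressionMiddleware` normally handles gzip and deflate on its own, so this is mainly useful for finding out which coding the server actually used.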
