Thursday, June 30, 2016

Can we scrub HTML from user requests and yet deal with special characters like &?

We use BeautifulSoup to scrub HTML from our requests. Assume scrub is a configurable operation with varying levels of strictness, from removing all HTML to removing only dangerous elements. The code is something like:

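for k, v in request.form.iteritems():
    soup = BeautifulSoup(v)  # parse the submitted value
    soup.scrub()             # our own configurable scrub step, not a stock BeautifulSoup method
    request[k] = str(soup)   # store the scrubbed value back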

It usually works fine for both HTML and plain-text input. However, if the input is plain text that contains an &, it breaks:

str(BeautifulSoup('H&W Insurance'))  # gives 'H&W; Insurance'

Of course, I can fix it by HTML-escaping my input, but that won't work if the input really was HTML. And if I do nothing, the & is not going to work. Either way, something is going to break. Is there a way I can both scrub the HTML and still make my & work?
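To make the dilemma concrete, here is a minimal sketch using only Python's standard html.escape. BeautifulSoup is left out on purpose, and the two sample strings are made up for illustration:

```python
import html

# Two hypothetical inputs: one plain text, one that is already HTML.
plain = "H&W Insurance"
markup = "<b>H&amp;W</b> Insurance"

# Escaping protects the bare ampersand in plain text...
escaped_plain = html.escape(plain)    # 'H&amp;W Insurance'

# ...but applied to real HTML it escapes the tags themselves and
# double-escapes the entity that was already correct.
escaped_markup = html.escape(markup)  # '&lt;b&gt;H&amp;amp;W&lt;/b&gt; Insurance'

print(escaped_plain)
print(escaped_markup)
```

The same call that repairs the first input damages the second, which is the conflict described above.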

I think the only way this can be solved is to have some convention in the request that specifies the exact type of each input. The paradox is that I am trying to handle unexpected input, so I can't really specify anything. Is this really a solvable problem?
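For what such a convention might look like, here is a rough sketch in which the caller declares each field's type. The function name clean_field, the "text"/"html" labels, and the regex tag stripper are all assumptions made for illustration. The regex is only a crude stand-in for the real scrub step and is not safe sanitization:

```python
import html
import re

# Crude stand-in for the real scrub step: drops anything that looks like a tag.
# Not a secure sanitizer; it only keeps this sketch self-contained.
TAG_RE = re.compile(r"<[^>]+>")

def clean_field(value, content_type):
    """Clean one submitted value according to its declared type (hypothetical convention)."""
    if content_type == "text":
        # Declared plain text: escape it, so a bare & survives as &amp;.
        return html.escape(value)
    if content_type == "html":
        # Declared HTML: strip tags and leave existing entities untouched.
        return TAG_RE.sub("", value)
    raise ValueError("unknown content type: %r" % (content_type,))

print(clean_field("H&W Insurance", "text"))
print(clean_field("<b>H&amp;W</b> Insurance", "html"))
```

This only moves the problem: it works when the declared type can be trusted, which is exactly what unexpected input does not guarantee.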



