Thursday, 19 January 2017

How to extract text from HTML code?

I'm writing an HTML parser and I want to extract the text from HTML code. My question is: between which tags is the text found? Usually it should be in paragraph tags, but on Google Blogger, for example, it is in meta tags. This is for information retrieval, so by "text" I mean the body of the document, for example the body of an article. I'm not posting any code because the parser itself is already written; I only need to adjust the delimiters in the parse tree, and maybe some regexps. Any help?
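
To make the question concrete, here is a minimal sketch of the kind of extraction I mean, written with Python's standard `html.parser` module rather than my own parser. The names `TextExtractor`, `extract_text`, `TEXT_TAGS` and `SKIP_TAGS` are hypothetical, and the choice of which tags count as text is only an assumption, which is exactly the part I am unsure about. It collects the text found inside paragraph-like tags and, separately, the `content` attribute of `meta` tags.

```python
from html.parser import HTMLParser

# Hypothetical tag sets for this sketch; adjust them to the pages being parsed.
TEXT_TAGS = {"p", "h1", "h2", "h3", "li", "blockquote"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects text inside TEXT_TAGS and the content of <meta> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_text = False   # currently inside a text-bearing tag
        self.skip_depth = 0    # > 0 while inside <script> or <style>
        self.buffer = []       # pieces of the block being read
        self.blocks = []       # finished text blocks, in document order
        self.meta = {}         # meta name/property -> content

    def flush(self):
        # Join the buffered pieces and collapse runs of whitespace.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.blocks.append(text)
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in TEXT_TAGS:
            self.flush()  # also copes with an unclosed <p> before this one
            self.in_text = True
        elif tag == "meta":
            attributes = dict(attrs)
            key = attributes.get("name") or attributes.get("property")
            content = attributes.get("content")
            if key and content:
                self.meta[key] = content

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in TEXT_TAGS:
            self.flush()
            self.in_text = False

    def handle_data(self, data):
        # Keep text only inside a text-bearing tag and outside script/style.
        if self.in_text and self.skip_depth == 0:
            self.buffer.append(data)


def extract_text(html):
    """Return (meta dictionary, list of text blocks) for an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep text from a final block that was never closed
    return parser.meta, parser.blocks


sample = """<html><head><title>T</title>
<meta name="description" content="A short summary.">
<style>p { color: red; }</style></head>
<body><div>menu</div><p>First <b>bold</b> paragraph.</p>
<script>var x = '<p>not text</p>';</script>
<p>Second   one.</p></body></html>"""

meta, blocks = extract_text(sample)
print(meta)    # {'description': 'A short summary.'}
print(blocks)  # ['First bold paragraph.', 'Second one.']
```

In this sketch, text that sits directly in a `div` (such as the "menu" item above) is dropped, and text that follows a nested block inside an `li` would be dropped too, so the tag sets and the nesting rule would need tuning for real pages.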



