Wednesday, August 10, 2016

Getting news articles from websites without RSS feeds?

I'm writing a program that collects news articles from news websites. Currently, I rely on RSS feeds, like this one from the BBC.

However, many major sites that I've come across don't have RSS feeds. What are some other ways that I can collect news articles from websites that don't have RSS feeds?

One option I've considered is to write a web crawler for these websites. However, I can see a couple of problems with that approach.

The biggest issue is that each website has a different structure. How would I distinguish what's an article and what's not?
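One possible heuristic, offered only as a rough sketch: some sites mark article pages with an Open Graph `og:type` meta tag set to `article`, or wrap the story in an HTML5 `<article>` element. The code below checks for either signal using only the Python standard library. The names `ArticleSignalParser` and `looks_like_article` are made up for illustration, and the assumption that a given site emits these markers will not hold everywhere.

```python
from html.parser import HTMLParser


class ArticleSignalParser(HTMLParser):
    """Collects two weak hints that a page is an article (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.og_type = None           # content of <meta property="og:type">, if any
        self.has_article_tag = False  # True once an <article> element is seen

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and attr_map.get("property") == "og:type":
            self.og_type = attr_map.get("content")
        elif tag == "article":
            self.has_article_tag = True


def looks_like_article(html):
    """Return True if the page carries either article marker.

    Assumption: the site publishes Open Graph metadata or uses <article>.
    Pages without either marker are reported as non-articles, which may be wrong.
    """
    parser = ArticleSignalParser()
    parser.feed(html)
    return parser.og_type == "article" or parser.has_article_tag
```

In practice this would probably be one signal among several, since section fronts and listing pages sometimes use `<article>` for each teaser as well.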

Another issue is categorizing these articles based on topic (politics, world, technology, sports, etc.).
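For the categorization problem, a crude first pass might be to read the topic from the URL path, since many news sites put the section name there. This is only a sketch under that assumption. The topic set comes from the list above, while the alias table and the function name `guess_topic` are hypothetical.

```python
from urllib.parse import urlparse

# Topics taken from the list in the question.
TOPICS = {"politics", "world", "technology", "sports"}

# Assumed spelling variants some sites might use; extend as needed.
ALIASES = {"tech": "technology", "sport": "sports"}


def guess_topic(url):
    """Return the first path segment that names a known topic, else None."""
    segments = urlparse(url).path.lower().strip("/").split("/")
    for segment in segments:
        segment = ALIASES.get(segment, segment)
        if segment in TOPICS:
            return segment
    return None
```

URLs that do not encode a section would still need another approach, such as classifying the article text itself.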

Does anyone have any suggestions?
