Getting news articles from websites without RSS feeds?

I'm writing a program that collects news articles from news websites. Currently, I rely on RSS feeds, like this one from the BBC.

However, many major sites that I've come across don't have RSS feeds. What are some other ways that I can collect news articles from websites that don't have RSS feeds?

One option I've considered is to write a web crawler for these websites. However, I can see a few problems with that approach.

The biggest issue is that each website has a different structure. How would I distinguish what's an article and what's not?
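To make the article-detection problem concrete, here is a minimal sketch of one possible heuristic. It assumes the site publishes Open Graph metadata and marks article pages with `<meta property="og:type" content="article">`; many news sites do, but nothing guarantees it, so treat this as a first filter rather than a complete answer. The names `OpenGraphTypeParser` and `looks_like_article` are hypothetical, made up for this illustration.

```python
from html.parser import HTMLParser


class OpenGraphTypeParser(HTMLParser):
    """Records the content of the first <meta property="og:type"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.og_type = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.og_type is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == "og:type":
            self.og_type = attr_map.get("content")


def looks_like_article(html: str) -> bool:
    """Heuristic (assumption): a page is an article if its og:type says so."""
    parser = OpenGraphTypeParser()
    parser.feed(html)
    return (parser.og_type or "").strip().lower() == "article"
```

Pages without the tag are simply reported as non-articles here, so a crawler would probably need a fallback (for example, checking for a long block of body text) on sites that omit this metadata.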

Another issue is categorizing these articles based on topic (politics, world, technology, sports, etc.).
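For the categorization problem, one cheap starting point is to guess the topic from the article URL, on the assumption that the site puts a section name such as `technology` or `sport` in the path. The sketch below is only that: the `SECTION_TO_TOPIC` table and the `guess_topic` function are hypothetical, and the section names are guesses that would need adjusting per site.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Hypothetical mapping from URL path tokens to the topics named in the question.
SECTION_TO_TOPIC = {
    "politics": "politics",
    "world": "world",
    "technology": "technology",
    "tech": "technology",
    "sport": "sports",
    "sports": "sports",
}


def guess_topic(url: str) -> Optional[str]:
    """Return the topic of the first recognised path token, or None."""
    path = urlparse(url).path.lower()
    # Split on slashes, hyphens, underscores, and dots to get candidate tokens.
    for token in re.split(r"[/\-_.]", path):
        if token in SECTION_TO_TOPIC:
            return SECTION_TO_TOPIC[token]
    return None
```

When this returns `None`, the URL carries no usable hint, and the article text itself would have to be classified some other way.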

Does anyone have any suggestions?

Wednesday, 10 August 2016
