I'm writing a program that collects news articles from news websites. Currently, I rely on RSS feeds, like this one from the BBC.
However, many of the major sites I've come across don't offer RSS feeds. What other ways are there to collect news articles from sites like these?
One option I've considered is writing a web crawler for these websites. However, I can already see a couple of problems with that approach.
The biggest issue is that each website has a different structure. How would I distinguish what's an article and what's not?
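To make the problem concrete, here is a rough sketch of one heuristic I've been toying with. It assumes that article pages either declare `og:type` as `article` in an Open Graph meta tag or contain exactly one `<article>` element. Neither assumption holds for every site, and the names `ArticleSignalParser` and `looks_like_article` are just ones I made up for illustration.

```python
from html.parser import HTMLParser


class ArticleSignalParser(HTMLParser):
    """Collects two weak signals that a page is a single article."""

    def __init__(self):
        super().__init__()
        self.og_type = None      # content of <meta property="og:type">, if any
        self.article_tags = 0    # number of <article> elements seen

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:type":
            # Attribute values can be None, so fall back to an empty string.
            self.og_type = (attrs.get("content") or "").strip().lower()
        elif tag == "article":
            self.article_tags += 1


def looks_like_article(html: str) -> bool:
    """Guess whether the HTML of a page is one article (heuristic only)."""
    parser = ArticleSignalParser()
    parser.feed(html)
    # Index pages often wrap every teaser in <article>, so only a page with
    # exactly one such element counts as a signal.
    return parser.og_type == "article" or parser.article_tags == 1
```

This would still misclassify sites that use neither convention, which is the part I'm stuck on.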
Another issue is categorizing these articles by topic (politics, world, technology, sports, etc.).
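The simplest thing I can think of for categorization is counting topic keywords, sketched below. The keyword lists in `TOPIC_KEYWORDS` and the `categorize` function are hypothetical placeholders I wrote for this example, not a tested taxonomy, and I assume English text.

```python
import re
from collections import Counter

# Hypothetical, hand-picked keyword lists; a real system would need far more.
TOPIC_KEYWORDS = {
    "politics": {"election", "parliament", "senate", "minister", "vote"},
    "technology": {"software", "smartphone", "startup", "chip", "internet"},
    "sports": {"match", "league", "goal", "tournament", "coach"},
}


def categorize(text: str, default: str = "uncategorized") -> str:
    """Return the topic whose keywords occur most often in the text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        topic: sum(words[word] for word in keywords)
        for topic, keywords in TOPIC_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # If no keyword matched at all, do not guess a topic.
    return best if scores[best] > 0 else default
```

For example, `categorize("The coach praised the team after the league match")` returns `"sports"`. I doubt this scales well, though, since ties and ambiguous words are handled arbitrarily.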
Does anyone have any suggestions?