Wednesday, 28 January 2015

Web scraping nested/hierarchical data

I am trying to read film data from a page like this: http://ift.tt/1EqIOto


If I use a DOM selector to get the a > b elements, I get the list of film titles shown below. So far so good. But I also want to get the dates for each of those films, and then the times for each of those dates.



result: [ "A Most Violent Year", "American Sniper", "Birdman", "Ex Machina", "Foxcatcher", "Into the Woods", "Kingsman: The Secret Service", "Testament of Youth", "The Imitation Game", "The Theory of Everything", "Whiplash", "Wild" ]


I then need to make another query to get the dates, which are in small tags, but only those nested within or adjacent to "A Most Violent Year". After that, I need the list of times adjacent to "Wednesday 28th January", and so on for every film and date.
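If the dates and times turn out to be adjacent siblings of each title rather than nested inside it, one possible approach is a single pass in document order that remembers the most recent title and the most recent date. The sketch below uses Python's standard html.parser instead of cheerio, purely so it runs without extra packages. The markup in SAMPLE_HTML is hypothetical (the real page structure is not shown here), as are the names ListingParser and parse_listings and the assumption that each time sits in a span.

```python
from html.parser import HTMLParser

# Hypothetical flat markup: dates and times follow each title as siblings.
SAMPLE_HTML = """
<div>
  <a href="#"><b>A Most Violent Year</b></a>
  <small>Wednesday 28th January</small>
  <span>18:00</span><span>20:45</span>
  <small>Thursday 29th January</small>
  <span>17:30</span>
  <a href="#"><b>American Sniper</b></a>
  <small>Wednesday 28th January</small>
  <span>19:15</span>
</div>
"""


class ListingParser(HTMLParser):
    """Files each date under the latest title, each time under the latest date."""

    def __init__(self):
        super().__init__()
        self.films = {}    # title -> {date -> [times]}
        self.stack = []    # names of the currently open tags
        self.title = None  # most recent film title seen
        self.date = None   # most recent date seen for that film

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to (and including) the matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        tag = self.stack[-1]
        parent = self.stack[-2] if len(self.stack) > 1 else None
        if tag == "b" and parent == "a":            # same idea as a > b
            self.title, self.date = text, None
            self.films[text] = {}
        elif tag == "small" and self.title is not None:
            self.date = text
            self.films[self.title][text] = []
        elif tag == "span" and self.date is not None:  # assumed time element
            self.films[self.title][self.date].append(text)


def parse_listings(html):
    parser = ListingParser()
    parser.feed(html)
    return parser.films


listings = parse_listings(SAMPLE_HTML)
print(listings["A Most Violent Year"]["Wednesday 28th January"])
```

The result is a dictionary keyed by title, then by date, so the times for one film on one day can be looked up directly. The tag names would need adjusting to whatever the real page uses.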


I've looked at some node packages like cheerio and noodlejs, but I can't work out how to get only the matches that are within each of the original matches.
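Restricting matches to each original match usually means running the second query relative to the matched element (or its container) rather than against the whole document. In cheerio this would typically be done with .find() on a matched element. The sketch below shows the same idea with Python's standard xml.etree.ElementTree, so it runs without extra packages. It assumes a nested layout in which each film has its own li container and each date has its own div. That markup, the NESTED_SAMPLE name and the scrape_nested function are all hypothetical. ElementTree only accepts well-formed markup, so a real page would need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested markup: one <li> per film, one <div> per date.
NESTED_SAMPLE = """
<ul>
  <li>
    <a href="#"><b>A Most Violent Year</b></a>
    <div><small>Wednesday 28th January</small><span>18:00</span><span>20:45</span></div>
    <div><small>Thursday 29th January</small><span>17:30</span></div>
  </li>
  <li>
    <a href="#"><b>American Sniper</b></a>
    <div><small>Wednesday 28th January</small><span>19:15</span></div>
  </li>
</ul>
"""


def scrape_nested(markup):
    """Return {title: {date: [times]}} using queries scoped to each film."""
    root = ET.fromstring(markup.strip())
    films = {}
    for film in root.findall("li"):        # one container per film
        title = film.find("a/b").text      # same idea as the a > b selector
        films[title] = {
            # Both lookups start from `day`, which itself came from `film`,
            # so nothing belonging to another film can match.
            day.find("small").text: [t.text for t in day.findall("span")]
            for day in film.findall("div")
        }
    return films


nested = scrape_nested(NESTED_SAMPLE)
print(nested["American Sniper"])
```

The key point is that every inner query is called on an element returned by the outer query, so the hierarchy of the data mirrors the hierarchy of the loops.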




