As some people know, the YouTube Data API v3 limits queries pretty quickly when you use the search option. Frustrated by that, I decided to scrape the YouTube data myself. I am not experienced in web scraping, so I came up with the workaround shown below. Technically it works, but I feel it could be much better. Currently, any correctly spelled search returns in about the same time as the YouTube API v3, which I am happy with. A misspelled search takes longer, because I repeat the search with the corrected query that YouTube suggests (I don't strictly have to, but I want to make sure I am getting relevant information).
So my question is: does anyone know a better approach to scraping data from YouTube than the one I am using? As I said, it technically works, and it gets the most relevant pieces of information.
Thanks!
// Requires jsoup (org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element)
// and org.json (org.json.JSONArray, org.json.JSONObject).
private JSONObject searchYoutube(String url) {
    // url is a path plus query string; spaces in the query become '+'.
    String searchUrl = "https://www.youtube.com" + url.replace(" ", "+");
    try {
        Document doc = Jsoup.connect(searchUrl).get();
        // Find the inline script that carries the initial search data.
        Element element = doc.getElementsByTag("script").stream()
                .filter(script -> script.data().contains("// scraper"))
                .findFirst().orElse(null);
        if (element == null) {
            return null;
        }
        // Strip the JavaScript wrapper around the JSON payload.
        String json = element.data()
                .replace("// scraper_data_begin\n" + "var ytInitialData = ", "")
                .replace("// scraper_data_end", "");
        // Walk down to the item list of the first result section.
        JSONArray items = new JSONObject(json)
                .getJSONObject("contents")
                .getJSONObject("twoColumnSearchResultsRenderer")
                .getJSONObject("primaryContents")
                .getJSONObject("sectionListRenderer")
                .getJSONArray("contents").getJSONObject(0)
                .getJSONObject("itemSectionRenderer")
                .getJSONArray("contents");
        for (int i = 0; i < items.length(); i++) {
            JSONObject item = items.getJSONObject(i);
            if (item.has("didYouMeanRenderer")) {
                // The query was misspelled: search again with the corrected query.
                System.out.println("LOGGER: searchUrl=" + searchUrl +
                        " didYouMeanRenderer=true");
                return searchYoutube(item.getJSONObject("didYouMeanRenderer")
                        .getJSONObject("correctedQueryEndpoint")
                        .getJSONObject("commandMetadata")
                        .getJSONObject("webCommandMetadata")
                        .getString("url"));
            }
            if (item.has("videoRenderer")) {
                // The first video result is the most relevant one.
                return item.getJSONObject("videoRenderer");
            }
        }
        return null; // no video in the first result section
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
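One part of the method above that feels fragile to me is the pair of chained replace calls that strip the wrapper around the JSON: if the comment markers or the whitespace in the page change, the string no longer parses. Here is a minimal sketch of an alternative that finds the object literal after a marker by matching braces. The class name InitialDataExtractor and the method extractJson are hypothetical names made up for this illustration, and the sketch assumes the payload is plain JSON with double-quoted strings.

```java
// Hypothetical helper, not part of jsoup or org.json.
final class InitialDataExtractor {

    private InitialDataExtractor() {
    }

    /**
     * Returns the first balanced {...} object that follows the marker,
     * or null if the marker or a balanced object is not found.
     */
    static String extractJson(String scriptText, String marker) {
        int markerAt = scriptText.indexOf(marker);
        if (markerAt < 0) {
            return null;
        }
        int start = scriptText.indexOf('{', markerAt + marker.length());
        if (start < 0) {
            return null;
        }

        int depth = 0;            // current brace nesting level
        boolean inString = false; // inside a double-quoted string?
        boolean escaped = false;  // was the previous character a backslash?
        for (int i = start; i < scriptText.length(); i++) {
            char c = scriptText.charAt(i);
            if (inString) {
                // Braces inside strings must not affect the depth.
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    return scriptText.substring(start, i + 1);
                }
            }
        }
        return null; // the object never closed
    }
}
```

With this, the two replace calls could become something like InitialDataExtractor.extractJson(element.data(), "var ytInitialData = "), with a null check before handing the result to new JSONObject(...). It still depends on the variable name staying the same, so it is only slightly less brittle, not a real fix.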