Sunday, November 1, 2020

(Java) Is there a better way to scrape YouTube search data than using the YouTube API v3?

As some people know, the YouTube API v3 runs into its query limit quickly when you use the search option. Frustrated by that, I decided to scrape the YouTube search data myself. I am not experienced in web scraping, so I came up with the workaround shown below. Technically it works, but I feel it could be much better. Currently, any query that is spelled correctly gets a response in about the same time as the YouTube API v3, which I am happy about. A misspelled query takes longer, because I redo the search with the corrected query that YouTube suggests (I don't strictly have to, but I want to make sure I am getting relevant information).

So my question is: does anyone know a better approach to scraping data from YouTube than the one I am using? As I said, it technically works, and it gets the most relevant pieces of information.

Thanks!

import org.json.JSONArray;
import org.json.JSONObject;
import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Returns the first videoRenderer object on the search results page for the
// given path (appended to https://www.youtube.com), or null if none is found.
private JSONObject searchYoutube(String url) {
    String searchUrl = "https://www.youtube.com" + url.replace(" ", "+");
    try {
        Document doc = Jsoup.connect(searchUrl).get();

        // Find the script tag that carries the ytInitialData payload.
        Element element = doc.getElementsByTag("script").stream()
                .filter(script -> script.data().contains("// scraper"))
                .findFirst().orElse(null);
        if (element == null) {
            return null;
        }

        // Strip the surrounding markers so that only the JSON object remains.
        DataNode dataNode = element.dataNodes().get(0);
        String json = dataNode.getWholeData()
                .replace("// scraper_data_begin\n" + "var ytInitialData = ", "")
                .replace("// scraper_data_end", "");

        // Walk down to the list of search result items.
        JSONArray items = new JSONObject(json)
                .getJSONObject("contents")
                .getJSONObject("twoColumnSearchResultsRenderer")
                .getJSONObject("primaryContents")
                .getJSONObject("sectionListRenderer")
                .getJSONArray("contents").getJSONObject(0)
                .getJSONObject("itemSectionRenderer")
                .getJSONArray("contents");

        for (int i = 0; i < items.length(); i++) {
            JSONObject item = items.getJSONObject(i);
            if (item.has("didYouMeanRenderer")) {
                // The query was misspelled: search again with the corrected URL.
                System.out.println("LOGGER: searchUrl=" + searchUrl
                        + " didYouMeanRenderer=true");
                return searchYoutube(item.getJSONObject("didYouMeanRenderer")
                        .getJSONObject("correctedQueryEndpoint")
                        .getJSONObject("commandMetadata")
                        .getJSONObject("webCommandMetadata")
                        .getString("url"));
            }
            if (item.has("videoRenderer")) {
                return item.getJSONObject("videoRenderer");
            }
        }
        // No video result on the page.
        return null;
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
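
One fragile spot in the method above is `url.replace(" ", "+")`, which only handles spaces, so a query containing `&`, `+`, `#` or non-ASCII characters would produce a broken URL. A minimal sketch of building the path with the standard `URLEncoder` instead is shown here. The `SearchPath` class and `forQuery` method are hypothetical names, the `/results?search_query=` path is an assumption about the search page URL, and the `Charset` overload of `encode` assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original code.
final class SearchPath {
    private SearchPath() {}

    // Builds the path that searchYoutube() appends to https://www.youtube.com.
    // Assumes the search page is served at /results?search_query=<query>.
    static String forQuery(String query) {
        // URLEncoder turns spaces into '+' and percent-encodes reserved characters.
        return "/results?search_query=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }
}
```

With this, a call such as `searchYoutube(SearchPath.forQuery("cats & dogs"))` would request `/results?search_query=cats+%26+dogs`. The corrected URL taken from `didYouMeanRenderer` comes from YouTube itself and may already be encoded, so it is probably best passed through unchanged.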

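
The method above isolates the JSON by replacing exact marker strings, which stops working if the comment markers or their whitespace ever differ. As one possible hardening, the sketch below finds the `var ytInitialData = ` assignment and copies the object literal by matching braces, skipping over braces that appear inside string values. `InitialDataExtractor` and `extract` are hypothetical names, and the sketch assumes the payload is a single JSON object assigned on that statement.

```java
// Hypothetical helper, not part of the original code.
final class InitialDataExtractor {
    private static final String MARKER = "var ytInitialData = ";

    private InitialDataExtractor() {}

    // Returns the object literal assigned after MARKER, or null when the
    // marker is missing or the object is never closed.
    static String extract(String script) {
        int marker = script.indexOf(MARKER);
        if (marker < 0) {
            return null;
        }
        int start = script.indexOf('{', marker + MARKER.length());
        if (start < 0) {
            return null;
        }

        int depth = 0;
        boolean inString = false;
        boolean escaped = false;
        for (int i = start; i < script.length(); i++) {
            char c = script.charAt(i);
            if (inString) {
                // Inside a string literal: only track escapes and the closing quote.
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    return script.substring(start, i + 1);
                }
            }
        }
        return null;
    }
}
```

The returned string could then be handed to `new JSONObject(...)` in place of the two `replace` calls, with a null check for pages that do not contain the marker.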

