Twitch has apparently gone out of its way to obscure its site in such a way that only a full-fledged desktop browser with JavaScript (and possibly more features) can actually display the "bio text" and "rules" texts for a channel/stream. If you look at the HTML source for any given channel/stream, you will not find this content anywhere in it. It is added after the page loads, with JavaScript, somehow. The JavaScript modifies the DOM "after the fact" so that a human using a browser sees the content, but a naive bot does not.
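To make the "a naive bot does not see it" point concrete, here is a minimal Python sketch of what a client without JavaScript gets out of a page: the text nodes already present in the HTML, plus the URLs of the scripts it would have to run to see anything more. The class and function names (`StaticView`, `static_view`) are made up for illustration, and this says nothing about how Twitch's pages are actually built.

```python
from html.parser import HTMLParser


class StaticView(HTMLParser):
    """Collects what a non-JavaScript client can see: text nodes and script URLs."""

    def __init__(self):
        super().__init__()
        self.text_parts = []    # visible text found in the static HTML
        self.script_urls = []   # external scripts a browser would fetch and run
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.script_urls.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style bodies and whitespace-only text nodes.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def static_view(html_text):
    """Return (visible_text, script_urls) for a page's initial HTML."""
    parser = StaticView()
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.text_parts), parser.script_urls
```

Feeding it the raw HTML of a page that is rendered entirely by scripts would give back little or no text and a list of script URLs, which is all that a tool like cURL ever receives.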
I have now on three separate occasions carefully combed through every single JSON blob loaded on https://www.twitch.tv/japan_asmr (and other Twitch channels) without finding any mention of any of the texts displayed underneath the live stream rectangle. For example: "Cool subscriber badge next to your name in chat!" It's simply not there. Not in any of the JSON blobs. I've even downloaded all of the requests the browser (Firefox) made as a ".har" file and searched within this file for "Cool subscriber badge next to your name in chat!" and other strings. Nothing is found.
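One caveat about searching a ".har" file as plain text: the HAR format allows a response body to be stored base64-encoded (marked with `"encoding": "base64"` on the `content` object), and a string inside a JSON body may be stored with escapes (for example `\"` instead of `"`). In both cases a literal text search can miss a string that is actually there. Below is a rough Python sketch of a search that decodes bodies and tries a few escaped spellings. The function names are hypothetical, it is not what was run above, and it is not a claim that this is where Twitch's texts are hiding.

```python
import base64
import html
import json


def decoded_body(content):
    """Return the body text of a HAR 'content' object, undoing base64 if flagged."""
    text = content.get("text") or ""
    if content.get("encoding") == "base64":
        try:
            return base64.b64decode(text).decode("utf-8", errors="replace")
        except ValueError:
            return text  # not valid base64 after all; fall back to the raw text
    return text


def needle_variants(needle):
    """Plain, JSON-escaped, and HTML-escaped spellings of the search string."""
    return {
        needle,
        json.dumps(needle, ensure_ascii=False)[1:-1],  # quotes/backslashes escaped
        json.dumps(needle, ensure_ascii=True)[1:-1],   # non-ASCII as \uXXXX too
        html.escape(needle),                           # &amp; &lt; &quot; etc.
    }


def search_har(har, needle):
    """Return the URLs of entries whose request or response body contains the needle."""
    variants = needle_variants(needle)
    hits = []
    for entry in har.get("log", {}).get("entries", []):
        body = decoded_body(entry.get("response", {}).get("content", {}))
        post = (entry.get("request", {}).get("postData") or {}).get("text") or ""
        if any(v in body or v in post for v in variants):
            hits.append(entry["request"]["url"])
    return hits


def search_har_file(path, needle):
    """Load a .har file (which is JSON) and search it."""
    with open(path, encoding="utf-8") as f:
        return search_har(json.load(f), needle)
```

For example, `search_har_file("twitch.har", "Cool subscriber badge next to your name in chat!")`, where `twitch.har` is a placeholder file name. An empty result would at least rule out these particular encodings; it would not rule out bodies that the browser never saved into the file.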
This was originally just something I wanted to do for fun, but now it's grown to become personal. This is now my "Moby Dick". I need to find this out at any cost! It's driving me insane...
Please note that I have also read through their entire API manual and even asked them about this, and this feature simply doesn't exist in the API. The short "description" or "title" text is not what I'm talking about. For some reason they really don't want you to be able to grab this info, possibly to prevent scraper bots from harvesting e-mail addresses or something. Whatever. That's not why I wanted it; I was just gonna look for specific strings in the bio text and rules and base further actions on that.
This further highlights the problem of archaic tools such as cURL, which are entirely crippled by JavaScript and can only see the initial HTML code returned. I've even asked the cURL developers and they refuse to even consider implementing JavaScript support, and I don't blame them given what a monumental task it would be. cURL would have to be completely restructured for this, since for each page it would have to make possibly hundreds of extra requests to download all the separate little JSON blobs and things that a "modern" webpage now has. (I wish all sites used 100% static HTML like back in the day...)
Please don't suggest that I use a "headless browser", because I've spent countless hours of my life searching and searching for those, and they just don't exist. Not in any meaningful sense, anyway. PhantomJS is long dead, and even while it was alive, it was so badly documented and required so much weird boilerplate JS code to function at even a basic level that it was driving me insane even for the very limited things I was attempting to do with it. Frankly, I've not found a single such project that was anything but vaporware. And even if there were such a thing, one that was super easy to use, super secure, etc., I would still want to know what they are doing to hide this data in Twitch. It interests me on a personal level, beyond simply "getting this done".
They must be using some sort of deliberate encryption or encoding or something to make it impossible to easily locate these strings. I've really tried everything I could possibly think of at this point. I hope that somebody out there is far more skilled than I and can somehow figure this out and explain what they are doing.