I am designing a program that takes the name of a song, searches for it on the web, downloads the HTML of the results page, and figures out the album, artist, year, genre, etc. All of these appear in a normal web search (assuming the song is popular enough), but they are not easy to pick out of the raw HTML.
The downloaded HTML ranges from roughly 200,000 to 280,000 characters. I've heard of the Html Agility Pack, but I don't know how quickly it can extract the text from HTML this long.
Currently I'm only using the following code:
WebClient client = new WebClient();
string songName = "2 sides of the game";
// Escape the query so spaces and special characters survive the URL
client.DownloadFile("https://bing.com/search?q=" + Uri.EscapeDataString(songName), "test.html");
string text = File.ReadAllText("test.html", System.Text.Encoding.UTF8);
Console.WriteLine(text);
if (text.Contains("Album:"))
    Console.WriteLine("success");
Console.WriteLine(text.Length);
When I search the text for "Album:", it returns true as expected, since that string appears literally in the page source. However, searching for something like "Year: 2018" fails, because the label and the value sit in different HTML tags, even though they appear together on the rendered page.
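One quick way to see this without a parser: strip the tags with a regex and collapse the whitespace, so text that was separated by tags becomes adjacent again. This is a fragile sketch (regexes don't parse HTML reliably, and it ignores entities like &amp;amp;), shown only to illustrate why the combined search fails on the raw source:

```csharp
using System.Text.RegularExpressions;

// Label and value in separate tags, as on the search results page
string html = "<div><span>Year:</span> <span>2018</span></div>";

// Searching the raw HTML fails: tags sit between the words
bool rawHit = html.Contains("Year: 2018");      // false

// Replace every tag with a space, then collapse runs of whitespace
string noTags = Regex.Replace(html, "<[^>]+>", " ");
string flat = Regex.Replace(noTags, @"\s+", " ").Trim();

bool cleanedHit = flat.Contains("Year: 2018");  // true
```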
The end goal is to strip all of the HTML from the file and keep only the text. Whether that should be done by massive string splitting or by an external library, I'm not sure.
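For the library route, here is a minimal sketch using the Html Agility Pack (a NuGet package, so this assumes it is installed): `DocumentNode.InnerText` returns the concatenated text of the whole document, and removing script/style nodes first keeps JavaScript and CSS out of the result.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class TextExtractor
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(File.ReadAllText("test.html"));

        // Drop script and style blocks so their contents don't pollute the text
        var junk = doc.DocumentNode.SelectNodes("//script|//style");
        if (junk != null)
            foreach (var node in junk)
                node.Remove();

        // InnerText concatenates all remaining text nodes; collapse whitespace
        // so labels and values from adjacent tags end up next to each other
        string text = Regex.Replace(doc.DocumentNode.InnerText, @"\s+", " ");

        Console.WriteLine(text.Contains("Year: 2018") ? "success" : "not found");
    }
}
```

On documents of a few hundred thousand characters, parsing and extraction like this typically takes milliseconds, so speed should not be the deciding factor.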