Jsoup - HTML Parsing in Sketchware - PART 1

Hello guys!

At some point in your development, you've probably wondered, "How do I get data from a webpage into my app?"

Well, there are popular libraries for parsing HTML content, and Jsoup is among the most widely used in Java.

The library is lightweight, easy to integrate, and self-contained, so it needs no other libraries at runtime.

So, today we will learn how to run a simple parse on some HTML content.

String html = "<!DOCTYPE html>"
    + "<html>"
    + "<head></head>"
    + "<body class=\"bod\">"
    + "<div id=\"yes\">Hello!</div>"
    + "</body>"
    + "</html>";

This is a very simple HTML document.
Before we start, let's not forget to add the necessary imports.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

To parse an HTML string locally, we do this:

<code>
final Document doc = Jsoup.parse(html);
</code>

This creates a Document object that we can now query.
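
If you want to check that the parse worked before going further, here is a minimal sketch. The class name ParseSketch and the method parseSample are made up for this example; only Jsoup.parse, body(), className() and children() come from Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseSketch {
    // Hypothetical helper: builds the sample page from this tutorial and parses it.
    static Document parseSample() {
        String html = "<!DOCTYPE html>"
                + "<html><head></head>"
                + "<body class=\"bod\"><div id=\"yes\">Hello!</div></body>"
                + "</html>";
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        Document doc = parseSample();
        // The class attribute of the body tag.
        System.out.println(doc.body().className());       // prints "bod"
        // Number of direct child elements inside the body tag.
        System.out.println(doc.body().children().size()); // prints 1
    }
}
```

In Sketchware you would normally drop the class wrapper and put the lines from main into an event block, showing the values in a TextView or a toast instead of printing them.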
So, first of all, let's get the body tag of the document.

We can see the body tag has a class name, and it is the first (and only) element with that class.

<code>
String body = doc.getElementsByAttributeValue("class","bod").first().html();
</code>


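This returns the inner HTML of the first element with the class name bod, that is, the markup inside the body tag. If you want the body tag itself included, use outerHtml() instead of html().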
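
One thing to watch out for: first() returns null when nothing matches, so the one-liner above can crash with a NullPointerException on a page that doesn't have the class. Below is a rough sketch of a safer version. The names ClassLookupSketch and innerHtmlByClass are invented for this example, and returning an empty string on no match is just one possible choice.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ClassLookupSketch {
    // Hypothetical helper: inner HTML of the first element whose class
    // attribute equals className, or "" when no element matches.
    static String innerHtmlByClass(Document doc, String className) {
        Element match = doc.getElementsByAttributeValue("class", className).first();
        return match == null ? "" : match.html();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<html><head></head>"
                + "<body class=\"bod\"><div id=\"yes\">Hello!</div></body></html>");

        // Matches the body tag, so this prints the div markup inside it.
        System.out.println(innerHtmlByClass(doc, "bod"));
        // No element has this class, so we get "" instead of a crash.
        System.out.println(innerHtmlByClass(doc, "missing").isEmpty());
    }
}
```

Note that getElementsByAttributeValue compares the whole class attribute, so it would not match class="bod dark". Jsoup also has getElementsByClass("bod"), which matches when bod is one of several class names.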


Want to go further? How about we get plain text, instead? Good idea.

We see there's a div tag with the id yes and some plain text inside. What do you do?

Let's try this:

<code>
String text = doc.getElementsByAttributeValue("id","yes").first().text();
</code>

OR

<code>
String text = doc.select("div").first().text();
</code>

To read the value of an attribute, such as id, we do this:

<code>
String attr = doc.select("div").first().attr("id");
</code>
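
Since ids are meant to be unique on a page, Jsoup also gives you getElementById, and select accepts CSS selectors such as div#yes. Here is a small sketch of both with null checks added. TextAndAttrSketch, textById and attrOfFirst are names invented for this example, not part of Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TextAndAttrSketch {
    // Hypothetical helper: plain text of the element with the given id, or "" if absent.
    static String textById(Document doc, String id) {
        Element el = doc.getElementById(id);
        return el == null ? "" : el.text();
    }

    // Hypothetical helper: attribute value from the first element matching a CSS query.
    // attr() itself returns "" when the element exists but lacks the attribute.
    static String attrOfFirst(Document doc, String cssQuery, String attrName) {
        Element el = doc.select(cssQuery).first();
        return el == null ? "" : el.attr(attrName);
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<html><head></head>"
                + "<body class=\"bod\"><div id=\"yes\">Hello!</div></body></html>");

        System.out.println(textById(doc, "yes"));              // prints "Hello!"
        System.out.println(attrOfFirst(doc, "div#yes", "id")); // prints "yes"
        System.out.println(textById(doc, "nope").isEmpty());   // prints "true"
    }
}
```

The selector div#yes is more specific than plain div, so it keeps working if the page later gets more div tags above the one you want.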


Yes! You did it! Want to step up your game? Wait for the next tutorial.

Please be careful about the content you collect and parse from the internet. Some sites have policies against web scraping, and you could end up facing a lawsuit for trying to siphon third-party information.

Be safe!
