Categorization of text

Some thoughts on how the Saplo API can be used to categorize text. In this case we want to extract text from a webpage and then determine which category the text belongs to.

Content Extraction

To get the best possible result, it's important to insert content that is as clean as possible into the Saplo API. By clean I mean that it contains only relevant text content, so we want to strip out things like HTML menus.

An easy way to do this is to use something like the Readability algorithm, which finds the main text content on a web page and cleans away the rest. It has been ported to several languages.
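To make the idea of "clean content" concrete, here is a minimal sketch using only Python's standard library. It is not the Readability algorithm, just a much cruder stand-in that throws away everything inside tags that usually hold menus and other page chrome. The list of skipped tags and the names `TextExtractor` and `extract_text` are my own assumptions for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as page chrome rather than article text.
# This list is an assumption of the sketch, not something Saplo prescribes.
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class TextExtractor(HTMLParser):
    """Collects text that is not nested inside any of SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string, minus menus and scripts."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<p>Apple released a new phone.</p>"
    "<script>var x = 1;</script></body></html>"
)
clean = extract_text(page)
```

A real Readability port also scores blocks of the page to find the main article, so for production use one of those ports is likely the better choice.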

It's also important that the content you insert is encoded using UTF-8.
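As a small sketch of what that means in practice: if the page you fetched is in another encoding, decode it first and then re-encode it as UTF-8 before sending it on. The helper name `to_utf8` is hypothetical, and the sketch assumes you already know the source encoding (for example from the HTTP headers).

```python
def to_utf8(raw, source_encoding):
    """Decode bytes from a known encoding and re-encode them as UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    """
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Example: a Latin-1 encoded string containing a non-ASCII character.
latin1_bytes = "Malmö".encode("latin-1")
utf8_bytes = to_utf8(latin1_bytes, "latin-1")
```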

Setup

First of all, you need some content to build the categories from, so the API knows what it is comparing against when it looks at a text.

To do this, you will need to create three kinds of "resources" inside the Saplo API: a collection, texts, and groups.

The first step is to create a collection, which is a sort of "container" for a large number of texts. You will get back an ID when you create the collection; save it for later use.

Then you add the texts that are going to make up the categories to the collection, using the collection ID from the previous step. You should also save the IDs returned each time you create a new text.

Ok, so say you've inserted 50 articles or webpages containing content about Apple, and 50 articles about automobiles. Somehow we have to tie these texts together. In the Saplo API, we do this using Groups, which are a kind of subcontainer for texts.
So create the appropriate groups and add the correct texts to them (using the text IDs you got back from the API).
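The bookkeeping in the setup steps above (collection ID, text IDs, group membership) can be sketched with a small in-memory stand-in. To be clear, this is not the Saplo API client and none of these class or method names come from Saplo; they are hypothetical and only model which IDs you need to hold on to and in what order. The real calls and their parameters have to be taken from the Saplo API documentation.

```python
from itertools import count


class InMemoryCategoryStore:
    """Hypothetical local stand-in for the three resources; NOT the Saplo API."""

    def __init__(self):
        self._ids = count(1)   # simple ID generator
        self.collections = {}  # collection_id -> name
        self.texts = {}        # text_id -> (collection_id, body)
        self.groups = {}       # group_id -> {"name": str, "text_ids": set}

    def create_collection(self, name):
        collection_id = next(self._ids)
        self.collections[collection_id] = name
        return collection_id

    def create_text(self, collection_id, body):
        if collection_id not in self.collections:
            raise KeyError(f"unknown collection {collection_id}")
        text_id = next(self._ids)
        self.texts[text_id] = (collection_id, body)
        return text_id

    def create_group(self, name):
        group_id = next(self._ids)
        self.groups[group_id] = {"name": name, "text_ids": set()}
        return group_id

    def add_text_to_group(self, group_id, text_id):
        if text_id not in self.texts:
            raise KeyError(f"unknown text {text_id}")
        self.groups[group_id]["text_ids"].add(text_id)


store = InMemoryCategoryStore()

# 1. Create the collection and keep its ID.
collection_id = store.create_collection("training articles")

# 2. Add the category texts to the collection and keep their IDs.
apple_ids = [store.create_text(collection_id, body) for body in
             ["Apple unveils a new laptop.", "Apple updates its phone software."]]
car_ids = [store.create_text(collection_id, body) for body in
           ["The new sedan has a hybrid engine.", "Electric cars are selling well."]]

# 3. Create one group per category and attach the matching texts.
apple_group = store.create_group("Apple")
car_group = store.create_group("Automobiles")
for text_id in apple_ids:
    store.add_text_to_group(apple_group, text_id)
for text_id in car_ids:
    store.add_text_to_group(car_group, text_id)
```

The point of the sketch is the order of operations: every later step needs an ID returned by an earlier one, so store them somewhere durable.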

Basic Use

By now you have the framework set up to do categorization, and you want to categorize a web page or a site.

There are only two steps involved in doing this.

First you add the content that you want to categorize as a text (add it to the collection you created in the setup).
Then you ask the API which Groups are most similar to that single text. You will get back a list of your previously created groups, along with a measure of relevance for each Group against the Text.
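Once you have that list, picking a category is mostly a matter of taking the most relevant group and deciding whether it is relevant enough. Here is a minimal sketch, assuming the response has been turned into a list of dictionaries with `name` and `relevance` keys and that relevance is a number between 0 and 1. Those field names, the scale, the 0.5 threshold and the example data are all assumptions for illustration, not the actual response format.

```python
def best_category(ranked_groups, min_relevance=0.5):
    """Return the name of the most relevant group, or None if nothing qualifies.

    ranked_groups: list of {"name": str, "relevance": float} dictionaries
    (an assumed shape, not the documented Saplo response).
    """
    if not ranked_groups:
        return None
    top = max(ranked_groups, key=lambda group: group["relevance"])
    return top["name"] if top["relevance"] >= min_relevance else None


# Made-up response data, for illustration only.
example_response = [
    {"name": "Automobiles", "relevance": 0.12},
    {"name": "Apple", "relevance": 0.81},
]
category = best_category(example_response)
```

Returning `None` below the threshold lets you treat a text as uncategorized instead of forcing it into the least bad group.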