How to do text classification with JavaScript

Text classification and machine learning with JavaScript: Natural.js, Brain.js, TensorFlow.js.

A few months ago I went looking for information on natural language processing with JavaScript and found very little. Most of the tutorials I stumbled upon were written for Python. I'm writing this article in the hope that it helps someone do the same with JavaScript, or at least try to. The JavaScript ecosystem is large, yet machine learning is mostly done in Python. For some custom (complicated) cases, you may decide not to use JavaScript at all, and I'll explain why.

I want to point out that I'm not a machine learning engineer. I'll cover simple cases with no deep explanations of the underlying algorithms.

For manageable cases, you can try JS packages that do the classification for you. For others, if you understand ML concepts, you can create custom models with TensorFlow.js.

My case seemed simple. I wanted to classify potential business problems (opportunities) for my Reddit advanced search tool. I'll tell you how it went once we've covered the tools. Let's start with simple cases.

Natural.js

It's a package for Node.js that helps you deal with natural language. It has many useful built-in helpers. For example, it can do sentiment analysis out of the box, without any setup. Let's install it:

$ npm install --save natural
Easy sentiment analysis, right?
const { SentimentAnalyzer, PorterStemmer } = require('natural');

const analyzer = new SentimentAnalyzer("English", PorterStemmer, "afinn");
const result = analyzer.getSentiment(["I", "love", "cakes"]);

console.log(result); // 0.66
// values greater than 0 indicate a positive sentiment
// values smaller than 0 indicate a negative sentiment

Yes, it's easy. PorterStemmer is a transformation function that converts words to their stems, that is, to their root form, simply put. We pass an array of words to the getSentiment function, but we could use the built-in tokenizers to produce that array automatically.

Where is the promised text classification, Lebowski?

I wanted to show the simplicity of usage, without even training some complex algorithms. Now let's see how it deals with text classification.

The package supports the Naive Bayes classifier and logistic regression. They work differently, so try each one and see what fits your case better.

const { BayesClassifier } = require('natural');

const classifier = new BayesClassifier();

classifier.addDocument('buy our limited offer', 'spam');
classifier.addDocument('grow your audience with us', 'spam');
classifier.addDocument('our company provides a great deal', 'spam');
classifier.addDocument('I like to read books and watch movies', 'regular');
classifier.addDocument('My friend likes to walk near the mall', 'regular');
classifier.addDocument('Pizza was awesome yesterday', 'regular');

classifier.train();

console.log(classifier.classify('we would like to propose our offer')); // spam
console.log(classifier.classify('I\'m feeling tired and want to watch something')); // regular

Usually, you need a lot of examples. With only a few, any method you choose (this library or a custom model) will give poor results. Pay close attention to your data: it's the major element in text classification. Maybe Natural.js covers your case and you can stop reading here. If you need a more custom setup (if you think so, review your data again), read on.

Brain.js

This library helps you build neural networks. Natural works with simpler algorithms. Neural networks, simply put, are many simple units that work as one. They're loosely modeled on the behavior of biological neurons, which are great at recognizing patterns.

Now you can customize the algorithms. Specifically, you can build your own neural network architectures: specify how many layers you need, the activation functions, the learning rate, and other parameters. This is where it gets trickier. There are no golden rules for building neural net architectures; the process varies greatly from one use case to another. The default options are enough for cases like classifying a color by its RGB params:

const brain = require('brain.js');

// Build a default neural net
const net = new brain.NeuralNetwork();

// This is where we specify our data: input and the result(output)
// the data is an array of examples(input and output).
// And then the network trains on them.
net.train([
  // we tell it: if "r" from RGB scheme is 0.03, and "g" is 0.7
  // then the output should be "black"
  { input: { r: 0.03, g: 0.7 }, output: { black: 1 } },
    
  // notice that we skip some values from RGB, in this case we
  // missed "g"
  { input: { r: 0.16, b: 0.2 }, output: { white: 1 } },
    
  // here we point out all the RGB values
  { input: { r: 0.5, g: 0.5, b: 1.0 }, output: { white: 1 } },
]);

// This is how we run the network to get a prediction
const output = net.run({ r: 1, g: 0.4, b: 0 }); // { white: 0.81, black: 0.18 }

It's a powerful way to build such a network without understanding the underlying concepts or data normalization. Just provide a few examples and you're done. In reality, though, you need more examples for better precision.
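The Brain.js documentation describes training inputs as numbers between 0 and 1, which is why the RGB values above are fractions rather than 0..255 channel values. A minimal sketch of that scaling step, assuming 8-bit channels; normalizeRgb is a hypothetical helper, not part of Brain.js:

```javascript
// Hypothetical helper: scale 8-bit RGB channels (0..255) into the 0..1 range.
function normalizeRgb({ r, g, b }) {
  return { r: r / 255, g: g / 255, b: b / 255 };
}

const orange = normalizeRgb({ r: 255, g: 102, b: 0 });
console.log(orange); // { r: 1, g: 0.4, b: 0 }
```

The result has the same shape as the object passed to net.run in the example above.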

Transforming text to numeric vectors

Now we're talking about data normalization. For text classification, we need to transform the text into numeric values ourselves, because Brain.js doesn't have a custom data transformation flow for regular neural nets (it does for LSTM, so you may try that). Why convert strings to numbers? Training a neural network is a long series of math calculations, and those require numbers, not other data types. You could pass raw strings and they would be converted to some numeric representation, but probably not in the format you (and the algorithms) want. What these "algorithms" do is figure out patterns in the input to build a function that calculates the output from the input. So how you do this transformation matters.

The first option you might propose is to convert every character to its position in the alphabet. For instance, "a" is 0, "b" is 1, "c" is 2, and so on, which gives 26 possible values per character. The word "car" is then represented as [2, 0, 17]. If your task is to classify text with many words or sentences, the input becomes 2D (an array of words, each an array of characters), which isn't fine, because the input should be 1D. We might flatten the 2D array, but then it gets delicate: the text "I want apples" becomes "iwantapples" (and then a numeric 1D vector). It may work, yet we can't be sure the network will recognize a pattern there and classify correctly.
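The character-to-number mapping described above can be sketched in a few lines of plain JavaScript. encodeChars is a hypothetical helper name, and dropping everything outside a-z is an assumption made for brevity:

```javascript
// Map each letter to its zero-based position in the English alphabet.
// Characters outside a-z (spaces, digits, punctuation) are dropped,
// which also flattens a sentence into one 1D array.
function encodeChars(text) {
  const base = 'a'.charCodeAt(0);
  return [...text.toLowerCase()]
    .filter((ch) => ch >= 'a' && ch <= 'z')
    .map((ch) => ch.charCodeAt(0) - base);
}

console.log(encodeChars('car')); // [ 2, 0, 17 ]
console.log(encodeChars('bar')); // [ 1, 0, 17 ]
```

The two outputs differ by a single number, which illustrates why a network can treat unrelated words as near-identical inputs.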

The big problem with this approach is that the net sees every character independently, not as part of a word. Thus, "car" is [2, 0, 17], and the resulting function (a set of functions that process the input) may "think" it's almost the same as "bar", which is [1, 0, 17]. It doesn't think, of course, but the pattern says so. It's therefore difficult to retrieve any context, because every character is perceived independently.

The second option is to do the same, but for words. In reality, we retrieve context mainly from words, not from separate characters. This approach also simplifies the calculations: we don't need to convert 2D input into 1D, and the neural network gets fewer numbers to process, which is a performance boost. To convert words to numbers, we need to decide which number to assign to each word. You can take the texts you'll be training on, tokenize them into words (omitting punctuation because it adds little context), and build a dictionary of these words in which every word gets an ordinal number. It's like adding words to a Set, where a word's number is the order in which it was added. E.g. if I have the text "I want apples.", my dictionary is ["i", "want", "apples"], where "i" is assigned 0, "want" 1, and "apples" 2.
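A minimal sketch of such a dictionary in plain JavaScript follows. The function names are hypothetical, the tokenizer is a deliberately naive regular expression, and using -1 for unknown words is just one possible convention:

```javascript
// Split text into lowercase word tokens, dropping punctuation.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Assign every new word the next free index, in order of first appearance.
function buildVocabulary(texts) {
  const vocabulary = new Map();
  for (const text of texts) {
    for (const word of tokenize(text)) {
      if (!vocabulary.has(word)) vocabulary.set(word, vocabulary.size);
    }
  }
  return vocabulary;
}

// Replace each known word with its index; -1 marks out-of-vocabulary words.
function encodeWords(text, vocabulary) {
  return tokenize(text).map((word) => (vocabulary.has(word) ? vocabulary.get(word) : -1));
}

const vocabulary = buildVocabulary(['I want apples.']);
console.log([...vocabulary.keys()]); // [ 'i', 'want', 'apples' ]
console.log(encodeWords('I want apples', vocabulary)); // [ 0, 1, 2 ]
```

In practice you would build the vocabulary from the whole training set and decide up front how to handle words that only show up at prediction time.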

We can optimize this approach by also stemming words to their root form, e.g. "apples" becomes "apple". Unless your task is to distinguish singular from plural forms, the net doesn't need to know which one it is; it's better to have a single numeric representation for the word abstraction: apple ("apple", "apples").

It's the simplest method to vectorize text, though it also has problems. When you need your neural net to "figure out" the context from a set of words, it's difficult, because in the example above "I" and "want" are numeric neighbors (0 and 1 respectively) although they aren't similar and mean different things. Conversely, "car" and "automobile" mean the same but may be represented as 14 and 8233 with this approach. Thus, your model may produce different results depending on which synonyms your examples contain.

The third option is to use pre-generated vectors: ones produced by processing a lot of text and deriving which words are similar and which are different. For example, the vector for "car" may be [0.45, 0.78, 0.97, 0.34, 0.87], and for "automobile" it may be [0.49, 0.73, 0.98, 0.33, 0.88]. As you noticed, every word gets a vector, not a single number, so you get a 2D array for the whole text. I'd suggest you go with pre-generated vectors such as GloVe.
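One common way to check that two word vectors are "similar" is cosine similarity. The sketch below applies it to the two illustrative vectors from the paragraph above (they are made-up numbers, not real GloVe values); cosineSimilarity is a hypothetical helper written for this example:

```javascript
// Cosine similarity: close to 1 means the vectors point the same way,
// 0 means they are orthogonal (unrelated).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Illustrative vectors from the text, not real GloVe values.
const car = [0.45, 0.78, 0.97, 0.34, 0.87];
const automobile = [0.49, 0.73, 0.98, 0.33, 0.88];
const similarity = cosineSimilarity(car, automobile);
console.log(similarity > 0.99); // true
```

With word indices such as 14 and 8233 there is no comparable notion of closeness, which is the main advantage of pre-generated vectors.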

Getting back to Brain.js

Now that you know how to convert strings to vectors, you can use the library to help you. It has various types of predefined neural networks. The one we saw before is a feedforward neural net with backpropagation. This is where things get delicate again: choosing the right network type. A feedforward net is a simple one that takes an input, does some calculations and transformations, and returns the results. It sees every input independently and has no memory, which means it can't derive context from a sequence of words. If your task requires that, you'd better choose a recurrent neural network such as RNN or LSTM (see the Brain.js details on them).

TensorFlow.js

This is the path to take when you've decided you need a more custom setup. TensorFlow.js is the JavaScript version of a powerful machine learning framework for Python. It allows you to build any model or use ones already created by the community, although there aren't many of those. And its functionality for converting Python models to JS ones and vice versa doesn't work well enough yet.

The code may look like this:

const tf = require('@tensorflow/tfjs-node');

const data = {
    // assume we already have vector representations of the text examples
    inputs: vectorRepresentations,
    // imagine we have 3 classes (0, 1, 2), one label per example
    outputs: [0, 0, 2, 1, 2, 1, 0, 1],
};

// tensors are the multi-dimensional arrays TensorFlow works with
// internally
const inputTensors = tf.tensor(data.inputs);
const outputTensors = tf.tensor(data.outputs);

const model = tf.sequential();

// 1st layer: a 1d convolutional network
model.add(tf.layers.conv1d({
	filters: 100,
	kernelSize: 3,
	strides: 1,
	activation: 'relu',
	padding: 'valid',
	inputShape: [MAX_WORDS_LENGTH, GLOVE_VECTOR_DIMENSIONS],
}));

// transform 2d input into 1d
model.add(tf.layers.globalMaxPool1d({}));

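// the final layer: one neuron per class (3 here), softmax turns
// the outputs into class probabilities
model.add(tf.layers.dense({ units: 3, activation: 'softmax' }));

// training configuration: optimizer, loss and metrics; see the TF docs for more.
// sparseCategoricalCrossentropy fits integer class labels such as 0, 1, 2
model.compile({
    optimizer: tf.train.adam(LEARNING_RATE),
    loss: 'sparseCategoricalCrossentropy',
    metrics: ['accuracy'],
});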

// print the model architecture
model.summary();

// train the model (`await` needs to run inside an async function)
await model.fit(inputTensors, outputTensors, {
    // the default size, how many inputs to process per time
    batchSize: 32,
    
    // how many times to "process", simply put
    epochs: EPOCHS,
    
    // the fraction of the inputs to be in the validation set:
    // the set, which isn't trained on, but participates in calculating
    // the model's metrics such as accuracy and loss
    validationSplit: 0.2,
    
    // shuffle the training examples before every epoch
    shuffle: true,
});

// save the model to load in the future and run classifications
await model.save('file://./data/models/myFirstModel');

Here we built a model to do text classification for 3 pseudo-classes (0, 1, 2). We used a 1D convolutional layer as the first layer. TensorFlow allows you to specify any number of layers, set the training epochs and the validation split, and choose different ML algorithms, activation functions for every layer, and many other options. However, we need to know how to build ML models. If we don't, we may add anything, tune parameters, and still not get good results.
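The inputShape in the code above implies that every example is a matrix of exactly MAX_WORDS_LENGTH rows by GLOVE_VECTOR_DIMENSIONS columns, so texts of different lengths have to be truncated or padded first. A minimal sketch of that step, assuming zero vectors as padding; padWordVectors and the tiny sizes are hypothetical, chosen only to keep the example readable:

```javascript
// Hypothetical sizes; the real values depend on your data and GloVe file.
const MAX_WORDS_LENGTH = 4;
const GLOVE_VECTOR_DIMENSIONS = 3;

// Truncate or zero-pad a list of word vectors to exactly maxWords rows.
function padWordVectors(wordVectors, maxWords, dimensions) {
  const padded = wordVectors.slice(0, maxWords).map((vector) => [...vector]);
  while (padded.length < maxWords) {
    padded.push(new Array(dimensions).fill(0));
  }
  return padded;
}

const example = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]];
const fixed = padWordVectors(example, MAX_WORDS_LENGTH, GLOVE_VECTOR_DIMENSIONS);
console.log(fixed.length, fixed[0].length); // 4 3
```

An array of such fixed-size matrices, one per text, is what would play the role of vectorRepresentations in the snippet above.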

I went to TensorFlow.js for more customizability but spent months adjusting a lot of stuff and didn't get great results. I learned many things along the way, but still, I'm not an ML engineer, so it's better (faster) to use models built by professionals than to reinvent the wheel. But if it's for fun, why not! So let's go through the code I wrote.

I chose this architecture for its performance: convolutional networks are fast at processing text, and they also process input with some context. They're mainly used in computer vision because they process input matrices, not just 1D arrays of numbers. For example, given a 100x100 px image, a convolutional network may process a 5x5 pixel window at a time, so noise and small details can still be classified correctly. For text, it's almost the same: we want to take several words at a time rather than process each one independently, which simplifies the model's job of recognizing patterns.

I chose GloVe vector representations, so my input data was a 2D array of numbers in which every subarray was a word representation. The kernelSize parameter of a convolutional layer defines the "sliding window", like those 5x5 pixels processed at a time. In my case, I set kernelSize to 3, which means the network processes 3 vectors (3 words) at a time. The filters param sets how many convolution filters (feature detectors) the layer learns. strides sets how many steps the "sliding window" moves at once. For example, for the text "I want to eat apples tomorrow", the first window is ["i", "want", "to"], the second is ["want", "to", "eat"], the third is ["to", "eat", "apples"], and so on. So it moves one word to the right at a time.
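The way the window moves over the words can be sketched in plain JavaScript. slidingWindows is a hypothetical helper, not a TensorFlow.js function; it works on word strings for readability, whereas the real layer slides over the word vectors:

```javascript
// Return every window of `kernelSize` consecutive tokens, moving by `strides`.
// As with 'valid' padding, windows that would run past the end are skipped.
function slidingWindows(tokens, kernelSize, strides = 1) {
  const windows = [];
  for (let start = 0; start + kernelSize <= tokens.length; start += strides) {
    windows.push(tokens.slice(start, start + kernelSize));
  }
  return windows;
}

const tokens = ['i', 'want', 'to', 'eat', 'apples', 'tomorrow'];
const windows = slidingWindows(tokens, 3, 1);
console.log(windows);
// [ ['i','want','to'], ['want','to','eat'], ['to','eat','apples'], ['eat','apples','tomorrow'] ]
console.log(windows.length); // 4, i.e. (6 - 3) / 1 + 1
```

With a stride of 2 the same sentence would yield only two windows, so larger strides trade context overlap for fewer computations.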

General learnings

I spent some time with Natural.js, then Brain.js, and TensorFlow.js. I went to the last one for custom configuration and spent a lot of time building custom models. It would have been better to use an already built model for text classification. However, I didn't find a good way to convert Python TensorFlow models into JavaScript ones, which is why I eventually switched to a Python setup with HuggingFace. But my task wasn't so straightforward: I wanted to classify people's potential problems and pains, where someone hates using something or complains about things.

There are some things I learned while building custom models with TensorFlow.js that I wish I had known earlier. Write your experiments in a logbook. You'll be building various models with various hyperparameters, and it becomes difficult to recall what worked well and what didn't. Also, don't forget about the test set (assuming you have a validation one too).

There are a lot of things to mention about building ML models. Here are some I highlighted in my logbook. I hope they save someone time by narrowing the search when troubleshooting.

When to stop training. Stop when the validation loss starts increasing. The validation loss should be similar to, but slightly higher than, the training loss. If it's lower than or almost equal to the training loss, the model requires more training. If the training loss keeps decreasing without an increase in the validation loss, again, keep training.
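The "stop when validation loss starts increasing" rule is usually applied with some patience, so that a single noisy epoch doesn't end training. Below is a minimal sketch of that check over a recorded history of validation losses; findStoppingEpoch and the patience value are assumptions for illustration. TensorFlow.js also provides a tf.callbacks.earlyStopping callback that can be passed to model.fit for the same purpose.

```javascript
// Return the index of the best (lowest-loss) epoch once the validation loss
// has failed to improve for `patience` epochs in a row, or -1 if training
// should continue.
function findStoppingEpoch(validationLosses, patience = 2) {
  let bestEpoch = 0;
  for (let epoch = 1; epoch < validationLosses.length; epoch += 1) {
    if (validationLosses[epoch] < validationLosses[bestEpoch]) {
      bestEpoch = epoch;
    } else if (epoch - bestEpoch >= patience) {
      return bestEpoch;
    }
  }
  return -1;
}

console.log(findStoppingEpoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.8])); // 2
console.log(findStoppingEpoch([0.9, 0.8, 0.7, 0.6])); // -1
```

The returned index is the epoch whose weights you would want to keep, which is one more reason to log every experiment.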

You have 1.0 accuracy. In most cases, if you have 100% training accuracy, you've probably greatly overfitted. Or the model recognized a "false" pattern in your data.

Overfitting? A big topic. Here's some reference (not mine, but I can't find the source):

If validation loss >> training loss you can call it overfitting.
If validation loss  > training loss you can call it some overfitting.
If validation loss  < training loss you can call it some underfitting.
If validation loss << training loss you can call it underfitting.

A validation loss higher than the training loss means overfitting: the model learned patterns that happen to be true in the training data but don't hold in real-world data.

If your model is too powerful (it has too many parameters for the amount of training data, see model.summary()), review it and simplify, because some of my models memorized the data and thus greatly overfitted.

Another sign of overfitting is that your validation loss is increasing. Loss is measured more precisely than accuracy: it's more sensitive to noisy predictions because it isn't squashed by sigmoids or thresholds. Intuitively, you can imagine the network being too sure about its output when it's wrong, so a random misclassification produces a value far away from the threshold.

Accuracy or loss fluctuates.

Some portion of examples is classified randomly, which produces fluctuations, because the number of correct random guesses always fluctuates (imagine the accuracy when a coin should always return "heads"). Basically, sensitivity to noise (when classification produces random results) is a common definition of overfitting.
The training loss reported for an epoch is usually averaged over all the batches of the training set. The validation loss is computed on the validation set, which is typically much smaller, so it is normal for it to be noisier.

Take care of your batch size. Sometimes it needs to be adjusted:

Smaller batch sizes give noisy gradients, but they converge faster because you get more updates per epoch. If your batch size is 1, you will have N updates per epoch. If it is N, you will have only 1 update per epoch. On the other hand, larger batch sizes give a more informative gradient, but they converge slower and increase the computational cost of every update.
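The relationship between batch size and the number of updates is simple arithmetic. As a small sketch, assuming N = 1000 training examples (a made-up number) and a hypothetical updatesPerEpoch helper:

```javascript
// Number of weight updates in one epoch: one per batch, where the last
// batch may be smaller than the others.
function updatesPerEpoch(exampleCount, batchSize) {
  return Math.ceil(exampleCount / batchSize);
}

console.log(updatesPerEpoch(1000, 1));    // 1000
console.log(updatesPerEpoch(1000, 32));   // 32
console.log(updatesPerEpoch(1000, 1000)); // 1
```

Keep in mind that with validationSplit: 0.2 only 80% of the examples count toward N, since the rest is held out for validation.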
