Sunday, December 14, 2025

Creating a Naive Bayes Text Classifier in JAVA


In previous articles we discussed the theoretical background of the Naive Bayes Text Classifier and the importance of using Feature Selection techniques in Text Classification. In this article we are going to put everything together and build a simple implementation of the Naive Bayes text classification algorithm in JAVA. The code of the classifier is open-sourced (under the GPL v3 license) and you can download it from Github.

Update: The Datumbox Machine Learning Framework is now open-source and free to download. Check out the package com.datumbox.framework.machinelearning.classification to see the implementation of the Naive Bayes Classifier in Java.

Naive Bayes Java Implementation

The code is written in JAVA and can be downloaded directly from Github. It is licensed under GPLv3, so feel free to use it, modify it and redistribute it freely.

The Text Classifier implements the Multinomial Naive Bayes model along with the Chisquare Feature Selection algorithm. All the theoretical details of how both techniques work are covered in previous articles, and detailed javadoc comments describing the implementation can be found in the source code. Thus in this section I will focus on a high-level description of the architecture of the classifier.

1. NaiveBayes Class

This is the main part of the Text Classifier. It implements methods such as train() and predict() which are responsible for training a classifier and using it for predictions. It should be noted that this class is also responsible for calling the appropriate external methods to preprocess and tokenize the document before training/prediction.
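To make the architecture more concrete, here is a rough sketch of the public surface of such a class. The method names match the usage examples later in this article, but the bodies and internals are placeholders rather than the actual Github source (the NaiveBayesKnowledgeBase type is sketched in the next section):

   import java.util.Map;

   public class NaiveBayes {
      private NaiveBayesKnowledgeBase knowledgeBase;
      private double chisquareCriticalValue = 6.63; //feature selection cutoff (illustrative default)

      public NaiveBayes() { }

      public NaiveBayes(NaiveBayesKnowledgeBase knowledgeBase) {
         this.knowledgeBase = knowledgeBase; //reuse a previously trained model
      }

      public void setChisquareCriticalValue(double value) {
         this.chisquareCriticalValue = value;
      }

      public NaiveBayesKnowledgeBase getKnowledgeBase() {
         return knowledgeBase;
      }

      //trainingDataset maps each category to the example texts of that category
      public void train(Map<String, String[]> trainingDataset) {
         //tokenize the examples, run feature selection, estimate the priors
         //and likelihoods, and store everything in the knowledge base
      }

      public String predict(String text) {
         //tokenize the text and return the category with the highest score
         return null;
      }
   }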

2. NaiveBayesKnowledgeBase Object

The output of training is a NaiveBayesKnowledgeBase Object which stores all the necessary information and probabilities that are used by the Naive Bayes Classifier.
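The exact fields differ in the actual source, but a knowledge base for a Multinomial model essentially needs to hold the dataset counts, the log-priors and the log-likelihoods; the sketch below is purely illustrative:

   import java.util.HashMap;
   import java.util.Map;

   public class NaiveBayesKnowledgeBase {
      public int n; //total number of training observations
      public int d; //number of categories
      public Map<String, Double> logPriors = new HashMap<>(); //category -> logP(c)
      public Map<String, Map<String, Double>> logLikelihoods = new HashMap<>(); //feature -> category -> logP(x|c)
   }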

3. Document Object

Both the training and the prediction texts in the implementation are internally stored as Document Objects. The Document Object stores all the tokens (words) of the document, their statistics and the target classification of the document.
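In other words it is a bag-of-words representation of a single text; a minimal version might look like this (the field names are assumptions for illustration):

   import java.util.HashMap;
   import java.util.Map;

   public class Document {
      public Map<String, Integer> tokens = new HashMap<>(); //word -> occurrence count
      public String category; //target class; null for texts that are not yet classified
   }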

4. FeatureStats Object

The FeatureStats Object stores several statistics that are generated during the Feature Extraction phase. Such statistics are the joint counts of Features and Class (from which the joint probabilities and likelihoods are estimated), the Class counts (from which the priors are evaluated if none are given as input) and the total number of observations used for training.
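A minimal sketch of such a container, with assumed field names, could be:

   import java.util.HashMap;
   import java.util.Map;

   public class FeatureStats {
      public int n; //total number of training observations
      public Map<String, Integer> categoryCounts = new HashMap<>(); //category -> #documents, used for the priors
      public Map<String, Map<String, Integer>> featureCategoryJointCount = new HashMap<>(); //feature -> category -> joint count
   }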

5. FeatureExtraction Class

This is the class which is responsible for performing feature extraction. It should be noted that since this class internally calculates several of the statistics that are actually required by the classification algorithm at a later stage, all these stats are cached and returned in a FeatureStats Object to avoid their recalculation.
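For reference, the Chisquare score that drives the feature selection can be computed directly from the 2x2 contingency table described by the FeatureStats counts. The helper below is a minimal sketch of that formula, not the actual method from the Github source:

   //n11 = documents of the category that contain the feature, n10 = documents
   //of the other categories that contain it; n01/n00 are the same counts for
   //documents that do not contain the feature
   public static double chisquare(double n11, double n10, double n01, double n00) {
      double n = n11 + n10 + n01 + n00; //total observations
      double numerator = n * Math.pow(n11 * n00 - n10 * n01, 2);
      double denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00);
      return numerator / denominator;
   }

A feature is kept only if its score exceeds the configured critical value (for example 6.63 for a 0.01 p-value with one degree of freedom, as in the example below).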

6. TextTokenizer Class

This is a simple text tokenization class, responsible for preprocessing, clearing and tokenizing the original texts and converting them into Document objects.
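A minimal pipeline of this kind could look as follows, reusing the Document sketch from above (the actual implementation differs in its details):

   import java.util.Locale;

   public class TextTokenizer {
      public static String preprocess(String text) {
         //lowercase and replace anything that is not a letter with a space
         return text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\s]", " ");
      }

      public static String[] extractKeywords(String text) {
         return text.trim().split("\\s+"); //single-word keywords
      }

      public static Document tokenize(String text) {
         Document doc = new Document();
         for(String keyword : extractKeywords(preprocess(text))) {
            if(keyword.isEmpty()) continue;
            doc.tokens.merge(keyword, 1, Integer::sum); //count occurrences
         }
         return doc;
      }
   }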

Using the NaiveBayes JAVA Class

In the NaiveBayesExample class you can find examples of using the NaiveBayes Class. The goal of the sample code is to present an example which trains a simple Naive Bayes Classifier in order to detect the Language of a text. To train the classifier, initially we provide the paths of the training datasets in a HashMap and then we load their contents.


   //map of dataset files
   Map<String, URL> trainingFiles = new HashMap<>();
   trainingFiles.put("English", NaiveBayesExample.class.getResource("/datasets/training.language.en.txt"));
   trainingFiles.put("French", NaiveBayesExample.class.getResource("/datasets/training.language.fr.txt"));
   trainingFiles.put("German", NaiveBayesExample.class.getResource("/datasets/training.language.de.txt"));

   //loading examples in memory
   Map<String, String[]> trainingExamples = new HashMap<>();
   for(Map.Entry<String, URL> entry : trainingFiles.entrySet()) {
      trainingExamples.put(entry.getKey(), readLines(entry.getValue()));
   }
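The readLines() helper is a small utility of the example class; assuming one training example per line in each dataset file, a minimal version could be (using BufferedReader, InputStreamReader, URL and StandardCharsets from the java.io, java.net and java.nio packages):

   //reads the resource behind the URL and returns one training example per line
   public static String[] readLines(URL url) throws IOException {
      try (BufferedReader br = new BufferedReader(
            new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
         return br.lines().toArray(String[]::new);
      }
   }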

The NaiveBayes classifier is trained by passing the data to it. Once the training is completed the NaiveBayesKnowledgeBase Object is stored for later use.


   //train classifier
   NaiveBayes nb = new NaiveBayes();
   nb.setChisquareCriticalValue(6.63); //0.01 pvalue
   nb.train(trainingExamples);
      
   //get trained classifier
   NaiveBayesKnowledgeBase knowledgeBase = nb.getKnowledgeBase();

Finally to use the classifier and predict the classes of new examples, all you need to do is initialize a new classifier by passing the NaiveBayesKnowledgeBase Object which you acquired earlier from training. Then by simply calling the predict() method you get the predicted class of the document.


   //test classifier
   nb = new NaiveBayes(knowledgeBase);
   String exampleEn = "I am English";
   String outputEn = nb.predict(exampleEn);
   System.out.format("The sentence \"%s\" was classified as \"%s\".%n", exampleEn, outputEn);

Necessary Expansions

This particular JAVA implementation should not be considered a complete, ready-to-use solution for sophisticated text classification problems. Here are some of the important expansions that could be done:

1. Keyword Extraction:

Although utilizing single key phrases will be ample for easy issues corresponding to Language Detection, different extra difficult issues require the extraction of n-grams. Thus one can both implement a extra subtle textual content extraction algorithm by updating the TextTokenizer.extractKeywords() technique or use Datumbox’s Key phrase Extraction API perform to get all of the n-grams (key phrase mixtures) of the doc.

2. Text Preprocessing:

Before using a classifier it is usually necessary to preprocess the document in order to remove unnecessary characters/parts. Even though the current implementation performs limited preprocessing by using the TextTokenizer.preprocess() method, when it comes to analyzing HTML pages things become trickier. One can either simply trim out the HTML tags and keep only the plain text of the document, or resort to more sophisticated Machine Learning techniques that detect the main text of the page and remove content which belongs to footers, headers, menus etc. For the latter you can use Datumbox's Text Extraction API function.
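The first option, trimming out the tags, can be sketched with a couple of regular expressions; note that this is a crude approach and a proper HTML parser is preferable for production use:

   //crude HTML-to-text conversion: drop scripts/styles with their content,
   //strip the remaining tags and collapse the whitespace
   public static String stripHtml(String html) {
      return html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                 .replaceAll("(?s)<[^>]+>", " ")
                 .replaceAll("\\s+", " ")
                 .trim();
   }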

3. Additional Naive Bayes Models:

The current classifier implements the Multinomial Naive Bayes classifier; nevertheless, as we discussed in a previous article about Sentiment Analysis, different classification problems require different models. In some a Binarized version of the algorithm would be more appropriate, while in others the Bernoulli Model will provide much better results. Use this implementation as a starting point and follow the instructions of the Naive Bayes Tutorial to expand the model.
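The Binarized variant, for instance, differs from the Multinomial one only in that every word is counted at most once per document, so the change is confined to the token statistics; a sketch:

   import java.util.HashMap;
   import java.util.Map;

   //clips the per-document counts to 1, turning frequency into presence/absence
   public static Map<String, Integer> binarize(Map<String, Integer> tokenCounts) {
      Map<String, Integer> binary = new HashMap<>();
      for(String token : tokenCounts.keySet()) {
         binary.put(token, 1);
      }
      return binary;
   }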

4. Additional Feature Selection Methods:

This implementation uses the Chisquare feature selection algorithm to select the most appropriate features for the classification. As we saw in a previous article, the Chisquare feature selection method is a good technique which relies on statistics to select the appropriate features; nevertheless, it tends to give higher scores to rare features that only appear in one of the categories. Improvements can be made by removing noisy/rare features before proceeding to feature selection, or by implementing additional methods such as the Mutual Information that we discussed in the aforementioned article.
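For illustration, the (expected) Mutual Information score can be computed from the same 2x2 contingency counts as the Chisquare score sketched earlier; this is one common formulation, not code from the Github source:

   //MI(t,c) = sum over the 4 cells of (Nij/N)*log2(N*Nij/(Ni.*N.j))
   public static double mutualInformation(double n11, double n10, double n01, double n00) {
      double n = n11 + n10 + n01 + n00; //total observations
      return miTerm(n11, (n11 + n10) * (n11 + n01), n)
           + miTerm(n01, (n01 + n00) * (n11 + n01), n)
           + miTerm(n10, (n11 + n10) * (n10 + n00), n)
           + miTerm(n00, (n01 + n00) * (n10 + n00), n);
   }

   private static double miTerm(double joint, double marginals, double n) {
      if(joint == 0) return 0.0; //0*log(0) is taken as 0
      return (joint / n) * (Math.log((n * joint) / marginals) / Math.log(2));
   }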

5. Performance Optimization:

In this particular implementation it was more important to improve the readability of the code than to perform micro-optimizations. Despite the fact that such optimizations make the code uglier and harder to read/maintain, they are often necessary since many loops in this algorithm are executed millions of times during training and testing. This implementation can be a great starting point for developing your own tuned version.

Almost there… Final Notes!

To get a good understanding of how this implementation works you are strongly advised to read the two previous articles about the Naive Bayes Classifier and Feature Selection. You will get insights on the theoretical background of the methods, and it will make parts of the algorithm/code clearer.

We should note that despite being simple, fast and most of the time "quite accurate", Naive Bayes is also "Naive" because it makes the assumption of conditional independence of the features. Since this assumption is almost never met in Text Classification problems, Naive Bayes is almost never the best performing classifier. In the Datumbox API, some expansions of the standard Naive Bayes classifier are used only for simple problems such as Language Detection. For more complicated text classification problems, more advanced techniques such as the Max Entropy classifier are necessary.

If you use the implementation in an interesting project, drop us a line and we will feature your project on our blog. Also, if you like the article please take a moment and share it on Twitter or Facebook. 🙂
