Saturday, November 29, 2025

New open-source Machine Learning Framework written in Java


I’m happy to announce that the Datumbox Machine Learning Framework is now open-sourced under GPL 3.0 and you can download its code from GitHub!

What is this Framework?

The Datumbox Machine Learning Framework is an open-source framework written in Java which enables the rapid development of Machine Learning models and Statistical applications. It is the code that currently powers the Datumbox API. The main focus of the framework is to include a large number of machine learning algorithms & statistical methods and to be able to handle small to medium-sized datasets. Even though the framework aims to assist the development of models from various fields, it also provides tools that are particularly useful in Natural Language Processing and Text Analysis applications.

What types of models/algorithms are supported?

The framework is divided into several Layers such as Machine Learning, Statistics, Mathematics, Algorithms and Utilities. Each of them provides a series of classes that are used for training machine learning models. The two most important layers are the Statistics layer and the Machine Learning layer.

The Statistics layer provides classes for calculating descriptive statistics, performing various types of sampling, estimating CDFs and PDFs from commonly used probability distributions and performing over 35 parametric and non-parametric tests. Such classes are usually necessary while performing exploratory data analysis, sampling and feature selection.
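
To give a feel for the kind of building blocks this layer covers, here is a minimal plain-Java sketch of a few of them: descriptive statistics, the PDF and CDF of one common distribution (the exponential), and simple random sampling without replacement. It does not use the framework's classes; StatsSketch and all of its method names are made up for illustration, so look at the JUnit tests in the repository for the real API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; NOT part of the Datumbox API.
final class StatsSketch {

    private StatsSketch() {}

    /** Arithmetic mean of a non-empty sample. */
    static double mean(double[] x) {
        if (x.length == 0) throw new IllegalArgumentException("empty sample");
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    /** Unbiased sample variance (divides by n - 1); needs at least two values. */
    static double sampleVariance(double[] x) {
        if (x.length < 2) throw new IllegalArgumentException("need at least 2 values");
        double m = mean(x);
        double sumOfSquares = 0.0;
        for (double v : x) sumOfSquares += (v - m) * (v - m);
        return sumOfSquares / (x.length - 1);
    }

    /** Median of a non-empty sample; the input array is left untouched. */
    static double median(double[] x) {
        if (x.length == 0) throw new IllegalArgumentException("empty sample");
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1) ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** PDF of the exponential distribution with rate lambda > 0. */
    static double exponentialPdf(double x, double lambda) {
        if (lambda <= 0) throw new IllegalArgumentException("lambda must be positive");
        return x < 0 ? 0.0 : lambda * Math.exp(-lambda * x);
    }

    /** CDF of the exponential distribution with rate lambda > 0. */
    static double exponentialCdf(double x, double lambda) {
        if (lambda <= 0) throw new IllegalArgumentException("lambda must be positive");
        return x < 0 ? 0.0 : 1.0 - Math.exp(-lambda * x);
    }

    /** Simple random sample of k items, drawn without replacement. */
    static <T> List<T> sampleWithoutReplacement(List<T> population, int k, Random rng) {
        if (k < 0 || k > population.size()) throw new IllegalArgumentException("invalid sample size");
        List<T> copy = new ArrayList<>(population);
        Collections.shuffle(copy, rng); // shuffle a copy, then keep the first k items
        return new ArrayList<>(copy.subList(0, k));
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println("mean     = " + mean(data));
        System.out.println("variance = " + sampleVariance(data));
        System.out.println("median   = " + median(data));
        System.out.println("sample   = " + sampleWithoutReplacement(Arrays.asList(1, 2, 3, 4, 5), 3, new Random(42)));
    }
}
```

Passing the Random instance in, rather than creating it inside the method, keeps the sampling reproducible when a fixed seed is used.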

The Machine Learning layer provides classes that can be used in a large number of problems including Classification, Regression, Cluster Analysis, Topic Modeling, Dimensionality Reduction, Feature Selection, Ensemble Learning and Recommender Systems. Here are some of the supported algorithms: LDA, Max Entropy, Naive Bayes, SVM, Bootstrap Aggregating, Adaboost, Kmeans, Hierarchical Clustering, Dirichlet Process Mixture Models, Softmax Regression, Ordinal Regression, Linear Regression, Stepwise Regression, PCA and more.
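
As a small illustration of the simplest algorithm in that list, here is a sketch of Linear Regression with a single predictor, fitted by ordinary least squares in closed form. This is plain Java written for this post, not the framework's implementation: the class name SimpleLinearRegressionSketch and its fit/predict methods are assumptions of mine and do not mirror the Datumbox API.

```java
// Hypothetical class for illustration only; NOT the Datumbox implementation.
final class SimpleLinearRegressionSketch {

    final double intercept;
    final double slope;

    private SimpleLinearRegressionSketch(double intercept, double slope) {
        this.intercept = intercept;
        this.slope = slope;
    }

    /** Fits y = intercept + slope * x by ordinary least squares. */
    static SimpleLinearRegressionSketch fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("x and y need the same length (at least 2)");
        }
        int n = x.length;

        // Means of the predictor and of the response.
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // slope = sum((x - meanX) * (y - meanY)) / sum((x - meanX)^2)
        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            sxy += dx * (y[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0.0) throw new IllegalArgumentException("all x values are identical");

        double slope = sxy / sxx;
        return new SimpleLinearRegressionSketch(meanY - slope * meanX, slope);
    }

    /** Predicted response for a new x value. */
    double predict(double x) {
        return intercept + slope * x;
    }

    public static void main(String[] args) {
        SimpleLinearRegressionSketch model =
                fit(new double[] {1, 2, 3, 4}, new double[] {3, 5, 7, 9});
        System.out.println("slope = " + model.slope + ", intercept = " + model.intercept);
        System.out.println("prediction for x = 5: " + model.predict(5));
    }
}
```

The training data in main lies exactly on the line y = 1 + 2x, so the fitted slope and intercept should come out at (or within rounding error of) 2 and 1.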

Datumbox Framework VS Mahout VS Scikit-Learn

Both Mahout and Scikit-Learn are great projects, and they have completely different targets. Mahout supports only a very limited number of algorithms, namely those which can be parallelized and can thus use Hadoop’s Map-Reduce framework to handle Big Data. Scikit-Learn, on the other hand, supports a large number of algorithms but it can’t handle huge amounts of data. Moreover it is developed in Python, which is a great language for prototyping and Scientific Computing but not my personal favorite for software development.

The Datumbox Framework sits in the middle of the two solutions. It tries to support a large number of algorithms and it is written in Java. This means that it can be incorporated more easily into production code, it can be tweaked more easily to reduce memory consumption and it can be used in real-time systems. Finally, even though the Datumbox Framework is currently capable of handling medium-sized datasets, it is within my plans to expand it to handle large datasets.

How stable is it?

The early versions of the framework (up to 0.3.x) were developed in August and September of 2013 and they were written in PHP (yeap!). During May and June 2014 (versions 0.4.x), the framework was rewritten in Java and enhanced with additional features. Both branches were heavily tested in commercial applications including the Datumbox API. The current version is 0.5.0 and it seems mature enough to be released as the first public alpha version of the framework. Having said that, it is important to note that some functionalities of the framework are tested more thoroughly than others. Moreover, since this version is alpha, you should expect drastic changes in future releases.

Why did I write it and why am I open-sourcing it?

My involvement with Machine Learning and NLP dates back to 2009, when I co-founded WebSEOAnalytics.com. Since then I have been developing implementations of various machine learning algorithms for various projects and applications. Unfortunately most of the original implementations were very problem-specific and they could hardly be used for any other problem. In August 2013 I decided to start Datumbox as a personal project and develop a framework that provides the tools for developing machine learning models, focusing on the area of NLP and Text Classification. My target was to build a framework that could be reused in the future for quickly developing machine learning models, incorporated in projects that require machine learning components, or offered as a service (Machine Learning as a Service).

And here I am now, several lines of code later, open-sourcing the project. Why? The honest answer is that at this point, it is not within my plans to go through a “let’s build a new start-up” journey. At the same time I felt that keeping the code on my hard disk in case I need it in the future does not make sense. So the only logical thing to do was to open-source it. 🙂

Documentation?

If you read the previous two paragraphs, you probably saw this coming. Since the framework was not developed with the idea that I would share it with others, the documentation is poor/non-existent. Most of the classes and public methods are not properly commented and there is no document describing the architecture of the code. Fortunately all the class names are self-explanatory and the framework provides JUnit tests for every public method & algorithm, and these can be used as examples of how to use the code. I hope that with the help of the community we will build proper documentation, so I am counting on you!

Current Limitations and Future Development

As with every piece of software (and especially open-source projects in alpha version), the Datumbox Machine Learning Framework comes with its own unique and adorable limitations. Let’s dig into them:

  1. Documentation: As mentioned earlier, the documentation is poor.
  2. No Multithreading: Unfortunately the framework does not currently support Multithreading. Of course we should note that not all machine learning algorithms can be parallelized.
  3. Code Examples: Since the framework has just been published, you can’t find any code examples on the web other than those provided by the framework in the form of JUnit tests.
  4. Code Structure: Creating a solid architecture for any large project is always challenging, let alone when you have to deal with Machine Learning algorithms that differ significantly (supervised learning, unsupervised learning, dimensionality reduction algorithms etc).
  5. Model Persistence and Large Data Collections: Currently the models can be trained and stored either in files on disk or in MongoDB databases. To be able to handle large amounts of data, other solutions must be investigated. For example MapDB seems like a good candidate for storing data and parameters while training. Moreover it is important to remove any 3rd party libraries that currently handle the persistence of the models and develop a better, DRY and modular solution.
  6. New algorithms/tests/models: There are so many great techniques that are not currently supported (especially for time series analysis).
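
To make the idea behind point 5 a bit more concrete, here is a rough sketch of what a modular persistence layer could look like: the training code talks only to a small storage interface, and the backend (files today, something like MapDB or MongoDB later) becomes a swappable implementation. Everything here is hypothetical: ModelStore, FileModelStore and InMemoryModelStore are names I made up for the example, they do not exist in the framework, and plain Java serialization merely stands in for whatever format a real solution would use.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface for illustration; NOT part of the Datumbox API.
interface ModelStore {
    void save(String name, Serializable model) throws IOException;

    <T> T load(String name, Class<T> type) throws IOException, ClassNotFoundException;
}

/** Stores each model as one Java-serialized file inside a directory. */
final class FileModelStore implements ModelStore {
    private final Path dir;

    FileModelStore(Path dir) {
        this.dir = dir;
    }

    @Override
    public void save(String name, Serializable model) throws IOException {
        Files.createDirectories(dir);
        try (ObjectOutputStream out =
                new ObjectOutputStream(Files.newOutputStream(dir.resolve(name + ".model")))) {
            out.writeObject(model);
        }
    }

    @Override
    public <T> T load(String name, Class<T> type) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                new ObjectInputStream(Files.newInputStream(dir.resolve(name + ".model")))) {
            return type.cast(in.readObject());
        }
    }
}

/** Keeps models in a map; returns the stored instance itself, not a copy. */
final class InMemoryModelStore implements ModelStore {
    private final Map<String, Serializable> models = new HashMap<>();

    @Override
    public void save(String name, Serializable model) {
        models.put(name, model);
    }

    @Override
    public <T> T load(String name, Class<T> type) {
        Serializable model = models.get(name);
        if (model == null) throw new IllegalArgumentException("unknown model: " + name);
        return type.cast(model);
    }
}
```

With this shape, an algorithm would receive a ModelStore in its constructor and never know which backend is behind it, which is the kind of separation that would let the 3rd party persistence libraries be replaced one at a time.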

Unfortunately all of the above are too much work and there is so little time. That’s why, if you are interested in the project, step forward and give me a hand with any of the above. Moreover I would love to hear from people who have experience in open-sourcing medium to large projects and could provide any tips on how to manage them. Additionally I would be grateful to any brave soul who would dare to look into the code and document some classes or public methods. Last but not least, if you use the framework for anything interesting, please drop me a line or share it in a blog post.

 

Finally I would like to thank my love Kyriaki for tolerating me while I was writing this project, my friend and super-ninja-Java-developer Eleftherios Bampaletakis for helping out with important Java issues, and you for getting involved in the project. I am looking forward to your comments.
