It seems like a lot of people are using microservices hosted by third-party vendors to handle natural language processing and the other tasks that typically get lumped into the "Machine Learning" bucket. But plenty of environments need self-contained systems. It's simple to build a small project for a task using something like Python's nltk, but once you start hammering the microservices you built with those tools, you start to feel how their ease of use can make them more sluggish than purpose-built tooling (I guess that's true for everything in software development).
So let's look at what additional frameworks can be put into your project, or containerized alongside it:
- Apache Spark is a big data processing engine that runs on top of Hadoop (among other things). MLlib is its machine learning library: written in Scala, it can be used from Java, Python, and R projects, and it connects right into Hadoop for scale-out processing of sources, models, and files. This is probably the easiest way to get started with the basic machine learning tasks we typically assign to software development projects in the beginning – classification, clustering, collaborative filtering, dimensionality reduction, etc. (there's a quick sketch after this list).
- Apache Jena is one of the better frameworks for RDF serialization and visualization. Text search via SPARQL will look similar(ish) to standard SQL for those already familiar with it. Jena has pretty expansive inference rules, and with Fuseki it's fairly straightforward to expose standard REST endpoints. It's not simple, but if you're doing larger projects the learning curve can be worth it, especially if you're happy with the Elasticsearch stack. There's a SPARQL example after the list.
- TensorFlow for Java is the Java binding for the incredibly popular TensorFlow stack. It's easy to build and deploy models, there are lots of pre-trained models available, and there's an easy migration path to larger solutions – but there's also a lot you won't need. TensorFlow grew out of Google Brain's work starting in 2011 (as DistBelief) and has been open source since 2015, so it's both mature and scalable, though the scalable part requires more skill (sketch below).
- RapidMiner has one of the best communities and a lot of documentation. The Java API is solid and covers a lot of machine learning algorithms. The Radoop offering is similar to what you might build with the Apache tools above, but as a commercial product. If the pricing makes sense for you, it's a good option given how much it reduces development time.
- Massive Online Analysis (MOA) is purpose-built for machine learning and data mining on large data streams in real time. If you're mining streams, definitely check out MOA (there's a sketch of its train-and-test loop below).
- MALLET does many of the other generic tasks like classification, modeling, and anomaly detection, but it shines at topic modeling, hidden Markov models, and a few other branches of natural language processing. The GRMM add-on adds more graphical modeling options (thus the G). A topic-modeling example follows the list.
- ELKI is great for data mining, especially k-means and k-nearest neighbor (KNN) – common when trying to find objects similar to one another given as much metadata as we can get, like 0-day malware detection or finding the next song you'd want to listen to.
- The Java Machine Learning Library, known generically as Java-ML, comes with tooling similar to Python's nltk. It hasn't been updated in a while, but consider this: many of the things we think of as machine learning today date back to the 1960s, so provided it still compiles and the APIs are documented, that's not a huge issue. There's a quick KNN example below.
- Weka comes with a number of algorithms that can be quickly applied to all sorts of datasets. You can use it as a stand-alone GUI tool or pull in pieces through the Java API. It also has a fairly large community and lots of good training materials. I find one of the harder things to do is to get data into a place where it can be analyzed, and Weka has tooling to assist with that as well as with training models. And everything in the GUI is available through the Java API (or at least everything I've tried – see the sketch after the list).
- Keras models can run on the JVM via DL4J (Deeplearning4j – a Deep Learning library "4" Java). It's an Eclipse Foundation project, so it's one of the more native Java experiences. It also has plenty of APIs for data normalization, one of the more difficult things with larger projects IMHO. An import example follows the list.
- Another Java machine learning framework is Encog, which also provides a number of machine learning algorithms. I haven't used it much, so I won't go into detail, but a number of people have told me I should use it more.
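To give a flavor of a few of these, here's roughly what k-means clustering looks like with Spark's MLlib from Java – a minimal sketch based on the standard Spark examples, assuming a local Spark dependency and a libsvm-formatted data file at the (hypothetical) path shown:

```java
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkKMeansSketch {
    public static void main(String[] args) {
        // Local session for experimentation; point master at a cluster for real work
        SparkSession spark = SparkSession.builder()
                .appName("kmeans-sketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical path; any libsvm-formatted feature file works
        Dataset<Row> data = spark.read().format("libsvm").load("data/sample_kmeans_data.txt");

        // Cluster the features into k=3 groups
        KMeans kmeans = new KMeans().setK(3).setSeed(1L);
        KMeansModel model = kmeans.fit(data);

        for (Vector center : model.clusterCenters()) {
            System.out.println("Cluster center: " + center);
        }
        spark.stop();
    }
}
```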
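And here's the Jena side of things: loading an RDF file and running a SPARQL SELECT against it. A minimal sketch – the data.ttl file and the FOAF property are placeholders for whatever your graph actually contains:

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class JenaSparqlSketch {
    public static void main(String[] args) {
        // Load a Turtle file into an in-memory model (hypothetical file name)
        Model model = ModelFactory.createDefaultModel();
        model.read("data.ttl");

        // SPARQL reads similar(ish) to SQL: SELECT variables, WHERE graph patterns
        String query = "PREFIX foaf: <http://xmlns.com/foaf/0.1/> "
                     + "SELECT ?person ?name WHERE { ?person foaf:name ?name }";

        try (QueryExecution qexec = QueryExecutionFactory.create(query, model)) {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("name"));
            }
        }
    }
}
```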
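For TensorFlow's Java binding, the common pattern is training elsewhere (usually Python) and serving the SavedModel from Java. A minimal sketch using the classic libtensorflow API; the model path and the input/output op names are assumptions that depend entirely on how your model was exported:

```java
import org.tensorflow.SavedModelBundle;
import org.tensorflow.Tensor;

public class TensorFlowServeSketch {
    public static void main(String[] args) {
        // Load a model exported with the "serve" tag (hypothetical path)
        try (SavedModelBundle bundle = SavedModelBundle.load("/models/my_model", "serve")) {
            float[][] features = {{1.0f, 2.0f, 3.0f, 4.0f}};
            try (Tensor<?> input = Tensor.create(features);
                 Tensor<?> output = bundle.session().runner()
                         .feed("input", input)   // op names depend on your export
                         .fetch("output")
                         .run().get(0)) {
                float[][] prediction = new float[1][1];
                output.copyTo(prediction);
                System.out.println("Prediction: " + prediction[0][0]);
            }
        }
    }
}
```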
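MOA's API centers on a test-then-train ("prequential") loop over a stream. This sketch follows MOA's own tutorial, using a synthetic stream generator so it runs without any data files (note: in older MOA versions, nextInstance() returns the Instance directly rather than a wrapper):

```java
import com.yahoo.labs.samoa.instances.Instance;
import moa.classifiers.Classifier;
import moa.classifiers.trees.HoeffdingTree;
import moa.streams.generators.RandomRBFGenerator;

public class MoaStreamSketch {
    public static void main(String[] args) {
        Classifier learner = new HoeffdingTree();
        RandomRBFGenerator stream = new RandomRBFGenerator();
        stream.prepareForUse();

        learner.setModelContext(stream.getHeader());
        learner.prepareForUse();

        int correct = 0;
        int samples = 0;
        while (stream.hasMoreInstances() && samples < 10000) {
            Instance instance = stream.nextInstance().getData();
            // Test first, then train: prequential evaluation
            if (learner.correctlyClassifies(instance)) {
                correct++;
            }
            learner.trainOnInstance(instance);
            samples++;
        }
        System.out.println("Accuracy: " + (100.0 * correct / samples) + "%");
    }
}
```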
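MALLET's topic modeling runs everything through a pipe chain that turns raw strings into feature sequences, then hands them to an LDA trainer. A minimal sketch along the lines of MALLET's developer guide, with a toy in-memory corpus standing in for real documents:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.regex.Pattern;

import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.CharSequenceLowercase;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.pipe.iterator.StringArrayIterator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

public class MalletTopicsSketch {
    public static void main(String[] args) throws Exception {
        // Pipe chain: lowercase, tokenize, convert to feature sequences
        ArrayList<Pipe> pipes = new ArrayList<>();
        pipes.add(new CharSequenceLowercase());
        pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
        pipes.add(new TokenSequence2FeatureSequence());

        InstanceList instances = new InstanceList(new SerialPipes(pipes));
        String[] docs = {
            "the cat sat on the mat",
            "dogs and cats make good pets",
            "the stock market fell sharply today"
        };
        instances.addThruPipe(new StringArrayIterator(docs));

        // LDA with 2 topics; alpha/beta priors from the developer guide
        ParallelTopicModel model = new ParallelTopicModel(2, 1.0, 0.01);
        model.addInstances(instances);
        model.setNumThreads(2);
        model.setNumIterations(100);
        model.estimate();

        // Print the top words per topic
        Object[][] topWords = model.getTopWords(5);
        for (int t = 0; t < topWords.length; t++) {
            System.out.println("Topic " + t + ": " + Arrays.toString(topWords[t]));
        }
    }
}
```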
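Java-ML keeps things about as simple as it gets: load a delimited file into a Dataset, build a classifier, classify. This sketch mirrors the project's own tutorial and assumes a local copy of the classic iris.data file (four feature columns, comma-separated, class label last):

```java
import java.io.File;

import net.sf.javaml.classification.Classifier;
import net.sf.javaml.classification.KNearestNeighbors;
import net.sf.javaml.core.Dataset;
import net.sf.javaml.core.Instance;
import net.sf.javaml.tools.data.FileHandler;

public class JavaMlKnnSketch {
    public static void main(String[] args) throws Exception {
        // 4 = index of the class label column; "," = field delimiter
        Dataset data = FileHandler.loadDataset(new File("iris.data"), 4, ",");

        Classifier knn = new KNearestNeighbors(5);
        knn.buildClassifier(data);

        // Sanity check: classify the training instances themselves
        int correct = 0;
        for (Instance instance : data) {
            if (knn.classify(instance).equals(instance.classValue())) {
                correct++;
            }
        }
        System.out.println(correct + "/" + data.size() + " classified correctly");
    }
}
```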
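With Weka, everything you click through in the Explorer GUI maps to a Java call. Here's loading an ARFF file, training a J48 decision tree, and cross-validating it – a minimal sketch assuming a local iris.arff (the sample dataset that ships with Weka):

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaJ48Sketch {
    public static void main(String[] args) throws Exception {
        // Load the dataset; the last attribute is the class label
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Train a C4.5-style decision tree
        J48 tree = new J48();
        tree.buildClassifier(data);

        // 10-fold cross-validation on a fresh (untrained) classifier,
        // matching the Explorer GUI default
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```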
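The DL4J Keras import is about as direct as it sounds: export a Keras model to HDF5 in Python, then load it on the JVM. A minimal sketch – the model.h5 file and the 784-wide input are assumptions standing in for whatever your network actually expects:

```java
import java.util.Arrays;

import org.deeplearning4j.nn.modelimport.keras.KerasModelImport;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class Dl4jKerasSketch {
    public static void main(String[] args) throws Exception {
        // Load a Keras Sequential model saved with model.save("model.h5")
        MultiLayerNetwork model =
                KerasModelImport.importKerasSequentialModelAndWeights("model.h5");

        // Dummy input: one sample, 784 features (e.g., a flattened 28x28 image)
        INDArray input = Nd4j.rand(1, 784);
        INDArray output = model.output(input);
        System.out.println("Output shape: " + Arrays.toString(output.shape()));
    }
}
```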
As you get further down the path, check out Gluon, a collaboration between Microsoft and Amazon. One of the most challenging aspects of machine learning projects is keeping models up to date. The AWS Labs Deep Java Library (DJL) is open source, can be run without being connected to AWS if needed, and its Java API is straightforward to use for testing the various weird learning models you come up with – and there are a lot of bits and pieces about it on the GitHubs. You can also take projects from these libraries and bring them into SageMaker, MXNet, Transcribe, Lex, or Rekognition in the AWS ecosystem. There's a quick DJL example below.
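To make the DJL point concrete, here's roughly what pulling a pre-trained image classifier from its model zoo and running a prediction looks like. A minimal sketch – the image URL is a placeholder, and depending on your DJL version you may load via ModelZoo.loadModel(criteria) instead:

```java
import ai.djl.Application;
import ai.djl.inference.Predictor;
import ai.djl.modality.Classifications;
import ai.djl.modality.cv.Image;
import ai.djl.modality.cv.ImageFactory;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

public class DjlClassifySketch {
    public static void main(String[] args) throws Exception {
        // Ask the model zoo for any image-classification model
        Criteria<Image, Classifications> criteria = Criteria.builder()
                .optApplication(Application.CV.IMAGE_CLASSIFICATION)
                .setTypes(Image.class, Classifications.class)
                .build();

        try (ZooModel<Image, Classifications> model = criteria.loadModel();
             Predictor<Image, Classifications> predictor = model.newPredictor()) {
            // Placeholder URL; local files work through ImageFactory too
            Image img = ImageFactory.getInstance()
                    .fromUrl("https://example.com/kitten.jpg");
            Classifications result = predictor.predict(img);
            System.out.println(result.best());
        }
    }
}
```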
In general, the first step in any machine learning project is to decide exactly what you want to do. It's all pretty advanced statistical analysis, and while we're all unique, chances are that what you're trying to accomplish has been done before and there's a library or some sample code out there to get you started. Whether you want to recommend tags, search for nearest neighbors, perform sentiment analysis, classify content, or go much deeper, you're not alone.