Utilizing Apache OpenNLP with Spring Boot: A Practical Guide
Introduction to Apache OpenNLP
Apache OpenNLP is a Java-based library leveraging machine learning for processing natural language text. It operates under the Apache License Version 2.0 and supports various NLP functions, including Tokenization, Sentence Segmentation, Part of Speech Tagging, and Sentiment Analysis. For an in-depth understanding of its capabilities, visit the official documentation at opennlp.apache.org.
In this article, we'll explore three specific use cases of Apache OpenNLP. Before diving into the examples, it’s essential to recognize that machine learning (ML) algorithms require a set of training data, which can either be user-generated or provided by the library itself.
We will cover three distinct cases: one that doesn't require training data, another that relies on existing data from Apache OpenNLP, and the final one, where we will create our own training data.
For the use cases that rely on pre-trained models from OpenNLP, you can download them from http://opennlp.sourceforge.net/models-1.5/. The models are available in multiple languages, so make sure you select the correct one. This article focuses on English, indicated by the "en" prefix in the model file names.
Integrating OpenNLP in a Spring Boot Application
To incorporate Apache OpenNLP into a Maven-based Spring Boot application, add the following dependency in your pom.xml file:
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>2.0.0</version>
</dependency>
Case 1: Tokenization
Tokenization represents one of the foundational steps in natural language processing. This technique divides text into smaller components, which helps in understanding the context of the content. Apache OpenNLP offers an API through the SimpleTokenizer class, which can be used as follows:
SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
String[] tokens = tokenizer.tokenize("Your Service is bad");
The resulting tokens from this code snippet would be [Your, Service, is, bad].
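Because this article targets a Spring Boot application, you may prefer to expose the tokenizer through a Spring-managed bean. The sketch below uses a hypothetical TokenizerService class name; it simply delegates to SimpleTokenizer.INSTANCE.

import opennlp.tools.tokenize.SimpleTokenizer;
import org.springframework.stereotype.Service;

// Illustrative Spring bean that exposes OpenNLP tokenization to the rest of the application
@Service
public class TokenizerService {

    public String[] tokenize(String text) {
        // SimpleTokenizer is stateless and exposed as a shared singleton instance
        return SimpleTokenizer.INSTANCE.tokenize(text);
    }
}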
Case 2: Part of Speech Tagging
For ML algorithms to function effectively, reliable training data is essential. This use case employs a pre-trained model known as en-pos-maxent.bin. Through Part of Speech (POS) Tagging, we can identify various parts of speech such as nouns, verbs, and adjectives in a given text. This method is particularly useful in applications like sentiment analysis, where specific parts of speech are analyzed further.
Here's a simple Java method to extract adverbs, adjectives, and verbs from a text, as sketched below.
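The following is a minimal sketch using POSTaggerME together with the SimpleTokenizer from the previous section. The model path and the PosTagExtractor class name are assumptions about how the project is laid out; adjust them to wherever you placed the downloaded en-pos-maxent.bin file.

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.SimpleTokenizer;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class PosTagExtractor {

    // Returns tokens tagged as adverb (RB*), adjective (JJ*), or verb (VB*)
    public List<String> extractDescriptiveTokens(String text) throws IOException {
        try (InputStream modelIn = new FileInputStream("src/main/resources/nlp-model/en-pos-maxent.bin")) {
            POSModel posModel = new POSModel(modelIn);
            POSTaggerME tagger = new POSTaggerME(posModel);

            String[] tokens = SimpleTokenizer.INSTANCE.tokenize(text);
            String[] tags = tagger.tag(tokens);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < tokens.length; i++) {
                // Penn Treebank tags: adverbs start with RB, adjectives with JJ, verbs with VB
                if (tags[i].startsWith("RB") || tags[i].startsWith("JJ") || tags[i].startsWith("VB")) {
                    result.add(tokens[i] + "_" + tags[i]);
                }
            }
            return result;
        }
    }
}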
If we run the POS tagger over the text "Your Service is bad", it labels the tokens as:
Your_PRP$ Service_NN is_VBZ bad_JJ
Case 3: Sentiment Analysis / Document Categorization
The Sentiment Analyzer or Document Categorizer can be utilized to evaluate sentiments in various texts, such as product reviews or tweets. Unlike other use cases, this one requires users to generate their own training data, as Apache OpenNLP does not supply a pre-trained model for this purpose.
To create a training model, you’ll need to compile a file with training data and save it with a .txt extension. For example, the data might categorize sentiments as Angry or Neutral. Any input processed against this model will be classified into one of these categories or default to the first category (Neutral) if it can't be decisively categorized.
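The DocumentSampleStream format used during training expects each line to begin with the category label, followed by whitespace and the sample text. The lines below are purely illustrative examples of what such a file could contain:

Neutral The package arrived on Tuesday
Neutral I have not used the product long enough to judge it
Angry Your service is bad
Angry I will never buy from this store again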
After preparing the training data, you generate a model file with a .bin extension, as shown in the program below.
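This is one way to train and serialize a DoccatModel with OpenNLP's DocumentCategorizerME; the training-file name, the output path, the TrainingModelCreator class name, and the cutoff setting are illustrative choices rather than requirements.

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class TrainingModelCreator {

    public static void main(String[] args) throws Exception {
        // Each line of the training file: <category><whitespace><sample text>
        InputStreamFactory dataIn =
                new MarkableFileInputStreamFactory(new File("src/main/resources/nlp-model/en-training-data.txt"));

        try (ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, StandardCharsets.UTF_8);
             ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream)) {

            TrainingParameters params = TrainingParameters.defaultParams();
            // Keep all features, since the sample training set is tiny
            params.put(TrainingParameters.CUTOFF_PARAM, "0");

            DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, new DoccatFactory());

            // Serialize the trained model so it can be loaded later by the categorizer
            try (OutputStream modelOut = new BufferedOutputStream(
                    new FileOutputStream("src/main/resources/nlp-model/en-trained-model.bin"))) {
                model.serialize(modelOut);
            }
        }
    }
}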
Next, we utilize the created model to categorize an input:
// Load the trained model created in the previous step
File file = ResourceUtils.getFile("src/main/resources/nlp-model/en-trained-model.bin");
InputStream in = new FileInputStream(file);
DoccatModel m = new DoccatModel(in);
DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
in.close();

// Tokenize the input and pick the best-scoring category
String input = "I will not recommend this product";
String[] inputText = input.split(" ");
double[] outcomes = myCategorizer.categorize(inputText);
String category = myCategorizer.getBestCategory(outcomes);
For an input text like “I will not recommend this product,” the output category would be “Angry.”
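If you also want to see how confident the model is about each category, DocumentCategorizerME provides a per-category score map. Reusing the myCategorizer and inputText variables from the snippet above:

// Print the score the model assigns to each category (requires java.util.Map)
Map<String, Double> scores = myCategorizer.scoreMap(inputText);
scores.forEach((cat, score) -> System.out.println(cat + " -> " + score));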
Conclusion
In summary, we've explored three significant use cases of Apache OpenNLP, including how to utilize existing data models and create custom training data. The complete code for this article is available on GitHub, where you can also run unit test cases to validate the different use cases. If you find this information helpful, consider giving a star to the repository!