Text processing
Simple text processing can be done with the standard Java library alone, using classes such as StringTokenizer, the java.text package, or regular expressions (java.util.regex).
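As a rough illustration of what the standard library offers, the following sketch tokenizes a string with StringTokenizer, extracts lowercase word tokens with a regular expression, and splits text into sentences with java.text.BreakIterator. The class name SimpleTextProcessing, its method names, and the sample sentence are hypothetical and chosen only for this example; the word pattern is one reasonable choice among many, not a recommended tokenizer.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class SimpleTextProcessing {

    // Assumption: a "word" is a run of Unicode letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Splits on whitespace only; punctuation stays attached to the tokens.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // Finds every match of WORD and lowercases it, dropping punctuation.
    static List<String> regexTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Uses the locale-aware sentence boundaries provided by java.text.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        boundaries.setText(text);
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Java is fun, really. It has many libraries.";
        System.out.println(whitespaceTokens(text));
        System.out.println(regexTokens(text));
        System.out.println(sentences(text));
    }
}
```

The three methods show the trade-off: StringTokenizer is the simplest but leaves punctuation in the tokens, a regular expression gives control over what counts as a word, and BreakIterator handles boundaries in a locale-aware way.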
Beyond the standard library, a wide variety of text processing frameworks is available for Java, including:
- Apache Lucene (https://lucene.apache.org/) is a library that is used for information retrieval
- Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/)
- Apache OpenNLP (https://opennlp.apache.org/)
- LingPipe (http://alias-i.com/lingpipe/)
- GATE (https://gate.ac.uk/)
- MALLET (http://mallet.cs.umass.edu/)
- Smile (http://haifengl.github.io/smile/) also has some algorithms for NLP
Most NLP libraries offer similar functionality and cover similar algorithms, so choosing among them is usually a matter of habit or taste. They typically provide tokenization, parsing, part-of-speech tagging, named entity recognition, and other text processing algorithms. Some of them (such as Stanford CoreNLP) support multiple languages, while others support only English.
We will cover some of these libraries in Chapter 6, Working with Text - Natural Language Processing and Information Retrieval.