This package provides Java implementations of various text preprocessing methods, such as tokenizers, vocabularies, text filters, and stemmers.
Add the following dependency to your POM file:

    <dependency>
        <groupId>com.github.chen0040</groupId>
        <artifactId>java-data-text</artifactId>
        <version>1.0.3</version>
    </dependency>

The package includes the following features:

- Porter Stemmer
- Punctuation Filter
- Stop Word Removal
  - XML Tag Removal
  - IP Address Removal
  - Number Removal
- English Tokenizer

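
To make the stop word removal options concrete, here is a minimal, self-contained sketch of what removing stop words, numbers, IP addresses, and XML tags from a token list could look like. It is not the library's implementation: the class `StopWordSketch`, its `filter` method, the tiny stop word list, and the regular expressions are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch; NOT the library's StopWordRemoval class.
public class StopWordSketch {
    // Tiny illustrative stop word list (an assumption; the library ships its own).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and", "to"));

    // Assumed token-level patterns; each must match the whole token.
    private static final Pattern NUMBER = Pattern.compile("[+-]?\\d+(\\.\\d+)?");
    private static final Pattern IP_ADDRESS = Pattern.compile("(\\d{1,3}\\.){3}\\d{1,3}");
    private static final Pattern XML_TAG = Pattern.compile("</?[A-Za-z][^<>]*>");

    // Returns a new list without stop words and, depending on the flags,
    // without numbers, IPv4-like addresses, and XML tags.
    public static List<String> filter(List<String> tokens, boolean removeNumbers,
                                      boolean removeIpAddress, boolean removeXmlTag) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (STOP_WORDS.contains(token.toLowerCase())) continue;
            if (removeNumbers && NUMBER.matcher(token).matches()) continue;
            if (removeIpAddress && IP_ADDRESS.matcher(token).matches()) continue;
            if (removeXmlTag && XML_TAG.matcher(token).matches()) continue;
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "the", "server", "192.168.0.1", "sent", "42", "<b>", "packets");
        // All optional removals enabled.
        System.out.println(filter(tokens, true, true, true));
        // Only stop words removed.
        System.out.println(filter(tokens, false, false, false));
    }
}
```

The three boolean flags play the same role as the `setRemoveNumbers`, `setRemoveIpAddress`, and `setRemoveXmlTag` setters on the library's `StopWordRemoval`; how the library matches tokens internally may differ from the patterns assumed here.
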
To use a text filter, create an instance of it and call its filter(...) method with a list of tokens; the method returns the filtered list.
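Stemming words with the Porter stemmer:

    import java.util.Arrays;
    import java.util.List;

    import com.github.chen0040.data.text.TextFilter;
    import com.github.chen0040.data.text.PorterStemmer;

    TextFilter stemmer = new PorterStemmer();

    List<String> words = Arrays.asList(
        "caresses", "ponies", "ties", "caress", "cats", "feed", "agreed",
        "disabled", "matting", "mating", "meeting", "milling", "messing", "meetings"
    );

    // The stemmer returns one stem per input word, in the same order.
    List<String> result = stemmer.filter(words);

    for (int i = 0; i < words.size(); ++i) {
        System.out.println(String.format("%s -> %s", words.get(i), result.get(i)));
    }

Removing stop words while keeping numbers, IP addresses, and XML tags:

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.List;
    import java.util.stream.Collectors;

    import com.github.chen0040.data.text.TextFilter;
    import com.github.chen0040.data.text.StopWordRemoval;
    // BasicTokenizer and FileUtils must also be imported; their packages are not shown here.

    StopWordRemoval filter = new StopWordRemoval();
    // Switch off the optional removals so that only stop words are dropped.
    filter.setRemoveNumbers(false);
    filter.setRemoveIpAddress(false);
    filter.setRemoveXmlTag(false);

    // Read the whole document into a single string.
    InputStream inputStream = FileUtils.getResource("documents.txt");
    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
    String content = reader.lines().collect(Collectors.joining("\n"));
    reader.close();

    List<String> before = BasicTokenizer.doTokenize(content);
    List<String> after = filter.filter(before);

Filtering out punctuation:

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.List;
    import java.util.stream.Collectors;

    import com.github.chen0040.data.text.TextFilter;
    import com.github.chen0040.data.text.PunctuationFilter;
    // BasicTokenizer and FileUtils must also be imported; their packages are not shown here.

    TextFilter filter = new PunctuationFilter();

    // Read the whole document into a single string.
    InputStream inputStream = FileUtils.getResource("documents.txt");
    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
    String content = reader.lines().collect(Collectors.joining("\n"));
    reader.close();

    List<String> before = BasicTokenizer.doTokenize(content);
    List<String> after = filter.filter(before);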
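
Because every filter takes a list of tokens and returns a list of tokens, filters can be applied one after another. The sketch below shows that idea using only the JDK; `FilterChainSketch`, `DROP_PUNCTUATION`, `LOWER_CASE`, and `applyAll` are hypothetical names for illustration and are not part of the library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical sketch of chaining list-to-list token filters; not library code.
public class FilterChainSketch {
    // Drops tokens that consist only of ASCII punctuation characters.
    public static final UnaryOperator<List<String>> DROP_PUNCTUATION = tokens -> tokens.stream()
            .filter(t -> !t.matches("\\p{Punct}+"))
            .collect(Collectors.toList());

    // Lower-cases every token.
    public static final UnaryOperator<List<String>> LOWER_CASE = tokens -> tokens.stream()
            .map(String::toLowerCase)
            .collect(Collectors.toList());

    // Applies each stage in order, feeding the output of one into the next.
    public static List<String> applyAll(List<String> tokens,
                                        List<UnaryOperator<List<String>>> stages) {
        List<String> current = tokens;
        for (UnaryOperator<List<String>> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Hello", ",", "World", "!");
        List<String> cleaned = applyAll(tokens, Arrays.asList(DROP_PUNCTUATION, LOWER_CASE));
        System.out.println(cleaned);
    }
}
```

With the library on the classpath, a stage could presumably be a method reference to a filter instance's filter(...) method, assuming that method has the list-in, list-out shape used in the examples above.
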