jTokeniser

A free software solution that combines a set of tokenisers designed to deal intuitively with natural language.

jTokeniser Ranking & Summary

  • License: GPL
  • Publisher Name: Andy Roberts
  • Operating Systems: Windows All / Unix
  • File Size: 83 KB

jTokeniser Description

Tokenising a string into its constituent tokens/words can prove tricky for non-trivial examples. In particular, when you're dealing with natural language, you must also take punctuation into account in order to isolate the words. Each of the tokenisers adopts a structure similar to java.util.StringTokenizer in terms of how to instantiate the class and extract the tokens, which makes them simple to use.

The package also includes a GUI application. You can type in, copy and paste, or even load a text file into the application. Select your tokeniser of choice (and any options of interest) and then hit the Tokenise button. The results are displayed as soon as they are processed, and you have the option to save them to a file. The GUI is particularly useful for experimenting with tokenisation methods in a teaching environment (such as an NLP course). It will also be of interest to those who wish to use the jTokeniser library but don't have the Java programming experience to use the code directly.

jTokeniser comprises the following tokenisers, all of which extend an abstract Tokeniser class:

· WhiteSpaceTokeniser - splits a string on all occurrences of whitespace, which includes spaces, newlines, tabs and linefeeds.
· StringTokeniser - basically the same as java.util.StringTokenizer with some extra methods (and it extends Tokeniser). Its default behaviour is to act as a WhiteSpaceTokeniser; however, you can specify a set of characters to be used as word delimiters.
· RegexTokeniser - a much more flexible tokeniser, as you can use a regular expression to define what a token is. For example, "\w+" means that every run of one or more word characters (letters, digits or underscores) is treated as a word. By default, it uses a regular expression equivalent to a whitespace tokeniser.
· RegexSeparatorTokeniser - can be thought of as an advanced StringTokeniser. Whereas StringTokeniser is limited to defining delimiters as a set of individual characters, RegexSeparatorTokeniser can use regular expressions for a richer and more flexible approach.
· BreakIteratorTokeniser - one of the most sophisticated tokenisers in the library, although it should only be used on natural language strings to isolate words. It comes with built-in rules about how to find words, knowing how to disregard punctuation, etc.
· SentenceTokeniser - also uses a BreakIterator like the above, but tuned towards finding sentence boundaries. The "tokens" in this tokeniser are in fact individual sentences.
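
The differences between these strategies are easier to see side by side. The sketch below does not use jTokeniser itself, since the description above does not give its method signatures. It relies only on the standard JDK classes the description mentions or implies (java.util.StringTokenizer, java.util.regex and java.text.BreakIterator). The class name TokeniserSketch and all of its method names are hypothetical and chosen for illustration; the real library may behave differently in its details.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of jTokeniser.
class TokeniserSketch {

    // StringTokeniser-style: delimiters are a set of individual characters.
    public static List<String> byDelimiterChars(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // RegexTokeniser-style: the regular expression describes the tokens.
    public static List<String> byTokenPattern(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // RegexSeparatorTokeniser-style: the regular expression describes the separators.
    public static List<String> bySeparatorPattern(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split(regex)) {
            if (!piece.isEmpty()) {   // skip the empty piece a leading separator produces
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // BreakIteratorTokeniser-style: keep only segments that start with a letter or digit.
    public static List<String> words(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // SentenceTokeniser-style: each token is a whole sentence.
    public static List<String> sentences(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                tokens.add(sentence);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hello, world! Bye.";
        System.out.println(byDelimiterChars(text, " \t\n\r\f")); // punctuation stays attached
        System.out.println(byTokenPattern(text, "\\w+"));        // punctuation is dropped
        System.out.println(bySeparatorPattern("one, two;three", "[,;]\\s*"));
        System.out.println(words(text));
        System.out.println(sentences("It works. Does it? Yes!"));
    }
}
```

Splitting on whitespace leaves punctuation attached to the neighbouring word ("Hello," and "world!"), whereas the token-pattern and word-boundary approaches return the bare words. That difference is the main reason to reach for the regex or BreakIterator based tokenisers when working with natural language.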

