Package weka.core.tokenizers
Class CharacterNGramTokenizer
java.lang.Object
weka.core.tokenizers.Tokenizer
weka.core.tokenizers.CharacterNGramTokenizer
- All Implemented Interfaces:
Serializable,Enumeration<String>,OptionHandler,RevisionHandler
Splits a string into character n-grams, with sizes ranging from the min size
to the max size.
Valid options are:
-max <int> The max size of the Ngram (default = 3).
-min <int> The min size of the Ngram (default = 1).
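To make the effect of the two size options concrete, here is a minimal standalone sketch of character n-gram extraction. It does not use Weka: the class CharNGramSketch and its ngrams method are hypothetical names, and the order in which tokens are produced (shortest first, then left to right) is an assumption of the sketch that may differ from the order the tokenizer itself emits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Weka: illustrates character n-gram extraction.
class CharNGramSketch {

    // Returns every contiguous substring of s whose length lies in [minSize, maxSize].
    // Ordering (by size, then by start position) is an arbitrary choice for this sketch.
    static List<String> ngrams(String s, int minSize, int maxSize) {
        if (minSize < 1 || maxSize < minSize) {
            throw new IllegalArgumentException("need 1 <= minSize <= maxSize");
        }
        List<String> result = new ArrayList<>();
        for (int n = minSize; n <= maxSize; n++) {
            for (int start = 0; start + n <= s.length(); start++) {
                result.add(s.substring(start, start + n));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Uses the documented defaults: min = 1, max = 3.
        System.out.println(ngrams("weka", 1, 3));
    }
}
```

With the documented defaults (min = 1, max = 3), a four-character input yields four 1-grams, three 2-grams and two 3-grams.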
- Version:
- $Revision: 10971 $
- Author:
- Sebastian Germesin (sebastian.germesin@dfki.de), Eibe Frank (eibe@cs.waikato.ac.nz)
Constructor Summary
Constructors
CharacterNGramTokenizer()
-
Method Summary
Modifier and Type     Method                        Description
int                   getNGramMaxSize()             Gets the max N of the NGram.
int                   getNGramMinSize()             Gets the min N of the NGram.
String[]              getOptions()                  Gets the current option settings for the OptionHandler.
String                getRevision()                 Returns the revision string.
String                globalInfo()                  Returns a string describing the tokenizer.
boolean               hasMoreElements()             Returns true if there are more elements available.
Enumeration<Option>   listOptions()                 Returns an enumeration of all the available options.
static void           main(String[] args)           Runs the tokenizer with the given options and strings to tokenize.
String                nextElement()                 Returns N-grams and also (N-1)-grams and ....
String                NGramMaxSizeTipText()         Returns the tip text for this property.
String                NGramMinSizeTipText()         Returns the tip text for this property.
void                  setNGramMaxSize(int value)    Sets the max size of the Ngram.
void                  setNGramMinSize(int value)    Sets the min size of the Ngram.
void                  setOptions(String[] options)  Parses a given list of options.
void                  tokenize(String s)            Sets the string to tokenize.
Methods inherited from class weka.core.tokenizers.Tokenizer
runTokenizer, tokenize
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface java.util.Enumeration
asIterator
-
Constructor Details
-
CharacterNGramTokenizer
public CharacterNGramTokenizer()
-
-
Method Details
-
globalInfo
Returns a string describing the tokenizer.
- Specified by:
- globalInfo in class Tokenizer
- Returns:
- a description suitable for displaying in the explorer/experimenter GUI
-
listOptions
Returns an enumeration of all the available options.
- Specified by:
- listOptions in interface OptionHandler
- Overrides:
- listOptions in class Tokenizer
- Returns:
- an enumeration of all available options.
-
getOptions
Gets the current option settings for the OptionHandler.
- Specified by:
- getOptions in interface OptionHandler
- Overrides:
- getOptions in class Tokenizer
- Returns:
- the list of current option settings as an array of strings
-
setOptions
Parses a given list of options. Valid options are:
-max <int> The max size of the Ngram (default = 3).
-min <int> The min size of the Ngram (default = 1).
- Specified by:
- setOptions in interface OptionHandler
- Overrides:
- setOptions in class Tokenizer
- Parameters:
- options - the list of options as an array of strings
- Throws:
- Exception - if an option is not supported
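The option format above can be illustrated with a small standalone parser. This is only a sketch under stated assumptions, not the parsing code Weka uses: NGramOptionsSketch and parse are hypothetical names, the option array is assumed to hold flag and value pairs, and an unsupported flag is reported here with IllegalArgumentException rather than the checked Exception the method declares.

```java
// Hypothetical stand-in for the option handling described above; not Weka code.
class NGramOptionsSketch {
    int maxSize = 3; // default documented for -max
    int minSize = 1; // default documented for -min

    // Reads "-max <int>" and "-min <int>" pairs; any other flag is rejected.
    static NGramOptionsSketch parse(String[] options) {
        NGramOptionsSketch parsed = new NGramOptionsSketch();
        for (int i = 0; i < options.length; i++) {
            String flag = options[i];
            if (!flag.equals("-max") && !flag.equals("-min")) {
                throw new IllegalArgumentException("Unsupported option: " + flag);
            }
            if (i + 1 >= options.length) {
                throw new IllegalArgumentException("Missing value for " + flag);
            }
            int value = Integer.parseInt(options[++i]); // consume the value that follows the flag
            if (flag.equals("-max")) {
                parsed.maxSize = value;
            } else {
                parsed.minSize = value;
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        NGramOptionsSketch o = parse(new String[] {"-max", "5", "-min", "2"});
        System.out.println("min=" + o.minSize + " max=" + o.maxSize);
    }
}
```

Options that are omitted keep their documented defaults, so an empty array leaves min at 1 and max at 3.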
-
getNGramMaxSize
public int getNGramMaxSize()
Gets the max N of the NGram.
- Returns:
- the size (N) of the NGram.
-
setNGramMaxSize
public void setNGramMaxSize(int value)
Sets the max size of the Ngram.
- Parameters:
- value - the size of the NGram.
-
NGramMaxSizeTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter GUI
-
setNGramMinSize
public void setNGramMinSize(int value)
Sets the min size of the Ngram.
- Parameters:
- value - the size of the NGram.
-
getNGramMinSize
public int getNGramMinSize()
Gets the min N of the NGram.
- Returns:
- the size (N) of the NGram.
-
NGramMinSizeTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter GUI
-
hasMoreElements
public boolean hasMoreElements()
Returns true if there are more elements available.
- Specified by:
- hasMoreElements in interface Enumeration<String>
- Specified by:
- hasMoreElements in class Tokenizer
- Returns:
- true if there are more elements available
-
nextElement
Returns N-grams and also (N-1)-grams and ....
- Specified by:
- nextElement in interface Enumeration<String>
- Specified by:
- nextElement in class Tokenizer
- Returns:
- the next element
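Because the tokenizer is exposed as an Enumeration<String>, tokens are typically read with a hasMoreElements/nextElement loop. The sketch below shows that loop against a plain java.util.Enumeration rather than a Weka object, so it runs without the Weka library. EnumerationDrainSketch and drain are hypothetical names, and the sample tokens are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of Weka: shows the Enumeration consumption loop.
class EnumerationDrainSketch {

    // Pulls every remaining element out of the enumeration, in the order it yields them.
    static List<String> drain(Enumeration<String> tokens) {
        List<String> collected = new ArrayList<>();
        while (tokens.hasMoreElements()) {
            collected.add(tokens.nextElement());
        }
        return collected;
    }

    public static void main(String[] args) {
        // Stand-in token source; a tokenizer that has been given a string
        // would be passed here instead.
        Enumeration<String> source = Collections.enumeration(List.of("ab", "a", "b"));
        System.out.println(drain(source));
    }
}
```

An enumeration can be consumed only once, so a second call to drain on the same object returns an empty list.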
-
tokenize
Sets the string to tokenize.
-
getRevision
Returns the revision string.
- Returns:
- the revision
-
main
Runs the tokenizer with the given options and strings to tokenize. The tokens are printed to stdout.
- Parameters:
- args - the commandline options and strings to tokenize
-