Package weka.core.tokenizers
Class NGramTokenizer
java.lang.Object
weka.core.tokenizers.Tokenizer
weka.core.tokenizers.CharacterDelimitedTokenizer
weka.core.tokenizers.NGramTokenizer
- All Implemented Interfaces:
Serializable, Enumeration<String>, OptionHandler, RevisionHandler
Splits a string into n-grams, with configurable minimum and maximum n-gram sizes.
Valid options are:
-delimiters <value> The delimiters to use (default ' \r\n\t.,;:'"()?!').
-max <int> The max size of the Ngram (default = 3).
-min <int> The min size of the Ngram (default = 1).
- Version:
- $Revision: 10971 $
- Author:
- Sebastian Germesin (sebastian.germesin@dfki.de), FracPete (fracpete at waikato dot ac dot nz)
Constructor Summary
Constructors
NGramTokenizer()
Method Summary
Modifier and Type | Method | Description
int | getNGramMaxSize() | Gets the max N of the NGram.
int | getNGramMinSize() | Gets the min N of the NGram.
String[] | getOptions() | Gets the current option settings for the OptionHandler.
String | getRevision() | Returns the revision string.
String | globalInfo() | Returns a string describing the tokenizer.
boolean | hasMoreElements() | Returns true if there are more elements available.
Enumeration<Option> | listOptions() | Returns an enumeration of all the available options.
static void | main(String[] args) | Runs the tokenizer with the given options and strings to tokenize.
String | nextElement() | Returns N-grams and also (N-1)-grams and so on.
String | NGramMaxSizeTipText() | Returns the tip text for this property.
String | NGramMinSizeTipText() | Returns the tip text for this property.
void | setNGramMaxSize(int value) | Sets the max size of the Ngram.
void | setNGramMinSize(int value) | Sets the min size of the Ngram.
void | setOptions(String[] options) | Parses a given list of options.
void | tokenize(String s) | Sets the string to tokenize.
Methods inherited from class weka.core.tokenizers.CharacterDelimitedTokenizer
delimitersTipText, getDelimiters, setDelimiters
Methods inherited from class weka.core.tokenizers.Tokenizer
runTokenizer, tokenize
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface java.util.Enumeration
asIterator
-
Constructor Details
-
NGramTokenizer
public NGramTokenizer()
-
-
Method Details
-
globalInfo
public String globalInfo()
Returns a string describing the tokenizer.
- Specified by:
- globalInfo in class Tokenizer
- Returns:
- a description suitable for displaying in the explorer/experimenter gui
-
listOptions
public Enumeration<Option> listOptions()
Returns an enumeration of all the available options.
- Specified by:
- listOptions in interface OptionHandler
- Overrides:
- listOptions in class CharacterDelimitedTokenizer
- Returns:
- an enumeration of all available options.
-
getOptions
public String[] getOptions()
Gets the current option settings for the OptionHandler.
- Specified by:
- getOptions in interface OptionHandler
- Overrides:
- getOptions in class CharacterDelimitedTokenizer
- Returns:
- the list of current option settings as an array of strings
-
setOptions
public void setOptions(String[] options) throws Exception
Parses a given list of options. Valid options are:
-delimiters <value> The delimiters to use (default ' \r\n\t.,;:'"()?!').
-max <int> The max size of the Ngram (default = 3).
-min <int> The min size of the Ngram (default = 1).
- Specified by:
- setOptions in interface OptionHandler
- Overrides:
- setOptions in class CharacterDelimitedTokenizer
- Parameters:
- options - the list of options as an array of strings
- Throws:
- Exception - if an option is not supported
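The shape of the options array (a flag followed by its value, with defaults when a flag is absent) can be sketched without Weka as follows. The class OptionArraySketch and the helper optionValue are hypothetical and stand in for whatever parsing Weka performs internally. In particular, this sketch silently ignores unknown flags, whereas setOptions is documented to throw an Exception for unsupported options.

```java
// Hypothetical stand-in for illustration only; not Weka's option parser.
class OptionArraySketch {

    // Returns the element that follows the given flag, or the fallback
    // value if the flag does not occur (or has no value after it).
    static String optionValue(String[] options, String flag, String fallback) {
        for (int i = 0; i + 1 < options.length; i++) {
            if (options[i].equals(flag)) {
                return options[i + 1];
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String[] options = {"-min", "2", "-max", "4"};

        // Fallbacks mirror the documented defaults: min = 1, max = 3.
        int min = Integer.parseInt(optionValue(options, "-min", "1"));
        int max = Integer.parseInt(optionValue(options, "-max", "3"));
        String delimiters = optionValue(options, "-delimiters", " \r\n\t.,;:'\"()?!");

        System.out.println("min=" + min + ", max=" + max
                + ", delimiter count=" + delimiters.length());
    }
}
```

With Weka on the classpath, an array of the same shape would be passed to setOptions(String[] options) on an NGramTokenizer instance.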
-
getNGramMaxSize
public int getNGramMaxSize()
Gets the max N of the NGram.
- Returns:
- the size (N) of the NGram.
-
setNGramMaxSize
public void setNGramMaxSize(int value)
Sets the max size of the Ngram.
- Parameters:
- value - the size of the NGram.
-
NGramMaxSizeTipText
public String NGramMaxSizeTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setNGramMinSize
public void setNGramMinSize(int value)
Sets the min size of the Ngram.
- Parameters:
- value - the size of the NGram.
-
getNGramMinSize
public int getNGramMinSize()
Gets the min N of the NGram.
- Returns:
- the size (N) of the NGram.
-
NGramMinSizeTipText
public String NGramMinSizeTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
hasMoreElements
public boolean hasMoreElements()
Returns true if there are more elements available.
- Specified by:
- hasMoreElements in interface Enumeration<String>
- Specified by:
- hasMoreElements in class Tokenizer
- Returns:
- true if there are more elements available
-
nextElement
public String nextElement()
Returns N-grams and also (N-1)-grams, and so on down to 1-grams.
- Specified by:
- nextElement in interface Enumeration<String>
- Specified by:
- nextElement in class Tokenizer
- Returns:
- the next element
-
tokenize
public void tokenize(String s)
Sets the string to tokenize. Tokenization happens immediately.
-
getRevision
public String getRevision()
Returns the revision string.
- Returns:
- the revision
-
main
public static void main(String[] args)
Runs the tokenizer with the given options and strings to tokenize. The tokens are printed to stdout.
- Parameters:
- args - the commandline options and strings to tokenize
-