Package weka.core.converters
Class DictionarySaver
java.lang.Object
weka.core.converters.AbstractSaver
weka.core.converters.AbstractFileSaver
weka.core.converters.DictionarySaver
- All Implemented Interfaces:
Serializable, CapabilitiesHandler, CapabilitiesIgnorer, BatchConverter, FileSourcedConverter, IncrementalConverter, Saver, EnvironmentHandler, OptionHandler, RevisionHandler
public class DictionarySaver
extends AbstractFileSaver
implements BatchConverter, IncrementalConverter
Writes a dictionary constructed from string
attributes in incoming instances to a destination.
Valid options are:
-binary-dict Save as a binary serialized dictionary
-R <range> Specify range of attributes to act on. This is a comma separated list of attribute indices, with "first" and "last" valid values.
-V Set attributes selection mode. If false, only selected attributes in the range will be worked on. If true, only non-selected attributes will be processed
-L Convert all tokens to lowercase when matching against dictionary entries.
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-stopwords-handler <spec> The stopwords handler to use (default = Null)
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-P <integer> Prune the dictionary every x instances (default = 0 - i.e. no periodic pruning)
-W <integer> The number of words (per class if there is a class attribute assigned) to attempt to keep.
-M <integer> The minimum term frequency to use when pruning the dictionary (default = 1).
-O If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but are based on the documents in all the classes (even if a class attribute is set).
-sort Sort the dictionary alphabetically
-i <the input file> The input file
-o <the output file> The output file
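The flags above are passed as an ordinary argument array, the same form the inherited setOptions method takes. A minimal sketch of assembling such an array follows; the DictionarySaverArgs helper, its parameters, and the sample values (range, counts, file name) are hypothetical and not part of Weka.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Weka): collects a few of the documented
// flags into an argument array.
final class DictionarySaverArgs {

    static String[] build(String range, boolean lowerCase, int wordsToKeep,
                          int minTermFreq, boolean sort, String outputFile) {
        List<String> args = new ArrayList<>();
        args.add("-R");
        args.add(range);                          // e.g. "first-last"
        if (lowerCase) {
            args.add("-L");                       // flag, takes no value
        }
        args.add("-W");
        args.add(Integer.toString(wordsToKeep));
        args.add("-M");
        args.add(Integer.toString(minTermFreq));
        if (sort) {
            args.add("-sort");                    // flag, takes no value
        }
        args.add("-o");
        args.add(outputFile);
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) {
        // Sample values are illustrative only.
        System.out.println(String.join(" ",
            build("first-last", true, 1000, 2, true, "dict.txt")));
    }
}
```

Flags such as -L and -sort appear alone, while -R, -W, -M and -o are each followed by their value as a separate array element.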
- Version:
- $Revision: 12690 $
- Author:
- Mark Hall (mhall{[at]}pentaho{[dot]}com)
Field Summary
Fields inherited from interface weka.core.converters.Saver
BATCH, INCREMENTAL, NONE
Constructor Summary
Constructors
DictionarySaver()
Method Summary
Modifier and Type, Method - Description
String getAttributeIndices() - Gets the current range selection.
Capabilities getCapabilities() - Returns the Capabilities of this saver.
boolean getDoNotOperateOnPerClassBasis() - Get the DoNotOperateOnPerClassBasis value.
String getFileDescription() - To be overridden.
boolean getInvertSelection() - Gets whether the supplied columns are to be processed or skipped.
boolean getKeepDictionarySorted() - Get whether to keep the dictionary sorted alphabetically or not.
boolean getLowerCaseTokens() - Gets whether the tokens are to be downcased.
int getMinTermFreq() - Get the MinTermFreq value.
long getPeriodicPruning() - Gets the rate at which the dictionary is periodically pruned.
String getRevision() - Returns the revision string.
boolean getSaveBinaryDictionary() - Get whether to save the dictionary as a binary serialized dictionary, rather than a plain text one.
Stemmer getStemmer() - Returns the current stemming algorithm, null if none is used.
StopwordsHandler getStopwordsHandler() - Gets the stopwords handler.
Tokenizer getTokenizer() - Returns the current tokenizer algorithm.
int getWordsToKeep() - Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
String globalInfo() - Returns a string describing this Saver.
static void main(String[] args)
void resetOptions() - Resets the options.
void resetWriter() - Sets the writer to null.
void setAttributeIndices(String rangeList) - Sets which attributes are to be worked on.
void setDestination(OutputStream output) - Sets the destination output stream.
void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis) - Set the DoNotOperateOnPerClassBasis value.
void setInvertSelection(boolean invert) - Sets whether selected columns should be processed or skipped.
void setKeepDictionarySorted(boolean sorted) - Set whether to keep the dictionary sorted alphabetically or not.
void setLowerCaseTokens(boolean downCaseTokens) - Sets whether the tokens are to be downcased.
void setMinTermFreq(int newMinTermFreq) - Set the MinTermFreq value.
void setPeriodicPruning(long newPeriodicPruning) - Sets the rate at which the dictionary is periodically pruned.
void setSaveBinaryDictionary(boolean binary) - Set whether to save the dictionary as a binary serialized dictionary, rather than a plain text one.
void setStemmer(Stemmer value) - Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
void setStopwordsHandler(StopwordsHandler value) - Sets the stopwords handler to use.
void setTokenizer(Tokenizer value) - Sets the tokenizer algorithm to use.
void setWordsToKeep(int newWordsToKeep) - Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
void writeBatch() - Writes to a file in batch mode. To be overridden.
void writeIncremental(Instance inst) - Method for incremental saving.
Methods inherited from class weka.core.converters.AbstractFileSaver
cancel, filePrefix, getFileExtension, getFileExtensions, getOptions, getUseRelativePath, getWriter, listOptions, retrieveDir, retrieveFile, runFileSaver, setDestination, setDir, setDirAndPrefix, setEnvironment, setFile, setFilePrefix, setOptions, setUseRelativePath, useRelativePathTipText
Methods inherited from class weka.core.converters.AbstractSaver
doNotCheckCapabilitiesTipText, getDoNotCheckCapabilities, getInstances, getWriteMode, resetStructure, setDoNotCheckCapabilities, setInstances, setRetrieval, setStructure
-
Constructor Details
-
DictionarySaver
public DictionarySaver()
-
-
Method Details
-
globalInfo
public String globalInfo()
Returns a string describing this Saver.
- Returns:
- a description of the Saver suitable for displaying in the explorer/experimenter GUI
-
setSaveBinaryDictionary
@OptionMetadata(displayName="Save dictionary in binary form", description="Save as a binary serialized dictionary", commandLineParamName="binary-dict", commandLineParamSynopsis="-binary-dict", commandLineParamIsFlag=true, displayOrder=2)
public void setSaveBinaryDictionary(boolean binary)
Set whether to save the dictionary as a binary serialized dictionary, rather than a plain text one.
- Parameters:
binary - true if the dictionary is to be saved as binary rather than plain text
-
getSaveBinaryDictionary
public boolean getSaveBinaryDictionary()
Get whether to save the dictionary as a binary serialized dictionary, rather than a plain text one.
- Returns:
- true if the dictionary is to be saved as binary rather than plain text
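To make the binary versus plain-text distinction concrete, here is a minimal sketch that writes a term-count map either through Java serialization or as text lines. The DictionaryWriteSketch class and the "term,count" line layout are assumptions made for illustration; the file layout Weka actually writes is not described on this page.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the format used by DictionarySaver.
final class DictionaryWriteSketch {

    // Writes term counts as serialized bytes (binary) or "term,count" lines (text).
    static void write(Map<String, Integer> dict, OutputStream out, boolean binary)
            throws IOException {
        Map<String, Integer> sorted = new TreeMap<>(dict); // TreeMap is Serializable
        if (binary) {
            ObjectOutputStream oos = new ObjectOutputStream(out);
            oos.writeObject(sorted);
            oos.flush();
        } else {
            Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8);
            for (Map.Entry<String, Integer> e : sorted.entrySet()) {
                w.write(e.getKey() + "," + e.getValue() + "\n");
            }
            w.flush();
        }
    }

    // Convenience wrapper that captures the output in memory.
    static byte[] toBytes(Map<String, Integer> dict, boolean binary) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            write(dict, buf, binary);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Reads back a dictionary produced by the binary branch above.
    @SuppressWarnings("unchecked")
    static Map<String, Integer> readBinary(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Map<String, Integer>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = new TreeMap<>();
        dict.put("apple", 2);
        dict.put("pear", 1);
        System.out.print(new String(toBytes(dict, false), StandardCharsets.UTF_8));
    }
}
```

The text form can be inspected and edited by hand, while the serialized form is only readable by Java code that has the same classes available.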
-
getAttributeIndices
public String getAttributeIndices()
Gets the current range selection.
- Returns:
- a string containing a comma separated list of ranges
-
setAttributeIndices
@OptionMetadata(displayName="Range of attributes to operate on", description="Specify range of attributes to act on. This is a comma separated list of attribute\nindices, with \"first\" and \"last\" valid values.", commandLineParamName="R", commandLineParamSynopsis="-R <range>", displayOrder=4)
public void setAttributeIndices(String rangeList)
Sets which attributes are to be worked on.
- Parameters:
rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1, e.g. first-3,5,6-last
- Throws:
IllegalArgumentException - if an invalid range list is supplied
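The range-list syntax can be illustrated with a small stand-alone parser. This is a sketch of the documented syntax only (1-based indices, "first" and "last", comma-separated ranges), not the code Weka uses; the RangeListSketch class and the numAttributes parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for range lists such as "first-3,5,6-last".
final class RangeListSketch {

    // Resolves "first", "last", or a 1-based number to a zero-based index.
    static int index(String token, int numAttributes) {
        String t = token.trim().toLowerCase();
        int oneBased;
        if (t.equals("first")) {
            oneBased = 1;
        } else if (t.equals("last")) {
            oneBased = numAttributes;
        } else {
            // NumberFormatException is itself an IllegalArgumentException.
            oneBased = Integer.parseInt(t);
        }
        if (oneBased < 1 || oneBased > numAttributes) {
            throw new IllegalArgumentException("Index out of range: " + token);
        }
        return oneBased - 1;
    }

    // Expands the range list into zero-based indices, in the order given.
    static List<Integer> parse(String rangeList, int numAttributes) {
        List<Integer> selected = new ArrayList<>();
        for (String part : rangeList.split(",")) {
            String[] ends = part.split("-");
            if (ends.length < 1 || ends.length > 2) {
                throw new IllegalArgumentException("Invalid range: " + part);
            }
            int lo = index(ends[0], numAttributes);
            int hi = ends.length == 2 ? index(ends[1], numAttributes) : lo;
            if (lo > hi) {
                throw new IllegalArgumentException("Invalid range: " + part);
            }
            for (int i = lo; i <= hi; i++) {
                if (!selected.contains(i)) {
                    selected.add(i);
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // With 8 attributes, attribute 4 (zero-based index 3) is the only one left out.
        System.out.println(parse("first-3,5,6-last", 8));
    }
}
```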
-
getInvertSelection
public boolean getInvertSelection()
Gets whether the supplied columns are to be processed or skipped.
- Returns:
- true if the selection is inverted, i.e. only non-selected attributes are processed
-
setInvertSelection
@OptionMetadata(displayName="Invert selection", description="Set attributes selection mode. If false, only selected attributes in the range will\nbe worked on. If true, only non-selected attributes will be processed", commandLineParamName="V", commandLineParamSynopsis="-V", commandLineParamIsFlag=true, displayOrder=5)
public void setInvertSelection(boolean invert)
Sets whether selected columns should be processed or skipped.
- Parameters:
invert - the new invert setting
-
getLowerCaseTokens
public boolean getLowerCaseTokens()
Gets whether the tokens are to be downcased.
- Returns:
- true if the tokens are to be downcased.
-
setLowerCaseTokens
@OptionMetadata(displayName="Lower case tokens", description="Convert all tokens to lowercase when matching against dictionary entries.", commandLineParamName="L", commandLineParamSynopsis="-L", commandLineParamIsFlag=true, displayOrder=10)
public void setLowerCaseTokens(boolean downCaseTokens)
Sets whether the tokens are to be downcased (does not affect non-alphabetic characters in tokens).
- Parameters:
downCaseTokens - should be true if only lower case tokens are to be formed.
-
setStemmer
@OptionMetadata(displayName="Stemmer to use", description="The stemming algorithm (classname plus parameters) to use.", commandLineParamName="stemmer", commandLineParamSynopsis="-stemmer <spec>", displayOrder=11)
public void setStemmer(Stemmer value)
Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
- Parameters:
value - the configured stemming algorithm, or null
-
getStemmer
public Stemmer getStemmer()
Returns the current stemming algorithm, null if none is used.
- Returns:
- the current stemming algorithm, null if none set
-
setStopwordsHandler
@OptionMetadata(displayName="Stop words handler", description="The stopwords handler to use (default = Null)", commandLineParamName="stopwords-handler", commandLineParamSynopsis="-stopwords-handler <spec>", displayOrder=12)
public void setStopwordsHandler(StopwordsHandler value)
Sets the stopwords handler to use.
- Parameters:
value - the stopwords handler; if null, the Null handler is used
-
getStopwordsHandler
public StopwordsHandler getStopwordsHandler()
Gets the stopwords handler.
- Returns:
- the stopwords handler
-
setTokenizer
@OptionMetadata(displayName="Tokenizer", description="The tokenizing algorithm (classname plus parameters) to use.\n(default: weka.core.tokenizers.WordTokenizer)", commandLineParamName="tokenizer", commandLineParamSynopsis="-tokenizer <spec>", displayOrder=13)
public void setTokenizer(Tokenizer value)
Sets the tokenizer algorithm to use.
- Parameters:
value - the configured tokenizing algorithm
-
getTokenizer
public Tokenizer getTokenizer()
Returns the current tokenizer algorithm.
- Returns:
- the current tokenizer algorithm
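The lower-casing, stopwords and tokenizer settings all act on string values before terms are counted. The sketch below shows one way such a pipeline can fit together in plain Java. It does not use Weka's Tokenizer or StopwordsHandler classes and omits stemming; the TokenPipelineSketch class, the delimiter set and the stopword set are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for tokenizing, lower-casing and stopword removal.
final class TokenPipelineSketch {

    static Map<String, Integer> countTerms(Iterable<String> documents,
                                           boolean lowerCase,
                                           Set<String> stopwords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String doc : documents) {
            // Split on whitespace and a few punctuation marks (assumed delimiters).
            for (String token : doc.split("[\\s.,;:'\"()?!]+")) {
                if (token.isEmpty()) {
                    continue;
                }
                String term = lowerCase ? token.toLowerCase(Locale.ROOT) : token;
                if (stopwords.contains(term)) {
                    continue; // dropped before it reaches the dictionary
                }
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("The cat sat.", "the Cat ran");
        System.out.println(countTerms(docs, true, Collections.singleton("the")));
    }
}
```

In this sketch, lower-casing happens before the stopword check, so "The" and "the" are both removed when lowerCase is true. With lowerCase set to false, "The" would survive a stopword set that contains only "the".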
-
getPeriodicPruning
public long getPeriodicPruning()
Gets the rate at which the dictionary is periodically pruned, in number of instances (0 means no periodic pruning).
- Returns:
- the rate at which the dictionary is periodically pruned
-
setPeriodicPruning
@OptionMetadata(displayName="Periodic pruning rate", description="Prune the dictionary every x instances\n(default = 0 - i.e. no periodic pruning)", commandLineParamName="P", commandLineParamSynopsis="-P <integer>", displayOrder=14)
public void setPeriodicPruning(long newPeriodicPruning)
Sets the rate at which the dictionary is periodically pruned, in number of instances (0 means no periodic pruning).
- Parameters:
newPeriodicPruning - the rate at which the dictionary is periodically pruned
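The -P description ("prune the dictionary every x instances", with 0 meaning no periodic pruning) suggests a simple counter check. The following is a sketch of that reading only; PeriodicPruneSketch and its methods are hypothetical and are not taken from Weka's source.

```java
// Hypothetical illustration of an "every x instances" pruning schedule.
final class PeriodicPruneSketch {

    // True when a pruning pass is due after the given number of instances.
    static boolean shouldPrune(long instancesSeen, long rate) {
        return rate > 0 && instancesSeen > 0 && instancesSeen % rate == 0;
    }

    // Counts how many pruning passes occur while streaming numInstances instances.
    static int countPrunes(long numInstances, long rate) {
        int prunes = 0;
        for (long seen = 1; seen <= numInstances; seen++) {
            if (shouldPrune(seen, rate)) {
                prunes++;
            }
        }
        return prunes;
    }

    public static void main(String[] args) {
        System.out.println(countPrunes(10, 3)); // passes after instances 3, 6 and 9
        System.out.println(countPrunes(10, 0)); // rate 0 disables periodic pruning
    }
}
```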
-
getWordsToKeep
public int getWordsToKeep()
Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
- Returns:
- the target number of words in the output vector (per class if assigned).
-
setWordsToKeep
@OptionMetadata(displayName="Number of words to attempt to keep", description="The number of words (per class if there is a class attribute assigned) to attempt to keep.", commandLineParamName="W", commandLineParamSynopsis="-W <integer>", displayOrder=15)
public void setWordsToKeep(int newWordsToKeep)
Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
- Parameters:
newWordsToKeep - the target number of words in the output vector (per class if assigned).
-
getMinTermFreq
public int getMinTermFreq()
Get the MinTermFreq value.
- Returns:
- the MinTermFreq value.
-
setMinTermFreq
@OptionMetadata(displayName="Minimum term frequency", description="The minimum term frequency to use when pruning the dictionary\n(default = 1).", commandLineParamName="M", commandLineParamSynopsis="-M <integer>", displayOrder=16)
public void setMinTermFreq(int newMinTermFreq)
Set the MinTermFreq value.
- Parameters:
newMinTermFreq - The new MinTermFreq value.
-
getDoNotOperateOnPerClassBasis
public boolean getDoNotOperateOnPerClassBasis()
Get the DoNotOperateOnPerClassBasis value.
- Returns:
- the DoNotOperateOnPerClassBasis value.
-
setDoNotOperateOnPerClassBasis
@OptionMetadata(displayName="Do not operate on a per-class basis", description="If this is set, the maximum number of words and the\nminimum term frequency is not enforced on a per-class\nbasis but based on the documents in all the classes\n(even if a class attribute is set).", commandLineParamName="O", commandLineParamSynopsis="-O", commandLineParamIsFlag=true, displayOrder=17)
public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
Set the DoNotOperateOnPerClassBasis value.
- Parameters:
newDoNotOperateOnPerClassBasis - The new DoNotOperateOnPerClassBasis value.
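The interaction between the words-to-keep target and the minimum term frequency can be sketched as a single cutoff on term counts. This is one plausible reading of "attempt to keep", in which terms tied at the cutoff are all retained, so slightly more than the target may survive. It is not taken from Weka's implementation, the PruneSketch class is hypothetical, and it covers only the case where all documents are treated together (as with -O).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pruning by minimum frequency and a target dictionary size.
final class PruneSketch {

    static Map<String, Integer> prune(Map<String, Integer> counts,
                                      int wordsToKeep, int minTermFreq) {
        List<Integer> sorted = new ArrayList<>(counts.values());
        Collections.sort(sorted); // ascending

        // Start from the minimum frequency; raise the cutoff if too many terms exist.
        int cutoff = minTermFreq;
        if (wordsToKeep > 0 && sorted.size() > wordsToKeep) {
            int countAtTarget = sorted.get(sorted.size() - wordsToKeep);
            cutoff = Math.max(minTermFreq, countAtTarget);
        }

        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= cutoff) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("a", 5);
        counts.put("b", 3);
        counts.put("c", 3);
        counts.put("d", 1);
        // Target of 2 words: "b" and "c" tie at the cutoff, so three terms remain.
        System.out.println(prune(counts, 2, 1));
    }
}
```

A per-class variant would apply the same cutoff logic separately to the counts gathered for each class value.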
-
setKeepDictionarySorted
@OptionMetadata(displayName="Sort dictionary", description="Sort the dictionary alphabetically", commandLineParamName="sort", commandLineParamSynopsis="-sort", commandLineParamIsFlag=true, displayOrder=18)
public void setKeepDictionarySorted(boolean sorted)
Set whether to keep the dictionary sorted alphabetically or not.
- Parameters:
sorted - true to keep the dictionary sorted
-
getKeepDictionarySorted
public boolean getKeepDictionarySorted()
Get whether to keep the dictionary sorted alphabetically or not.
- Returns:
- true if the dictionary is kept sorted
-
getCapabilities
public Capabilities getCapabilities()
Returns the Capabilities of this saver.
- Specified by:
getCapabilities in interface CapabilitiesHandler
- Overrides:
getCapabilities in class AbstractSaver
- Returns:
- the capabilities of this object
-
getFileDescription
public String getFileDescription()
Description copied from class: AbstractFileSaver
To be overridden.
- Specified by:
getFileDescription in interface FileSourcedConverter
- Specified by:
getFileDescription in class AbstractFileSaver
- Returns:
- the file type description.
-
writeIncremental
public void writeIncremental(Instance inst) throws IOException
Description copied from class: AbstractSaver
Method for incremental saving. Standard behaviour: no incremental saving is possible, therefore throw an IOException. An incremental saving process is stopped by calling this method with null.
- Specified by:
writeIncremental in interface Saver
- Overrides:
writeIncremental in class AbstractSaver
- Parameters:
inst - the instance to be saved
- Throws:
IOException - if the instance cannot be written to the specified destination
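The "call with null to stop" convention can be illustrated without Weka. In the sketch below, a hypothetical IncrementalProtocolSketch class accepts plain strings in place of Instance objects, accumulates term counts, and treats null as the end-of-stream signal. It mirrors only the calling pattern described above, not the behaviour of DictionarySaver itself.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a saver; strings replace Weka's Instance objects.
final class IncrementalProtocolSketch {

    private final Map<String, Integer> counts = new TreeMap<>();
    private boolean finished = false;

    // A non-null value is consumed; null signals the end of the stream.
    void writeIncremental(String text) {
        if (finished) {
            throw new IllegalStateException("saving has already been finished");
        }
        if (text == null) {
            finished = true; // a real saver would flush and close its output here
            return;
        }
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    boolean isFinished() {
        return finished;
    }

    Map<String, Integer> counts() {
        return counts;
    }

    public static void main(String[] args) {
        IncrementalProtocolSketch saver = new IncrementalProtocolSketch();
        for (String doc : new String[] {"red green", "green blue"}) {
            saver.writeIncremental(doc);
        }
        saver.writeIncremental(null); // end of stream
        System.out.println(saver.counts());
    }
}
```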
-
writeBatch
public void writeBatch() throws IOException
Description copied from class: AbstractSaver
Writes to a file in batch mode. To be overridden.
- Specified by:
writeBatch in interface Saver
- Specified by:
writeBatch in class AbstractSaver
- Throws:
IOException - if writing is not possible
-
resetOptions
public void resetOptions()
Description copied from class: AbstractFileSaver
Resets the options.
- Overrides:
resetOptions in class AbstractFileSaver
-
resetWriter
public void resetWriter()
Description copied from class: AbstractFileSaver
Sets the writer to null.
- Overrides:
resetWriter in class AbstractFileSaver
-
setDestination
public void setDestination(OutputStream output) throws IOException
Description copied from class: AbstractFileSaver
Sets the destination output stream.
- Specified by:
setDestination in interface Saver
- Overrides:
setDestination in class AbstractFileSaver
- Parameters:
output - the output stream.
- Throws:
IOException - if the destination cannot be set
-
getRevision
public String getRevision()
Description copied from interface: RevisionHandler
Returns the revision string.
- Specified by:
getRevision in interface RevisionHandler
- Returns:
- the revision
-
main
-