public class TfidfTextVectorizerTransform extends Object implements Transform
| Modifier and Type | Field and Description |
|---|---|
| protected org.canova.nlp.metadata.VocabCache | cache |
| static String | MIN_WORD_FREQUENCY |
| protected int | minWordFrequency |
| Map<String,org.apache.commons.math3.util.Pair<Integer,Integer>> | recordLabels |
| static String | STOP_WORDS |
| protected Collection<String> | stopWords |
| static String | TOKENIZER |
| protected org.canova.nlp.tokenization.tokenizerfactory.TokenizerFactory | tokenizerFactory |
| Constructor and Description |
|---|
| TfidfTextVectorizerTransform() |
| Modifier and Type | Method and Description |
|---|---|
| void | collectStatistics(Collection<Writable> vector) Collect statistics from the raw record (first pass). Schema: Writable[0]: go dogs, go 1; Writable[1]: label_A |
| org.nd4j.linalg.api.ndarray.INDArray | convertTextRecordToTFIDFVector(String textRecord) |
| org.canova.nlp.tokenization.tokenizerfactory.TokenizerFactory | createTokenizerFactory(Configuration conf) |
| void | debugPrintVocabList() |
| void | doWithTokens(org.canova.nlp.tokenization.tokenizer.Tokenizer tokenizer) |
| void | evaluateStatistics() Take the dataset statistics learned in the first pass and set up for the transform pass. |
| Integer | getLabelID(String label) |
| int | getNumberOfLabelsSeen() |
| int | getVocabularySize() |
| void | initialize(Configuration conf) |
| void | transform(Collection<Writable> vector) Transform the raw record using the statistics learned in the first pass. Schema: Writable[0]: go dogs, go 1; Writable[1]: label_A |
| protected Counter<String> | wordFrequenciesForSentence(String sentence) |
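The method descriptions above imply a two-pass workflow: `collectStatistics` is called for every record, `evaluateStatistics` is called once, and `transform` is then called for every record again. The following is a minimal sketch of that calling pattern in plain Java, not the Canova implementation. The class `TwoPassSketch`, its simplified signatures (a text string and a label string instead of `Collection<Writable>`), the regex tokenizer, and the choice to emit raw term counts are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the two-pass workflow; names and behavior are assumed.
class TwoPassSketch {
    private final Map<String, Integer> documentFrequency = new LinkedHashMap<>();
    private final Map<String, Integer> labelIds = new LinkedHashMap<>();
    private final Map<String, Integer> vocabIndex = new LinkedHashMap<>();

    // Assumed tokenizer: lower-case, split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // First pass: remember the label and count each distinct token once per record.
    void collectStatistics(String text, String label) {
        labelIds.putIfAbsent(label, labelIds.size());
        for (String token : new LinkedHashSet<>(tokenize(text))) {
            documentFrequency.merge(token, 1, Integer::sum);
        }
    }

    // Between passes: freeze the vocabulary and give every token a column index.
    void evaluateStatistics() {
        vocabIndex.clear();
        for (String token : documentFrequency.keySet()) {
            vocabIndex.put(token, vocabIndex.size());
        }
    }

    // Second pass: map a record onto the frozen vocabulary (raw counts only here).
    int[] transform(String text) {
        int[] counts = new int[vocabIndex.size()];
        for (String token : tokenize(text)) {
            Integer column = vocabIndex.get(token);
            if (column != null) {
                counts[column]++;
            }
        }
        return counts;
    }

    Integer getLabelId(String label) {
        return labelIds.get(label); // null if the label was never seen
    }

    int getVocabularySize() {
        return vocabIndex.size();
    }

    public static void main(String[] args) {
        TwoPassSketch sketch = new TwoPassSketch();
        sketch.collectStatistics("go dogs, go", "label_A");
        sketch.collectStatistics("go cats", "label_B");
        sketch.evaluateStatistics();
        System.out.println(Arrays.toString(sketch.transform("go dogs, go"))); // prints [2, 1, 0]
    }
}
```

The sketch keeps weighting out of `transform` so that the pass structure stays visible; the point is only that no record can be vectorized until every record has been seen once.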
protected org.canova.nlp.tokenization.tokenizerfactory.TokenizerFactory tokenizerFactory
protected int minWordFrequency
public static final String MIN_WORD_FREQUENCY
public static final String STOP_WORDS
public static final String TOKENIZER
protected Collection<String> stopWords
protected org.canova.nlp.metadata.VocabCache cache
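The names `minWordFrequency` and `stopWords` suggest that they limit which tokens enter the vocabulary, although this page does not say how they are applied. As a rough sketch under that assumption, the plain-Java example below drops stop words while counting and then discards tokens seen fewer than `minWordFrequency` times. `VocabPruningSketch` and `pruneVocabulary` are hypothetical names, not Canova APIs.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of vocabulary pruning; the real semantics may differ.
class VocabPruningSketch {

    // Counts tokens, skipping stop words, then keeps only tokens that occur
    // at least minWordFrequency times (assumed meaning of the field).
    static Map<String, Integer> pruneVocabulary(List<String> tokens,
                                                Collection<String> stopWords,
                                                int minWordFrequency) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        // Removing from the values view also removes the matching map entries.
        counts.values().removeIf(count -> count < minWordFrequency);
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("go", "dogs", "go", "the", "the", "the");
        Collection<String> stopWords = new HashSet<>(Arrays.asList("the"));
        System.out.println(pruneVocabulary(tokens, stopWords, 2)); // prints {go=2}
    }
}
```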
public int getVocabularySize()
public void debugPrintVocabList()
public void doWithTokens(org.canova.nlp.tokenization.tokenizer.Tokenizer tokenizer)
public org.canova.nlp.tokenization.tokenizerfactory.TokenizerFactory createTokenizerFactory(Configuration conf)
public void initialize(Configuration conf)
public org.nd4j.linalg.api.ndarray.INDArray convertTextRecordToTFIDFVector(String textRecord)
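This page does not state which TF-IDF variant `convertTextRecordToTFIDFVector` uses. As an illustration only, the sketch below assumes the common form in which term frequency is the raw count and inverse document frequency is `ln(N / df)`. It returns a `double[]` rather than an `INDArray` to avoid an ND4J dependency, and `TfidfSketch` and `tfidfVector` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical TF-IDF weighting; the weighting used by the real class is not documented here.
class TfidfSketch {

    // Builds one vector entry per vocabulary term, in the map's iteration order.
    // tf = raw count in this record, idf = ln(totalDocuments / documentFrequency).
    static double[] tfidfVector(List<String> tokens,
                                Map<String, Integer> documentFrequency,
                                int totalDocuments) {
        List<String> vocab = new ArrayList<>(documentFrequency.keySet());
        double[] vector = new double[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            String term = vocab.get(i);
            int tf = Collections.frequency(tokens, term);
            double idf = Math.log((double) totalDocuments / documentFrequency.get(term));
            vector[i] = tf * idf;
        }
        return vector;
    }

    public static void main(String[] args) {
        // Assumed corpus statistics: 2 documents, "go" in both, "dogs" in one.
        Map<String, Integer> documentFrequency = new LinkedHashMap<>();
        documentFrequency.put("go", 2);
        documentFrequency.put("dogs", 1);
        double[] vector = tfidfVector(Arrays.asList("go", "dogs", "go"), documentFrequency, 2);
        System.out.println(Arrays.toString(vector));
    }
}
```

With these inputs the entry for "go" is 0, because a term that appears in every document has an IDF of ln(1), and the entry for "dogs" is ln(2), roughly 0.693.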
public void collectStatistics(Collection<Writable> vector)
Specified by: collectStatistics in interface Transform
public int getNumberOfLabelsSeen()
public void transform(Collection<Writable> vector)
public void evaluateStatistics()
Specified by: evaluateStatistics in interface Transform
Copyright © 2016. All rights reserved.