- save(ObjectOutputStream) - Method in class org.apdplat.word.dictionary.impl.AhoCorasickDoubleArrayTrie
-
Save
- scoreImpl(List<Word>, List<Word>) - Method in class org.apdplat.word.analysis.CosineTextSimilarity
-
Similarity measure: cosine similarity
Principle of the cosine of the angle between two vectors:
Vector a=(x1,y1), vector b=(x2,y2)
similarity=a.b/(|a|*|b|)
a.b=x1x2+y1y2
|a|=sqrt((x1)^2+(y1)^2), |b|=sqrt((x2)^2+(y2)^2)
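The two-dimensional formula above extends to word vectors of any length. The following is a minimal sketch, not the library's implementation: it assumes each text is already a list of word strings and that the vector components are plain word counts. The class name `CosineSketch` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CosineSketch {
    // Counts how often each word occurs (assumption: components are raw counts).
    static Map<String, Integer> frequencies(List<String> words) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            freq.merge(w, 1, Integer::sum);
        }
        return freq;
    }

    // similarity = a.b / (|a| * |b|)
    static double cosine(List<String> words1, List<String> words2) {
        Map<String, Integer> a = frequencies(words1);
        Map<String, Integer> b = frequencies(words2);
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int x = e.getValue();
            int y = b.getOrDefault(e.getKey(), 0);
            dot += (double) x * y;
        }
        double normA = norm(a);
        double normB = norm(b);
        // Assumption: an empty vector is treated as having similarity 0.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    // |v| = sqrt(sum of squared components)
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int x : v.values()) {
            sum += (double) x * x;
        }
        return Math.sqrt(sum);
    }
}
```

Identical word lists score 1, and lists with no shared word score 0.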
- scoreImpl(List<Word>, List<Word>) - Method in class org.apdplat.word.analysis.EditDistanceTextSimilarity
-
Computes the similarity score
- scoreImpl(List<Word>, List<Word>) - Method in class org.apdplat.word.analysis.EuclideanDistanceTextSimilarity
-
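Similarity measure: Euclidean distance
Principle of the Euclidean distance:
Let A(x1, y1) and B(x2, y2) be any two points in the plane
The distance between the two points is dist(A,B)=sqrt((x1-x2)^2+(y1-y2)^2)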
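As an illustration of the distance formula above, here is a small sketch generalized from two coordinates to n. The class `EuclideanSketch` is hypothetical, and the sketch only computes the distance; how the library turns a distance into a similarity score is not stated in this index.

```java
final class EuclideanSketch {
    // dist(A,B) = sqrt(sum over i of (a[i] - b[i])^2)
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("points must have the same dimension");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

For A(0, 0) and B(3, 4) the distance is sqrt(9 + 16) = 5.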
- scoreImpl(List<Word>, List<Word>) - Method in class org.apdplat.word.analysis.JaccardTextSimilarity
-
Similarity measure: Jaccard similarity coefficient
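This entry only names the measure. The standard definition of the Jaccard coefficient is the size of the intersection of two sets divided by the size of their union. The sketch below applies that definition to sets of word strings; the class `JaccardSketch` is hypothetical and the handling of two empty sets is an assumption, so it may differ from what the library does.

```java
import java.util.HashSet;
import java.util.Set;

final class JaccardSketch {
    // J(A,B) = |A intersect B| / |A union B|
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        // Assumption: two empty sets are scored 0 rather than dividing by zero.
        if (union.isEmpty()) {
            return 0.0;
        }
        return intersection.size() / (double) union.size();
    }
}
```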
- scoreImpl(List<Word>, List<Word>) - Method in class org.apdplat.word.analysis.JaroDistanceTextSimilarity
-
Computes the similarity score
- scoreImpl(List<Word>, List<Word>) - Method in class org.apdplat.word.analysis.JaroWinklerDistanceTextSimilarity
-
Computes the similarity score
- scoreImpl(List<Word>, List<Word>) - Method in class org.apdplat.word.analysis.ManhattanDistanceTextSimilarity
-
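Similarity measure: Manhattan distance
Principle of the Manhattan distance:
Let A(x1, y1) and B(x2, y2) be any two points in the plane
The distance between the two points is dist(A,B)=|x1-x2|+|y1-y2|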
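The Manhattan distance formula above can be sketched for n coordinates as follows. The class `ManhattanSketch` is hypothetical and computes only the distance, not the library's final similarity score.

```java
final class ManhattanSketch {
    // dist(A,B) = sum over i of |a[i] - b[i]|
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("points must have the same dimension");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }
}
```

For A(1, 2) and B(4, 6) the distance is |1-4| + |2-6| = 7.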
- scoreImpl(List<Word>, List<Word>) - Method in class org.apdplat.word.analysis.SimHashPlusHammingDistanceTextSimilarity
-
Computes the similarity score
- scoreImpl(List<Word>, List<Word>) - Method in class org.apdplat.word.analysis.SimpleTextSimilarity
-
Computes the similarity score
- scoreImpl(List<Word>, List<Word>) - Method in class org.apdplat.word.analysis.SørensenDiceCoefficientTextSimilarity
-
Computes the similarity score
- scoreImpl(List<Word>, List<Word>) - Method in class org.apdplat.word.analysis.TextSimilarity
-
Computes the similarity score
- seg(String, boolean, char...) - Static method in class org.apdplat.word.recognition.Punctuation
-
Splits a text at punctuation marks into multiple pieces that contain no punctuation
The punctuation marks to keep can be specified
- seg(String) - Method in class org.apdplat.word.segmentation.impl.AbstractSegmentation
-
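Default segmentation algorithm:
1. Split the text to be segmented at punctuation marks
2. Segment each of the resulting pieces
3. Combine the segmentation results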
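The three steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the code of AbstractSegmentation: the class `PunctuationSplitSketch` is hypothetical, punctuation is detected with the Unicode category `\p{P}` and discarded, and `whitespaceSegmenter` is a hypothetical stand-in for a real `segImpl`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class PunctuationSplitSketch {
    static List<String> seg(String text, Function<String, List<String>> segImpl) {
        List<String> result = new ArrayList<>();
        // Step 1: split at runs of Unicode punctuation (ASCII and CJK marks).
        for (String piece : text.split("\\p{P}+")) {
            String trimmed = piece.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty pieces, e.g. before leading punctuation
            }
            // Step 2: segment each piece; step 3: append to the combined result.
            result.addAll(segImpl.apply(trimmed));
        }
        return result;
    }

    // Hypothetical stand-in for a real segmentation algorithm: split on whitespace.
    static List<String> whitespaceSegmenter(String piece) {
        return Arrays.asList(piece.split("\\s+"));
    }
}
```

Passing the per-piece segmenter as a function mirrors the split between seg and segImpl listed in this index, where subclasses supply only the second step.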
- seg(String) - Method in class org.apdplat.word.segmentation.impl.PureEnglish
-
- seg(String) - Method in interface org.apdplat.word.segmentation.Segmentation
-
Segments text into words
- seg(String) - Static method in class org.apdplat.word.segmentation.SegmentationContrast
-
- seg(File, File, boolean, SegmentationAlgorithm) - Static method in class org.apdplat.word.util.Utils
-
Segments a file
- seg(File, File, boolean, SegmentationAlgorithm, Utils.FileSegmentationCallback) - Static method in class org.apdplat.word.util.Utils
-
Segments a file
- seg(String) - Method in class org.apdplat.word.WordFrequencyStatistics
-
Segments text
- seg(File, File) - Method in class org.apdplat.word.WordFrequencyStatistics
-
Segments a file
- seg(String, SegmentationAlgorithm) - Static method in class org.apdplat.word.WordSegmenter
-
Segments text and removes stop words
A different segmentation algorithm can be specified
- seg(String) - Static method in class org.apdplat.word.WordSegmenter
-
Segments text and removes stop words
Uses the bidirectional maximum matching algorithm
- seg(File, File, SegmentationAlgorithm) - Static method in class org.apdplat.word.WordSegmenter
-
Segments a file and removes stop words
A different segmentation algorithm can be specified
- seg(File, File) - Static method in class org.apdplat.word.WordSegmenter
-
Segments a file and removes stop words
Uses the bidirectional maximum matching algorithm
- segImpl(String) - Method in class org.apdplat.word.segmentation.impl.AbstractSegmentation
-
The concrete segmentation implementation, left to subclasses
- segImpl(String) - Method in class org.apdplat.word.segmentation.impl.BidirectionalMaximumMatching
-
- segImpl(String) - Method in class org.apdplat.word.segmentation.impl.BidirectionalMaximumMinimumMatching
-
- segImpl(String) - Method in class org.apdplat.word.segmentation.impl.BidirectionalMinimumMatching
-
- segImpl(String) - Method in class org.apdplat.word.segmentation.impl.FullSegmentation
-
- segImpl(String) - Method in class org.apdplat.word.segmentation.impl.MaximumMatching
-
- segImpl(String) - Method in class org.apdplat.word.segmentation.impl.MaxNgramScore
-
- segImpl(String) - Method in class org.apdplat.word.segmentation.impl.MinimalWordCount
-
- segImpl(String) - Method in class org.apdplat.word.segmentation.impl.MinimumMatching
-
- segImpl(String) - Method in class org.apdplat.word.segmentation.impl.ReverseMaximumMatching
-
- segImpl(String) - Method in class org.apdplat.word.segmentation.impl.ReverseMinimumMatching
-
- Segmentation - Interface in org.apdplat.word.segmentation
-
Chinese word segmentation interface
- SegmentationAlgorithm - Enum in org.apdplat.word.segmentation
-
Chinese word segmentation algorithm
- SegmentationContrast - Class in org.apdplat.word.segmentation
-
Compares the segmentation results of the various segmentation algorithms
- SegmentationContrast() - Constructor for class org.apdplat.word.segmentation.SegmentationContrast
-
- SegmentationFactory - Class in org.apdplat.word.segmentation
-
Chinese word segmentation factory class
Returns the segmentation implementation for the specified segmentation algorithm
- segWithStopWords(String, SegmentationAlgorithm) - Static method in class org.apdplat.word.WordSegmenter
-
Segments text and keeps stop words
A different segmentation algorithm can be specified
- segWithStopWords(String) - Static method in class org.apdplat.word.WordSegmenter
-
Segments text and keeps stop words
Uses the bidirectional maximum matching algorithm
- segWithStopWords(File, File, SegmentationAlgorithm) - Static method in class org.apdplat.word.WordSegmenter
-
Segments a file and keeps stop words
A different segmentation algorithm can be specified
- segWithStopWords(File, File) - Static method in class org.apdplat.word.WordSegmenter
-
Segments a file and keeps stop words
Uses the bidirectional maximum matching algorithm
- set(float) - Method in class org.apdplat.word.util.AtomicFloat
-
- set(String, String) - Static method in class org.apdplat.word.util.WordConfTools
-
- setAcronymPinYin(String) - Method in class org.apdplat.word.segmentation.Word
-
- setAntonym(List<Word>) - Method in class org.apdplat.word.segmentation.Word
-
- setDes(String) - Method in class org.apdplat.word.segmentation.PartOfSpeech
-
- setDictionary(Dictionary) - Method in interface org.apdplat.word.segmentation.DictionaryBasedSegmentation
-
Sets the dictionary operation interface used by the dictionary-based Chinese word segmentation interface
- setDictionary(Dictionary) - Method in class org.apdplat.word.segmentation.impl.AbstractSegmentation
-
Sets the dictionary operation interface used by the dictionary-based Chinese word segmentation interface
- setFrequency(int) - Method in class org.apdplat.word.segmentation.Word
-
- setFullPinYin(String) - Method in class org.apdplat.word.segmentation.Word
-
- setHashBitCount(int) - Method in class org.apdplat.word.analysis.SimHashPlusHammingDistanceTextSimilarity
-
- setLimit(int) - Method in class org.apdplat.word.vector.Distance
-
- setPartOfSpeech(PartOfSpeech) - Method in class org.apdplat.word.segmentation.Word
-
- setPerfectCharCount(int) - Method in class org.apdplat.word.corpus.EvaluationResult
-
- setPerfectLineCount(int) - Method in class org.apdplat.word.corpus.EvaluationResult
-
- setPos(String) - Method in class org.apdplat.word.segmentation.PartOfSpeech
-
- setRemoveStopWord(boolean) - Method in class org.apdplat.word.WordFrequencyStatistics
-
Sets whether stop words are removed
- setResultPath(String) - Method in class org.apdplat.word.WordFrequencyStatistics
-
Sets the path where the word frequency statistics are saved
- setScore(Double) - Method in class org.apdplat.word.analysis.Hit
-
- setSegmentationAlgorithm(SegmentationAlgorithm) - Method in class org.apdplat.word.analysis.TextSimilarity
-
- setSegmentationAlgorithm(SegmentationAlgorithm) - Method in class org.apdplat.word.corpus.EvaluationResult
-
- setSegmentationAlgorithm(SegmentationAlgorithm) - Method in class org.apdplat.word.WordFrequencyStatistics
-
Sets the segmentation algorithm
- setSegSpeed(float) - Method in class org.apdplat.word.corpus.EvaluationResult
-
- setSynonym(List<Word>) - Method in class org.apdplat.word.segmentation.Word
-
- setText(String) - Method in class org.apdplat.word.analysis.Hit
-
- setText(String) - Method in class org.apdplat.word.segmentation.Word
-
- setTextSimilarity(TextSimilarity) - Method in class org.apdplat.word.vector.Distance
-
- setTotalCharCount(int) - Method in class org.apdplat.word.corpus.EvaluationResult
-
- setTotalLineCount(int) - Method in class org.apdplat.word.corpus.EvaluationResult
-
- setWeight(Float) - Method in class org.apdplat.word.segmentation.Word
-
- setWrongCharCount(int) - Method in class org.apdplat.word.corpus.EvaluationResult
-
- setWrongLineCount(int) - Method in class org.apdplat.word.corpus.EvaluationResult
-
- shorterText - Variable in class org.apdplat.word.analysis.JaroDistanceTextSimilarity
-
- show(char) - Method in class org.apdplat.word.dictionary.impl.DictionaryTrie
-
- show() - Method in class org.apdplat.word.dictionary.impl.DictionaryTrie
-
- show(char) - Method in class org.apdplat.word.util.GenericTrie
-
- show() - Method in class org.apdplat.word.util.GenericTrie
-
- showConflict() - Method in class org.apdplat.word.dictionary.impl.DictionaryTrie
-
Reports statistics on root-node conflicts and on the utilization of the preallocated array space
- showConflict() - Method in class org.apdplat.word.util.GenericTrie
-
Reports statistics on root-node conflicts and on the utilization of the preallocated array space
- showUsage() - Static method in class org.apdplat.word.segmentation.SegmentationContrast
-
- SimHashPlusHammingDistanceTextSimilarity - Class in org.apdplat.word.analysis
-
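Text similarity computation
Measure: SimHash + Hamming distance
SimHash first maps texts of different lengths to equal-length texts, and then the Hamming distance between the equal-length texts is computed
The biggest difference between simhash and an ordinary hash:
An ordinary hash maps texts that differ in only one byte to two completely different hash results
simhash maps similar texts to similar hash results
The Hamming distance is named after the American mathematician Richard Wesley Hamming
The Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ
In other words, it is the number of characters that must be substituted to turn one string into the other
For example:
The Hamming distance between 1011101 and 1001001 is 2
The Hamming distance between 2143896 and 2233796 is 3
The Hamming distance between toned and roses is 3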
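The Hamming distance definition above can be checked with a few lines of code. This sketch covers only the distance between two equal-length strings, not the SimHash step; the class `HammingSketch` is hypothetical.

```java
final class HammingSketch {
    // Counts the positions at which two equal-length strings differ.
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }
}
```

Applied to the three example pairs, it returns 2, 3 and 3 respectively.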
- SimHashPlusHammingDistanceTextSimilarity() - Constructor for class org.apdplat.word.analysis.SimHashPlusHammingDistanceTextSimilarity
-
- SimHashPlusHammingDistanceTextSimilarity(int) - Constructor for class org.apdplat.word.analysis.SimHashPlusHammingDistanceTextSimilarity
-
- Similarity - Interface in org.apdplat.word.analysis
-
Similarity
- SimilarityRanker - Interface in org.apdplat.word.analysis
-
Similarity ranking
- similarScore(String, String) - Method in interface org.apdplat.word.analysis.Similarity
-
Similarity score of object 1 and object 2
- similarScore(List<Word>, List<Word>) - Method in interface org.apdplat.word.analysis.Similarity
-
Similarity score of word list 1 and word list 2
- similarScore(HashMap<Word, Float>, HashMap<Word, Float>) - Method in interface org.apdplat.word.analysis.Similarity
-
Similarity score of word-to-weight map 1 and word-to-weight map 2
- similarScore(Map<String, Float>, Map<String, Float>) - Method in interface org.apdplat.word.analysis.Similarity
-
Similarity score of word-to-weight map 1 and word-to-weight map 2
- similarScore(String, String) - Method in class org.apdplat.word.analysis.TextSimilarity
-
Similarity score of text 1 and text 2
- similarScore(List<Word>, List<Word>) - Method in class org.apdplat.word.analysis.TextSimilarity
-
Similarity score of word list 1 and word list 2
- SimpleTextSimilarity - Class in org.apdplat.word.analysis
-
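Text similarity computation
Measure: simple shared words; similarity is evaluated by dividing the total character count of the words shared by the two documents by the character count of the longer document
Algorithm steps:
1. Segment the texts into words
2. Take the intersection (deduplicated) and sum the character counts of all words in it to get intersectionLength
3. Get the character count of the longer text: Math.max(words1Length, words2Length)
4. Divide the value from step 2 by the value from step 3: intersectionLength/(double)Math.max(words1Length, words2Length)
Complete formula:
double score = intersectionLength/(double)Math.max(words1Length, words2Length);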
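Steps 2 to 4 above can be sketched as follows, starting from texts that are already segmented into word strings. The class `SimpleSharedWordsSketch` is hypothetical. Two points are assumptions rather than facts from this index: words1Length and words2Length are taken to be the summed character counts of all words in each list, and two empty lists score 0.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SimpleSharedWordsSketch {
    static double score(List<String> words1, List<String> words2) {
        // Step 2: deduplicated intersection and its total character count.
        Set<String> shared = new HashSet<>(words1);
        shared.retainAll(new HashSet<>(words2));
        int intersectionLength = totalLength(shared);
        // Step 3: character count of the longer text (assumed: sum over all words).
        int longest = Math.max(totalLength(words1), totalLength(words2));
        if (longest == 0) {
            return 0.0; // assumption: avoid 0/0 for two empty texts
        }
        // Step 4: the cast forces floating-point division.
        return intersectionLength / (double) longest;
    }

    static int totalLength(Iterable<String> words) {
        int total = 0;
        for (String w : words) {
            total += w.length();
        }
        return total;
    }
}
```

For the lists [i, love, java] and [i, love, tea], the shared words have 5 characters and the longer text has 9, giving a score of 5/9.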
- SimpleTextSimilarity() - Constructor for class org.apdplat.word.analysis.SimpleTextSimilarity
-
- size() - Method in class org.apdplat.word.analysis.Hits
-
- size - Variable in class org.apdplat.word.dictionary.impl.AhoCorasickDoubleArrayTrie
-
the size of base and check array
- size() - Method in class org.apdplat.word.dictionary.impl.AhoCorasickDoubleArrayTrie
-
Get the size of the keywords
- split(Word) - Static method in class org.apdplat.word.segmentation.WordRefiner
-
Splits one word into several; returns null if the word cannot be split
- StopWord - Class in org.apdplat.word.recognition
-
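Stop word detection
The stop word dictionary is specified through system properties and the configuration file (stopwords.path)
Option 1, programmatically (high priority):
WordConfTools.set("stopwords.path", "classpath:stopwords.txt");
Option 2, Java virtual machine startup parameter (medium priority):
java -Dstopwords.path=classpath:stopwords.txt
Option 3, configuration file (low priority):
Specify the setting in word.conf on the classpath
stopwords.path=classpath:stopwords.txt
If nothing is specified, the default stop word dictionary file is used (stopwords.txt on the classpath)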
- StopWord() - Constructor for class org.apdplat.word.recognition.StopWord
-
- SynonymTagging - Class in org.apdplat.word.tagging
-
Synonym tagging
- SørensenDiceCoefficientTextSimilarity - Class in org.apdplat.word.analysis
-
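Text similarity computation
Measure: Sørensen–Dice coefficient; similarity is evaluated by dividing twice the size of the intersection of two sets by the sum of the sizes of the two sets
Algorithm steps:
1. Segment the texts into words
2. Take the intersection (deduplicated) and count its distinct words to get intersectionSize
3. The sizes of the two sets are set1Size and set2Size
4. Similarity score = 2*intersectionSize/(set1Size+set2Size)
Complete formula:
double score = 2*intersectionSize/(double)(set1Size+set2Size);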
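The steps above can be sketched for two sets of word strings as follows. The class `DiceSketch` is hypothetical, and scoring two empty sets as 0 is an assumption. The division is done in floating point, because with integer operands the expression would truncate to 0 or 1.

```java
import java.util.HashSet;
import java.util.Set;

final class DiceSketch {
    // score = 2 * |A intersect B| / (|A| + |B|)
    static double dice(Set<String> set1, Set<String> set2) {
        Set<String> intersection = new HashSet<>(set1);
        intersection.retainAll(set2);
        int intersectionSize = intersection.size();
        int total = set1.size() + set2.size();
        if (total == 0) {
            return 0.0; // assumption: avoid 0/0 for two empty sets
        }
        return 2 * intersectionSize / (double) total;
    }
}
```

For {a, b, c} and {b, c, d}, the intersection has 2 words, so the score is 2*2/(3+3) = 4/6.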
- SørensenDiceCoefficientTextSimilarity() - Constructor for class org.apdplat.word.analysis.SørensenDiceCoefficientTextSimilarity
-