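Goals
- Implement a string similarity library in Go
- Achieve high accuracy on Chinese text (many existing libraries by non-Chinese authors handle Chinese poorly)
- Integrate several similarity algorithms (edit distance, Hamming distance, Dice's coefficient)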
Levenshtein edit distance
- https://zhuanlan.zhihu.com/p/91667128
- https://www.jianshu.com/p/a617d20162cf
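(Both references above build a full matrix. After working through the algorithm, I realized the matrix is unnecessary: caching a single row (the x axis) plus one diagonal value is enough, and the result is the same.)
- http://richardminerich.com/tag/damerau-levenshtein-distance/ (supplementary)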
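The single-row idea noted in this section can be sketched roughly as follows. This is an illustrative sketch, not the library's actual code: the names `Levenshtein` and `min3` are assumptions, and the strings are converted to `[]rune` so that each Chinese character counts as one position rather than three UTF-8 bytes.

```go
package main

// Levenshtein returns the edit distance between a and b, compared rune by rune.
// It keeps one row of the DP table plus a single diagonal value instead of a
// full matrix. (Hypothetical name, for illustration only.)
func Levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	if len(ra) == 0 {
		return len(rb)
	}
	if len(rb) == 0 {
		return len(ra)
	}

	// row[j] is the distance between the prefix of ra handled so far and rb[:j].
	row := make([]int, len(rb)+1)
	for j := range row {
		row[j] = j
	}

	for i := 1; i <= len(ra); i++ {
		diag := row[0] // cell (i-1, 0) before it is overwritten
		row[0] = i
		for j := 1; j <= len(rb); j++ {
			up := row[j] // cell (i-1, j), still holding the previous row's value
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution (or match)
			row[j] = min3(up+1, row[j-1]+1, diag+cost)
			diag = up // becomes cell (i-1, j) for the next column
		}
	}
	return row[len(rb)]
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}
```

Memory use is proportional to the rune length of `b` rather than to the product of both lengths. Calling `Levenshtein("中国人", "中国")` counts one deleted character, whereas a byte-based comparison would count three.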
Hamming
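The issue lists Hamming without further detail. As a minimal sketch, assuming the comparison should be done per rune (in line with the goal of handling Chinese well), it might look like this. The names `Hamming` and `ErrLengthMismatch` are hypothetical, and returning an error for unequal lengths is one possible design choice, not something the issue specifies.

```go
package main

import "errors"

// ErrLengthMismatch is returned when the inputs have different rune counts.
// (Hypothetical name, for illustration only.)
var ErrLengthMismatch = errors.New("hamming: inputs differ in rune length")

// Hamming counts the positions at which a and b differ, comparing runes
// rather than bytes so a multi-byte character counts as one position.
func Hamming(a, b string) (int, error) {
	ra, rb := []rune(a), []rune(b)
	if len(ra) != len(rb) {
		return 0, ErrLengthMismatch
	}
	d := 0
	for i := range ra {
		if ra[i] != rb[i] {
			d++
		}
	}
	return d, nil
}
```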
Dice's coefficient
- https://blog.csdn.net/gjk0223/article/details/2314844
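Every n consecutive characters form one element of the set. This is easy to overlook: n should be configurable, yet many open-source projects ignore it. The formula in the original paper is 2 * (intersection of a and b) / (len(a) + len(b)). n defaults to 2, but 2 does not work well for Chinese.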
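A rough sketch of Dice's coefficient with a configurable n, following the formula above. The names `DiceCoefficient` and `ngrams` are hypothetical. Two choices here are assumptions rather than anything the issue states: repeated n-grams are counted as a multiset, and inputs too short to produce any n-gram fall back to an equality check.

```go
package main

// ngrams counts the overlapping n-rune substrings of s and returns the counts
// together with the total number of n-grams. (Hypothetical helper.)
func ngrams(s string, n int) (map[string]int, int) {
	r := []rune(s)
	counts := make(map[string]int)
	total := 0
	for i := 0; i+n <= len(r); i++ {
		counts[string(r[i:i+n])]++
		total++
	}
	return counts, total
}

// DiceCoefficient returns 2*|A∩B| / (|A|+|B|), where A and B are the n-gram
// multisets of a and b. A value of n below 1 falls back to the default of 2.
func DiceCoefficient(a, b string, n int) float64 {
	if n < 1 {
		n = 2
	}
	ga, la := ngrams(a, n)
	gb, lb := ngrams(b, n)
	if la+lb == 0 {
		// Assumption: with no n-grams on either side, compare the strings directly.
		if a == b {
			return 1
		}
		return 0
	}
	common := 0
	for g, ca := range ga {
		if cb, ok := gb[g]; ok {
			if ca < cb {
				common += ca
			} else {
				common += cb
			}
		}
	}
	return 2 * float64(common) / float64(la+lb)
}
```

With n = 1 each Chinese character is its own element, so `DiceCoefficient("中国", "中国人", 1)` shares two of five elements, while n = 2 leaves only one shared bigram out of three.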
Jaro
- https://www.jianshu.com/p/a4af202cb702 (good)
- https://blog.csdn.net/asty9000/article/details/81348857
TODO
- Damerau-Levenshtein - distance & normalized
- Jaro and Jaro-Winkler - this implementation of Jaro-Winkler does not limit the common prefix length
Additional notes
Reference for API design (naming)
Reference for which algorithm names to use