How to Check for Duplicate Text in Python: SimHash Source Code for Text Similarity

Published: 2022-02-11 14:57:25 | Source: 亿速云 | Views: 375 | Author: iii | Category: Development

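This article shares Python source code that uses SimHash to calculate text similarity for duplicate checking. Most readers are probably not yet very familiar with this topic, so the article is offered as a reference. Let's take a look.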

Use case:

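1. Compute the SimHash value of each text and the Hamming distance between the two values.
2. SimHash is suited to comparing longer texts (more than roughly 300 to 500 characters); the shorter the text, the higher the rate of false judgments.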
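The Hamming distance mentioned in point 1 is simply the number of bit positions at which two fingerprints differ. A minimal sketch follows, assuming the fingerprints are either Python integers or equal-length strings of '0'/'1' characters; the function names are hypothetical and chosen only for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which two integer fingerprints differ."""
    # XOR leaves a 1 exactly where the two values disagree.
    return bin(a ^ b).count("1")


def hamming_distance_str(a: str, b: str) -> int:
    """Same distance for two equal-length strings of '0'/'1' characters."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print(hamming_distance(0b1011, 0b1001))
    print(hamming_distance_str("1011", "1001"))
```

Two texts are then treated as similar when this distance falls at or below a chosen threshold.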

Python implementation:

The code is as follows:

# -*- encoding:utf-8 -*-
import math

import jieba
import jieba.analyse


class SimHash(object):

    def getBinStr(self, source):
        # Hash a keyword into a 64-character string of '0'/'1'.
        if source == "":
            # Return a string (not the int 0) so callers can iterate over it.
            return "0" * 64
        x = ord(source[0]) << 7
        m = 1000003
        mask = 2 ** 128 - 1
        for c in source:
            x = ((x * m) ^ ord(c)) & mask
        x ^= len(source)
        # Keep the low 64 bits, left-padded with zeros.
        return bin(x).replace('0b', '').zfill(64)[-64:]

    def getWeight(self, source):
        return ord(source)

    def unwrap_weight(self, arr):
        ret = ""
        for item in arr:
            tmp = 0
            if int(item) > 0:
                tmp = 1
            ret += str(tmp)
        return ret

    def sim_hash(self, rawstr):
        seg = jieba.cut(rawstr)
        keywords = jieba.analyse.extract_tags("|".join(seg), topK=100, withWeight=True)
        ret = []
        for keyword, weight in keywords:
            binstr = self.getBinStr(keyword)
            # Round the TF-IDF weight up to an integer.
            weight = int(math.ceil(weight))
            # A '1' bit contributes +weight, a '0' bit contributes -weight.
            keylist = [weight if c == "1" else -weight for c in binstr]
            ret.append(keylist)
        if not ret:
            # No keywords were extracted (e.g. empty input).
            return "0" * 64
        # Dimensionality reduction: sum each bit column; positive -> '1', otherwise '0'.
        rows = len(ret)
        cols = len(ret[0])
        result = []
        for i in range(cols):
            tmp = 0
            for j in range(rows):
                tmp += int(ret[j][i])
            result.append("1" if tmp > 0 else "0")
        return "".join(result)

    def distince(self, hashstr1, hashstr2):
        # Hamming distance: number of positions where the two hashes differ.
        length = 0
        for index, char in enumerate(hashstr1):
            if char != hashstr2[index]:
                length += 1
        return length


if __name__ == "__main__":
    simhash = SimHash()
    str1 = '咱哥俩谁跟谁啊'
    str2 = '咱们俩谁跟谁啊'
    hash2 = simhash.sim_hash(str1)
    print(hash2)
    hash3 = simhash.sim_hash(str2)
    distince = simhash.distince(hash2, hash3)
    value = 5  # similarity threshold on the Hamming distance
    print("simhash distance:", distince, "threshold:", value, "similar:", distince <= value)
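For readers who want to try the idea without installing jieba, here is a rough, dependency-free sketch of the same weighted-sum-and-sign step. It is not the implementation above: it assumes character bigrams in place of jieba keywords, frequency counts in place of TF-IDF weights, and MD5 (an arbitrary choice) as the per-token hash. The names `token_hash` and `simhash` are hypothetical.

```python
import hashlib
from collections import Counter


def token_hash(token: str, bits: int = 64) -> int:
    """Stable hash of a token, truncated to `bits` bits (MD5 is an assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(text: str, bits: int = 64, n: int = 2) -> int:
    """SimHash fingerprint from character n-grams weighted by frequency."""
    if not text:
        return 0
    # Texts shorter than n yield a single gram: the text itself.
    grams = Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 1)))
    totals = [0] * bits
    for gram, weight in grams.items():
        h = token_hash(gram, bits)
        for i in range(bits):
            # A set bit votes +weight, a clear bit votes -weight.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


if __name__ == "__main__":
    a = simhash('咱哥俩谁跟谁啊')
    b = simhash('咱们俩谁跟谁啊')
    # Hamming distance between the two integer fingerprints.
    print(bin(a ^ b).count("1"))
```

Because the tokens and weights differ, the distances this sketch produces are not comparable with those of the jieba-based class, and any threshold would need to be tuned separately.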

