strutil provides a collection of string metrics for calculating string similarity as well as other string utility functions.
Full documentation can be found at https://pkg.go.dev/github.com/adrg/strutil.
Install the package:

    go get github.com/adrg/strutil

The following string metrics are included:
- Hamming
- Levenshtein
- Jaro
- Jaro-Winkler
- Smith-Waterman-Gotoh
- Sorensen-Dice
- Jaccard
- Overlap Coefficient
The package defines the StringMetric interface, which is implemented by all the string metrics. The interface is used with the Similarity function, which calculates the similarity between the specified strings, using the provided string metric.
    type StringMetric interface {
        Compare(a, b string) float64
    }

    func Similarity(a, b string, metric StringMetric) float64

All defined string metrics can be found in the metrics package.
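Because Similarity accepts any StringMetric, a custom metric only needs a Compare method. The sketch below is a minimal illustration rather than library code: the interface is redeclared locally so the example compiles on its own, `prefixMetric` is a hypothetical metric invented for this example, and `similarity` is a stand-in that assumes Similarity simply delegates to the metric.

```go
package main

import (
	"fmt"
	"strings"
)

// StringMetric mirrors the interface shown above. It is redeclared here
// only so the sketch compiles without importing the library.
type StringMetric interface {
	Compare(a, b string) float64
}

// prefixMetric is a hypothetical metric: the length of the common prefix
// divided by the length of the longer string, both measured in runes.
type prefixMetric struct {
	CaseSensitive bool
}

// Compare returns a value in [0, 1]; two empty strings count as identical.
func (m prefixMetric) Compare(a, b string) float64 {
	if !m.CaseSensitive {
		a, b = strings.ToLower(a), strings.ToLower(b)
	}
	ra, rb := []rune(a), []rune(b)
	if len(ra) < len(rb) {
		ra, rb = rb, ra // make ra the longer input
	}
	if len(ra) == 0 {
		return 1
	}
	common := 0
	for common < len(rb) && ra[common] == rb[common] {
		common++
	}
	return float64(common) / float64(len(ra))
}

// similarity is a local stand-in for strutil.Similarity (assumption: the
// library function delegates to metric.Compare).
func similarity(a, b string, metric StringMetric) float64 {
	return metric.Compare(a, b)
}

func main() {
	fmt.Printf("%.2f\n", similarity("Testing", "test", prefixMetric{})) // 4/7, prints 0.57
}
```

With the real package, a type like this could presumably be passed to strutil.Similarity in the same way as the built-in metrics.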
Calculate similarity.
    similarity := strutil.Similarity("text", "test", metrics.NewHamming())
    fmt.Printf("%.2f\n", similarity) // Output: 0.75

Calculate distance.
    ham := metrics.NewHamming()
    fmt.Printf("%d\n", ham.Distance("one", "once")) // Output: 2

More information and additional examples can be found on pkg.go.dev.
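To make the two Hamming results above easier to follow, here is a standalone sketch of the computation. It is not the library's implementation: it assumes that inputs of different lengths are allowed and that each surplus character counts as one mismatch, which is consistent with the distance of 2 reported for "one" and "once". The names `hammingDistance` and `hammingSimilarity` are made up for this example.

```go
package main

import "fmt"

// hammingDistance counts the positions at which the two inputs differ.
// Assumption: characters beyond the shorter input count as mismatches.
func hammingDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	if len(ra) < len(rb) {
		ra, rb = rb, ra // make ra the longer input
	}
	distance := len(ra) - len(rb)
	for i := range rb {
		if ra[i] != rb[i] {
			distance++
		}
	}
	return distance
}

// hammingSimilarity normalizes the distance by the longer input length.
func hammingSimilarity(a, b string) float64 {
	longest := len([]rune(a))
	if n := len([]rune(b)); n > longest {
		longest = n
	}
	if longest == 0 {
		return 1
	}
	return 1 - float64(hammingDistance(a, b))/float64(longest)
}

func main() {
	fmt.Println(hammingDistance("one", "once"))             // prints 2
	fmt.Printf("%.2f\n", hammingSimilarity("text", "test")) // prints 0.75
}
```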
Calculate similarity using default options.
    similarity := strutil.Similarity("graph", "giraffe", metrics.NewLevenshtein())
    fmt.Printf("%.2f\n", similarity) // Output: 0.43

Configure edit operation costs.
    lev := metrics.NewLevenshtein()
    lev.CaseSensitive = false
    lev.InsertCost = 1
    lev.ReplaceCost = 2
    lev.DeleteCost = 1

    similarity := strutil.Similarity("make", "Cake", lev)
    fmt.Printf("%.2f\n", similarity) // Output: 0.50

Calculate distance.
    lev := metrics.NewLevenshtein()
    fmt.Printf("%d\n", lev.Distance("graph", "giraffe")) // Output: 4

More information and additional examples can be found on pkg.go.dev.
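The effect of the edit operation costs can be seen in a small dynamic-programming sketch. This is a generic weighted Levenshtein distance written for illustration, not the package's code; the `costs` type and `levenshtein` function are hypothetical names, and case folding is left to the caller.

```go
package main

import "fmt"

// costs holds the price of each edit operation (hypothetical type).
type costs struct {
	Insert, Delete, Replace int
}

// levenshtein returns the minimum total cost of turning a into b,
// keeping only two rows of the dynamic-programming table in memory.
func levenshtein(a, b string, c costs) int {
	ra, rb := []rune(a), []rune(b)

	// prev[j] is the cost of building the first j runes of b from nothing.
	prev := make([]int, len(rb)+1)
	for j := 1; j <= len(rb); j++ {
		prev[j] = prev[j-1] + c.Insert
	}
	cur := make([]int, len(rb)+1)

	for i := 1; i <= len(ra); i++ {
		cur[0] = prev[0] + c.Delete
		for j := 1; j <= len(rb); j++ {
			best := prev[j-1] // match, or replace when the runes differ
			if ra[i-1] != rb[j-1] {
				best += c.Replace
			}
			if v := prev[j] + c.Delete; v < best {
				best = v
			}
			if v := cur[j-1] + c.Insert; v < best {
				best = v
			}
			cur[j] = best
		}
		prev, cur = cur, prev
	}
	return prev[len(rb)]
}

func main() {
	unit := costs{Insert: 1, Delete: 1, Replace: 1}
	fmt.Println(levenshtein("graph", "giraffe", unit)) // prints 4

	// With Replace at 2, a substitution costs the same as delete + insert.
	weighted := costs{Insert: 1, Delete: 1, Replace: 2}
	fmt.Println(levenshtein("make", "cake", weighted)) // prints 2
}
```

Dividing the weighted distance of 2 by the input length of 4 and subtracting from 1 gives 0.50, which agrees with the similarity shown for "make" and "Cake" above; whether the library normalizes in exactly this way is an assumption.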
Calculate similarity using the Jaro metric.

    similarity := strutil.Similarity("think", "tank", metrics.NewJaro())
    fmt.Printf("%.2f\n", similarity) // Output: 0.78

More information and additional examples can be found on pkg.go.dev.
Calculate similarity using the Jaro-Winkler metric.

    similarity := strutil.Similarity("think", "tank", metrics.NewJaroWinkler())
    fmt.Printf("%.2f\n", similarity) // Output: 0.80

More information and additional examples can be found on pkg.go.dev.
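The Jaro-Winkler score for "think" and "tank" is slightly higher than the Jaro score because the Winkler variant rewards a shared prefix. The sketch below shows that adjustment in its textbook form, with a scaling factor of 0.1 and a prefix capped at 4 characters; these constants and the function names are assumptions of this example and may not match the package's defaults. The Jaro score passed in is a made-up input, not a computed one.

```go
package main

import "fmt"

// commonPrefixLen returns the number of leading runes shared by a and b,
// capped at limit.
func commonPrefixLen(a, b string, limit int) int {
	ra, rb := []rune(a), []rune(b)
	n := 0
	for n < limit && n < len(ra) && n < len(rb) && ra[n] == rb[n] {
		n++
	}
	return n
}

// winkler applies the Winkler prefix bonus to an existing Jaro score:
// jw = jaro + l*p*(1-jaro), where l is the common prefix length and
// p is the scaling factor.
func winkler(jaro float64, a, b string) float64 {
	const scaling = 0.1 // textbook value, assumed here
	const maxPrefix = 4 // textbook cap, assumed here
	l := commonPrefixLen(a, b, maxPrefix)
	return jaro + float64(l)*scaling*(1-jaro)
}

func main() {
	// 0.8 is a hypothetical Jaro score; the inputs share the prefix "pre".
	fmt.Printf("%.2f\n", winkler(0.8, "prefix", "prelude")) // prints 0.86
}
```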
Calculate similarity using default options.
    swg := metrics.NewSmithWatermanGotoh()
    similarity := strutil.Similarity("times roman", "times new roman", swg)
    fmt.Printf("%.2f\n", similarity) // Output: 0.82

Customize gap penalty and substitution function.
    swg := metrics.NewSmithWatermanGotoh()
    swg.CaseSensitive = false
    swg.GapPenalty = -0.1
    swg.Substitution = metrics.MatchMismatch{
        Match:    1,
        Mismatch: -0.5,
    }

    similarity := strutil.Similarity("Times Roman", "times new roman", swg)
    fmt.Printf("%.2f\n", similarity) // Output: 0.96

More information and additional examples can be found on pkg.go.dev.
Calculate similarity using default options.
    sd := metrics.NewSorensenDice()
    similarity := strutil.Similarity("time to make haste", "no time to waste", sd)
    fmt.Printf("%.2f\n", similarity) // Output: 0.62

Customize n-gram size.
    sd := metrics.NewSorensenDice()
    sd.CaseSensitive = false
    sd.NgramSize = 3

    similarity := strutil.Similarity("Time to make haste", "no time to waste", sd)
    fmt.Printf("%.2f\n", similarity) // Output: 0.53

More information and additional examples can be found on pkg.go.dev.
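The role of the n-gram size is easier to see with a small standalone sketch. The version below compares sets of distinct n-grams, which is one common formulation of the Sorensen-Dice coefficient; the package may count repeated n-grams differently, so this code is not expected to reproduce the outputs above. The names `ngramSet` and `sorensenDice` are invented for this example.

```go
package main

import "fmt"

// ngramSet returns the distinct n-grams (runs of n runes) found in s.
// Sizes below 1 are treated as 1.
func ngramSet(s string, n int) map[string]struct{} {
	if n < 1 {
		n = 1
	}
	r := []rune(s)
	set := make(map[string]struct{})
	for i := 0; i+n <= len(r); i++ {
		set[string(r[i:i+n])] = struct{}{}
	}
	return set
}

// sorensenDice computes 2*|A∩B| / (|A|+|B|) over the two n-gram sets.
func sorensenDice(a, b string, n int) float64 {
	sa, sb := ngramSet(a, n), ngramSet(b, n)
	if len(sa)+len(sb) == 0 {
		return 1 // neither input yields any n-gram
	}
	common := 0
	for g := range sa {
		if _, ok := sb[g]; ok {
			common++
		}
	}
	return 2 * float64(common) / float64(len(sa)+len(sb))
}

func main() {
	// "night" -> ni, ig, gh, ht; "nacht" -> na, ac, ch, ht; one shared bigram.
	fmt.Printf("%.2f\n", sorensenDice("night", "nacht", 2)) // 2*1/8, prints 0.25
}
```

Larger n-grams make matches stricter, which is why the trigram example above scores lower than the default one.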
Calculate similarity using default options.
    j := metrics.NewJaccard()
    similarity := strutil.Similarity("time to make haste", "no time to waste", j)
    fmt.Printf("%.2f\n", similarity) // Output: 0.45

Customize n-gram size.
    j := metrics.NewJaccard()
    j.CaseSensitive = false
    j.NgramSize = 3

    similarity := strutil.Similarity("Time to make haste", "no time to waste", j)
    fmt.Printf("%.2f\n", similarity) // Output: 0.36

The Jaccard examples use the same input as the Sorensen-Dice examples because the two metrics are closely related. In fact, each coefficient can be calculated from the other.
Sorensen-Dice to Jaccard.

    J = SD/(2-SD)

where SD is the Sorensen-Dice coefficient and J is the Jaccard index.

Jaccard to Sorensen-Dice.

    SD = 2*J/(1+J)

where SD is the Sorensen-Dice coefficient and J is the Jaccard index.

More information and additional examples can be found on pkg.go.dev.
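The two conversion formulas can be checked against the example outputs given earlier. The helper names below are hypothetical and the inputs are the rounded values printed by the examples, so the results agree only to two decimal places.

```go
package main

import "fmt"

// jaccardFromDice converts a Sorensen-Dice coefficient to a Jaccard index.
func jaccardFromDice(sd float64) float64 {
	return sd / (2 - sd)
}

// diceFromJaccard converts a Jaccard index to a Sorensen-Dice coefficient.
func diceFromJaccard(j float64) float64 {
	return 2 * j / (1 + j)
}

func main() {
	// Default n-gram size: Sorensen-Dice 0.62, Jaccard 0.45.
	fmt.Printf("%.2f\n", jaccardFromDice(0.62)) // prints 0.45
	fmt.Printf("%.2f\n", diceFromJaccard(0.45)) // prints 0.62

	// N-gram size 3: Sorensen-Dice 0.53, Jaccard 0.36.
	fmt.Printf("%.2f\n", jaccardFromDice(0.53)) // prints 0.36
}
```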
Calculate similarity using default options.
    oc := metrics.NewOverlapCoefficient()
    similarity := strutil.Similarity("time to make haste", "no time to waste", oc)
    fmt.Printf("%.2f\n", similarity) // Output: 0.67

Customize n-gram size.
    oc := metrics.NewOverlapCoefficient()
    oc.CaseSensitive = false
    oc.NgramSize = 3

    similarity := strutil.Similarity("Time to make haste", "no time to waste", oc)
    fmt.Printf("%.2f\n", similarity) // Output: 0.57

More information and additional examples can be found on pkg.go.dev.
For more information see:
- Hamming distance
- Levenshtein distance
- Jaro-Winkler distance
- Smith-Waterman algorithm
- Sorensen-Dice coefficient
- Jaccard index
- Overlap coefficient
Contributions in the form of pull requests, issues, or just general feedback are always welcome.
See CONTRIBUTING.MD.
Copyright (c) 2019 Adrian-George Bostan.
This project is licensed under the MIT license. See LICENSE for more details.