Basic comparison of two strings (i.e. using an equality operator) allows you to determine if they are equal but not if they are similar.
Why would someone want to determine the similarity of two strings? A few applications are : spell-checking (the famous autocorrect), anti spam techniques…
To determine the similarity a « few » algorithms exist. Some of them use this principle :
The similarity of two strings can be computed from the number of operations necessary to make those two strings equal.
The principal operations are : substition, insertion, deletion. Simple examples:
- COLD and FOLD => Only one operation (substitution) to make the strings equal.
- COLD and HOT => Three operations (two substitutions and an insertion)
Levenshtein distance
A simple C# implementation of the Levenshtein algorithm :
static void Main(string[] args)
{
var strings = new string[]
{
"carrotte",
"carreau",
"crarotte",
"crotter",
"",
"carote"
};
//should be tolower
var originalString = "carote";
//main loop to confront each string with the originalOne
foreach (var str in strings)
{
int levenshteinDistance = 0;
//to avoid loop turns
if (str == originalString)
{
//Strings are identical => distance = 0
continue;
}
if (str.Length == 0)
{
//str is empty => distance = originalString.Length
continue;
}
if (originalString.Length == 0)
{
//originalString is empty => distance = str.Length
continue;
}
var minDistance = Math.Abs(str.Length - originalString.Length);
var maxDistance = Math.Max(str.Length, originalString.Length);
var distanceMatrix = new int[str.Length+1, originalString.Length+1];
for (int i = 0; i <= str.Length; distanceMatrix[i, 0] = i++) ;
for (int j = 0; j <= originalString.Length; distanceMatrix[0, j] = j++) ;
for (int j = 1; j <= originalString.Length; j++)
{
for (int i = 1; i <= str.Length; i++)
{
//equality for the characters
if (str[i-1] == originalString[j-1])
{
distanceMatrix[i, j] = distanceMatrix[i - 1, j - 1];
}
else
{
distanceMatrix[i, j] = Math.Min(
Math.Min(
distanceMatrix[i-1,j]+1, //deletion
distanceMatrix[i,j-1]+1), //insertion
distanceMatrix[i-1,j-1]+1 //substition
);
}
}
}
//higher i,j point is the Levenshtein distance
levenshteinDistance = distanceMatrix[str.Length, originalString.Length];
var result = 100 - (double)(levenshteinDistance ) / (maxDistance) * 100;
}
}
Ahd here are a few resutls :
Going further
As said this implementation is really simplistic. To be more accurate we could add an operation « transposition » (i.e. swap) to check if two characters are inverted and count that as one operation instead of two substitions.
Also each operation has a cost of 1 regardless of its nature and the ‘difference’ between the two characters being compared. We could add a keyboard-distance-based cost between two letters.
comments powered by Disqus