Basic comparison of two strings (i.e. using an equality operator) allows you to determine if they are equal but not if they are *similar*.

Why would someone want to determine the similarity of two strings? A few applications are : spell-checking (the famous autocorrect), anti spam techniques…

To determine the similarity a « few » algorithms exist. Some of them use this principle :

The similarity of two strings can be computed from the number of operations necessary to make those two strings equal.

The principal operations are : substition, insertion, deletion. Simple examples:

- COLD and FOLD => Only
__one__operation (substitution) to make the strings equal. - COLD and HOT =>
__Three__operations (two substitutions and an insertion)

### Levenshtein distance

A simple C# implementation of the Levenshtein algorithm :

```
static void Main(string[] args)
{
var strings = new string[]
{
"carrotte",
"carreau",
"crarotte",
"crotter",
"",
"carote"
};
//should be tolower
var originalString = "carote";
//main loop to confront each string with the originalOne
foreach (var str in strings)
{
int levenshteinDistance = 0;
//to avoid loop turns
if (str == originalString)
{
//Strings are identical => distance = 0
continue;
}
if (str.Length == 0)
{
//str is empty => distance = originalString.Length
continue;
}
if (originalString.Length == 0)
{
//originalString is empty => distance = str.Length
continue;
}
var minDistance = Math.Abs(str.Length - originalString.Length);
var maxDistance = Math.Max(str.Length, originalString.Length);
var distanceMatrix = new int[str.Length+1, originalString.Length+1];
for (int i = 0; i <= str.Length; distanceMatrix[i, 0] = i++) ;
for (int j = 0; j <= originalString.Length; distanceMatrix[0, j] = j++) ;
for (int j = 1; j <= originalString.Length; j++)
{
for (int i = 1; i <= str.Length; i++)
{
//equality for the characters
if (str[i-1] == originalString[j-1])
{
distanceMatrix[i, j] = distanceMatrix[i - 1, j - 1];
}
else
{
distanceMatrix[i, j] = Math.Min(
Math.Min(
distanceMatrix[i-1,j]+1, //deletion
distanceMatrix[i,j-1]+1), //insertion
distanceMatrix[i-1,j-1]+1 //substition
);
}
}
}
//higher i,j point is the Levenshtein distance
levenshteinDistance = distanceMatrix[str.Length, originalString.Length];
var result = 100 - (double)(levenshteinDistance ) / (maxDistance) * 100;
}
}
```

Ahd here are a few resutls :

### Going further

As said this implementation is really simplistic. To be more accurate we could add an operation « transposition » (i.e. swap) to check if two characters are inverted and count that as one operation instead of two substitions.

Also each operation has a cost of 1 regardless of its nature and the ‘difference’ between the two characters being compared. We could add a keyboard-distance-based cost between two letters.

comments powered by Disqus