SomeInfos Header

Levenshtein Distance

In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e. an edit distance). The term edit distance is often used to refer specifically to Levenshtein distance.

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.

For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

In approximate string matching, the objective is to find matches for short strings, for instance, strings from a dictionary, in many longer texts, in situations where a small number of differences is to be expected. Here, one of the strings is typically short, while the other is arbitrarily long. This has a wide range of applications, for instance, spell checkers, correction systems for optical character recognition, and software to assist natural language translation based on translation memory.

The Levenshtein distance can also be computed between two longer strings, but the cost to compute it, which is roughly proportional to the product of the two string lengths, makes this impractical.

Levenshtein distance is not the only popular notion of edit distance. Variations can be obtained by changing the set of allowable edit operations: for instance,

Edit distance in general is usually defined as a parametrizable metric in which a repertoire of edit operations is available, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.

A commonly-used bottom-up dynamic programming algorithm for computing the Levenshtein distance involves the use of an (n + 1) × (m + 1) matrix, where n and m are the lengths of the two strings. This algorithm is based on the Wagner-Fisher algorithm for edit distance. Here is pseudocode for a function LevenshteinDistance that takes two strings, s of length m, and t of length n, and computes the Levenshtein distance between them:

Two examples of the resulting matrix (hovering over a number reveals the operation performed to get that number):

The invariant maintained throughout the algorithm is that we can transform the initial segment s into t using a minimum of d operations. At the end, the bottom-right element of the array contains the answer.

As mentioned earlier, the invariant is that we can transform the initial segment s into t using a minimum of d operations. This invariant holds since:

This proof fails to validate that the number placed in d is in fact minimal; this is more difficult to show, and involves an argument by contradiction in which we assume d is smaller than the minimum of the three, and use this to show one of the three is not minimal.

Possible improvements to this algorithm include:

The Levenshtein distance has several simple upper and lower bounds that are useful in applications which compute many of them and compare them. These include:

DISCLAIMER
2 visitors online
SomeInfos Header
SomeInfos Header
Developed by Opti-Web