Calculating Hamming distance on a large data set. Hamming distance

On a set of binary words of length m, the distance d(a,b) between words a and b is the number of positions in which these words differ. For example, the distance between the words a = 01101 and b = 00111 is 2.

The concept defined in this way is called the Hamming distance.

It satisfies the following distance axioms:

1) d(a,b) ≥ 0, and d(a,b) = 0 if and only if a = b;

2) d(a,b) = d(b,a) ;

3) d(a,b) + d(b,c) ≥ d(a,c) (triangle inequality).

The weight w(a) of a word a is the number of ones among its coordinates. Then the distance between the words a and b is the weight of their sum a ⊕ b: d(a,b) = w(a ⊕ b), where ⊕ denotes coordinate-wise addition modulo 2. Intuitively, the more the codewords differ from one another, the better the code is suited to error detection and correction. The concept of Hamming distance makes this precise.
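As an illustration (a sketch of mine, not part of the original text), both definitions translate directly into Python, representing words as integers and counting ones in the XOR:

def hamming_weight(a: int) -> int:
    # w(a): the number of ones among the coordinates of a
    return bin(a).count("1")

def hamming_distance(a: int, b: int) -> int:
    # d(a, b) = w(a XOR b): XOR marks exactly the non-matching positions
    return hamming_weight(a ^ b)

# The example from the text: a = 01101, b = 00111, distance 2
assert hamming_distance(0b01101, 0b00111) == 2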

Theorem. For a code to detect all errors in k (or fewer) positions, it is necessary and sufficient that the smallest distance between codewords be ≥ k + 1.

The proof of this theorem is similar to the proof of the following statement.

Theorem. For a code to correct all errors in k (or fewer) positions, it is necessary and sufficient that the smallest distance between codewords be ≥ 2k + 1.

32. Theorem on the correcting ability of codes.

Codes that can automatically correct errors are called self-correcting. To build a self-correcting code designed to correct single errors, one check bit is not enough. As will be seen below, the number of check bits k must be chosen so that the inequality 2^k ≥ k + m + 1, i.e. k ≥ log2(k + m + 1), is satisfied, where m is the number of information bits of the codeword. Currently, binary block correction codes are of greatest interest. With such codes, information is transmitted in blocks of equal length, and each block is encoded and decoded independently. In almost all block codes the symbols can be divided into information and check symbols.
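As a small sketch (mine; the helper name is assumed), the smallest number of check bits for m information bits can be found by scanning the inequality directly:

def min_check_bits(m: int) -> int:
    # Smallest k with 2^k >= k + m + 1
    k = 1
    while 2 ** k < k + m + 1:
        k += 1
    return k

# For m = 4 information bits, k = 3 check bits suffice (the (4, 7) Hamming code)
assert min_check_bits(4) == 3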

The main characteristics of self-correcting codes are:

1. Number of allowed and prohibited combinations. If n is the number of symbols in a block, r is the number of check symbols in a block, and k is the number of information symbols, then 2^n is the number of possible code combinations, 2^k is the number of allowed code combinations, and 2^n − 2^k is the number of prohibited combinations.

2. Code redundancy. The ratio r/n is called the redundancy of the correcting code.

3. Minimum code distance. The minimum code distance d is the minimum number of distorted symbols required to transition from one allowed combination to another.

4. Number of detected and corrected errors. If g is the number of errors that the code must correct, then it is necessary and sufficient that d ≥ 2g + 1.

5. Corrective capabilities of codes.

33. Matrix coding. Group codes.

To specify an (m, n)-code explicitly, one would have to list all 2^m codewords, which is very inefficient.

One economical way to describe a coding scheme is the matrix coding technique.

Previously, each encoding scheme was described by a table giving a codeword of length n for every source word of length m. For long blocks this method requires a large amount of memory and is therefore impractical. For example, a (16, 33)-code would require 33 * 2^16 = 2,162,688 bits.

Matrix coding requires much less memory. Let E be an m×n matrix consisting of elements e_ij, where i is the row number and j is the column number; each element e_ij is either 0 or 1. Encoding is implemented by the operation b = aE, where words are treated as vectors: message words as row matrices of size 1×m, and codewords as row matrices of size 1×n.

The encoding must not assign the same codeword to different source messages. A simple way to achieve this is to have m of the matrix columns form an identity matrix. When any vector is multiplied by the identity matrix, the same vector is obtained; therefore, different message vectors will correspond to different vectors of the systematic code.
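A minimal sketch of matrix encoding over GF(2) (mine, not the author's; the systematic matrix below is a hypothetical example with an identity part and one parity column):

def encode(a: list, E: list) -> list:
    # b = aE over GF(2): each output bit is the mod-2 dot product
    # of the message row vector with a column of E
    n = len(E[0])
    return [sum(a[i] * E[i][j] for i in range(len(a))) % 2 for j in range(n)]

# Systematic 3x4 matrix: identity columns plus a parity column
E = [[1, 0, 0, 1],
     [0, 1, 0, 1],
     [0, 0, 1, 1]]
assert encode([1, 0, 1], E) == [1, 0, 1, 0]  # info bits copied, even parity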

Matrix codes are also called linear codes. For linear (n − r, n)-codes with minimum Hamming distance d there is a Plotkin lower bound on the minimum number of check bits r for n ≥ 2d − 1.

A binary (m, n)-code is called a group code if its codewords form a group.

Note that the set of all binary words of length m forms a commutative group under coordinate-wise addition modulo 2, in which a ⊕ a = 0 holds, i.e. every word is its own inverse. Consequently, the set of message words of length m is a commutative group.


If the code is a group code, then the smallest distance between two codewords is equal to the smallest weight of a non-zero codeword.

This follows from the relation d(b_i, b_j) = w(b_i ⊕ b_j).

When a group code is used, those and only those errors go undetected whose error vectors are exactly equal to codewords.

Such error vectors map one codeword into another.

Therefore, the probability that an error remains undetected is equal to the sum of the probabilities of all error vectors equal to codewords.

The set of all binary words a = a_1 ... a_m of length m forms an Abelian (commutative) group under bitwise addition.

Let E be an m×n encoding matrix containing an m×m submatrix with nonzero determinant, for example the identity. Then the mapping a → aE takes the group of all binary words of length m to a group of codewords of length n.

Indeed, suppose b_1 = a_1·E and b_2 = a_2·E. Then for a_1 ⊕ a_2 we get (a_1 ⊕ a_2)E = a_1·E ⊕ a_2·E = b_1 ⊕ b_2, i.e. the sum of two message words is mapped to the sum of their codewords. Therefore, the one-to-one mapping of the group of binary words of length m given by the matrix E preserves the group operation, which means that the codewords form a group.

Group code property: the minimum code distance between code vectors is equal to the minimum weight of non-zero vectors. The weight of the code vector is equal to the number of ones in the code combination.

It is convenient to specify group codes using matrices whose dimensions are determined by the parameters k and n: the number of rows is k, and the number of columns is n = k + m.

The codes generated by these matrices are called (n, k)-codes, and the corresponding matrices are called generator matrices.

In this article we'll talk about the HEngine algorithm and the implementation of a solution to the problem of calculating the Hamming distance on large amounts of data.

Introduction

The Hamming distance is the number of different positions for strings of the same length. For example, HD(100, 001) = 2.

The problem of calculating the Hamming distance was first posed by Minsky and Papert in 1969; the problem was to find all strings in a database that lie within a given Hamming distance of the query string.

The task itself sounds deceptively simple, but finding an efficient solution for it is still an open question.

The Hamming distance is already quite widely used for various tasks, such as searching for close duplicates, pattern recognition, document classification, error correction, virus detection, etc.

For example, Manku and colleagues proposed a solution to the near-duplicate clustering problem in web document indexing based on calculating the Hamming distance.
Miller and colleagues proposed a method for searching for songs from a given audio fragment.
Similar solutions have been used for image retrieval, retinal recognition, and so on.

Description of the problem

There is a database T of n binary strings, each of length m, a query string a, and a required Hamming distance k.

The task comes down to finding all strings that lie within distance k of the query.

The original concept of the algorithm considers two variants of the problem: static and dynamic.

- In the static problem, the distance k is known in advance.
- In the dynamic problem, on the contrary, the required distance is not known in advance.

The article describes the solution to only a static problem.

Description of the HEngine algorithm for a static task

This implementation focuses on finding strings within k <= 10.

There are three solutions to a static problem: linear scan, query expansion, and table expansion.

Here, query expansion refers to the generation of all possible string variants that fit within the given distance for the original string.
Database expansion implies creating many copies of the database, in which either all possible variants that fit within the required distance are likewise generated, or the data is processed in some other way (more on this a little later).

HEngine uses a combination of these three techniques to effectively balance between memory and execution time.

A little theory

The algorithm is based on a small theorem that states the following:

If for two strings a and b the distance HD(a, b) <= k, and we divide the strings a and b into substrings using the rcut method with segmentation factor
r >= ⌊k/2⌋ + 1,
then there are guaranteed to be at least q = r − ⌊k/2⌋ pairs of corresponding substrings whose distance does not exceed one: HD(a_i, b_i) <= 1.

Extracting substrings from the base string by the rcut method is carried out according to the following principles:
a value called the segmentation factor is chosen, satisfying the condition
r >= ⌊k/2⌋ + 1.

The first r − (m mod r) substrings have length ⌊m/r⌋, and the last m mod r substrings have length ⌈m/r⌉, where m is the length of the string, ⌊·⌋ denotes rounding down, and ⌈·⌉ rounding up.
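A sketch of the rcut segmentation in Python (my reading of the rules above; the function name is assumed):

def rcut(s: str, r: int) -> list:
    # First r - (m mod r) substrings of length floor(m/r),
    # then m mod r substrings of length ceil(m/r)
    m = len(s)
    short_len, extra = divmod(m, r)
    parts, pos = [], 0
    for i in range(r):
        size = short_len if i < r - extra else short_len + 1
        parts.append(s[pos:pos + size])
        pos += size
    return parts

assert rcut("11110000", 2) == ["1111", "0000"]
assert rcut("110100010", 2) == ["1101", "00010"]  # m = 9: lengths 4 and 5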

Now the same thing, only with an example:

Given two binary strings of length m = 8 bits, A = 11110000 and B = 11010001, with distance k = 2 between them.
We choose the segmentation factor r = ⌊2/2⌋ + 1 = 2, so there will be 2 substrings of length m/r = 4 bits each.

a1 = 1111, a2 = 0000
b1 = 1101, b2 = 0001

If we now compute the distances between the corresponding substrings, then at least q = 2 − ⌊2/2⌋ = 1 pair of substrings will either match or differ by no more than one bit.

What we see:
HD(a1, b1) = HD(1111, 1101) = 1
and
HD(a2, b2) = HD(0000, 0001) = 1

The substrings of the base string are called signatures.
Signatures or substrings a1 and b1 (a2 and b2, a3 and b3, ..., a_r and b_r) are called compatible with each other, and if they differ in at most one bit, these signatures are called matching.

The main idea of the HEngine algorithm is to prepare the database in such a way that matching signatures can be found quickly, and then to select those strings that lie within the required Hamming distance.

Database Preprocessing

We already know that if we divide a string into substrings correctly, then at least one substring will either match the corresponding substring or differ from it by no more than one bit (the signatures will match).

This means that we do not need to do a full scan of all rows in the database; it is enough to first find the signatures that match, i.e. the substrings that differ by at most one bit.

But how to search by substrings?

Binary search should handle this well, but it requires the list of strings to be sorted. Since several substrings are produced from each string, to perform a binary search over a list of substrings, each such list must be sorted in advance.
This suggests the database expansion method: creating several tables, one for each substring (signature) position. (Such a table is called a signature table, and the collection of such tables a signature set.)

The original version of the algorithm also describes rearranging the substrings so that the selected substring comes first. This is done mainly for ease of implementation and for further optimization of the algorithm:

There is a string A divided into 3 substrings a1, a2, a3; the complete list of permutations will accordingly be:
a1, a2, a3
a2, a1, a3
a3, a1, a2

These signature tables are then sorted.

Implementation of search

At this stage, after the preliminary processing of the database, we have several sorted tables, each for its own substring position.

Obviously, if we want to first find substrings, we need to obtain signatures from the requested string in the same way that was used to create the signature tables.

We also know that the required substrings differ by at most one bit. To find them, the query expansion method is used.

In other words, for the selected substring we must generate all combinations, including the substring itself, that differ from it in at most one bit. The number of such combinations equals the length of the substring + 1.
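A sketch of this query expansion step (mine; it reproduces the variant lists used in the worked example below):

def expand_query(sig: str) -> list:
    # The substring itself plus every single-bit flip: len(sig) + 1 variants
    variants = [sig]
    for i, ch in enumerate(sig):
        flipped = "1" if ch == "0" else "0"
        variants.append(sig[:i] + flipped + sig[i + 1:])
    return variants

assert expand_query("1011") == ["1011", "0011", "1111", "1001", "1010"]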

These actions must be performed for all substrings and for all tables.

Finally, we filter out the strings that do not fit within the given Hamming distance limit, i.e. perform a linear scan over the found strings and keep only those satisfying the condition HD(a, b) <= k.

Bloom filter

If, before a binary search for a substring in a table, the filter returns that the substring is not in that table, then there is no point in performing the search.

Accordingly, you need to create one filter for each signature table.

The authors note that using a Bloom filter in this way reduces query processing time by 57.8% on average. On my test samples such a filter had virtually no effect on the resulting running time; it was noticeable only on a randomly generated database.
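A minimal sketch of the gating idea (mine; the filter size and hash scheme are illustrative, not the authors'):

import hashlib

class BloomFilter:
    # Minimal Bloom filter: index positions derived from one SHA-256 digest
    def __init__(self, size_bits: int = 1 << 16, n_hashes: int = 3):
        self.size = size_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.n_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# One filter per signature table: skip the binary search on a definite miss
bloom = BloomFilter()
for sig in ["0011", "1000", "1111"]:  # signatures stored in table T1
    bloom.add(sig)
if bloom.might_contain("1010"):
    pass  # only now is the binary search in T1 worth running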

Now the same thing, only with an example

There is a database of binary strings 8 bits long:
11111111
10000001
00111110

The task is to find all strings that differ in at most 2 bits from the query string 10111110.

So the required distance k = 2.

1. Select a segmentation factor.

Based on the formula, we choose the segmentation factor r = ⌊2/2⌋ + 1 = 2, which means there will be two substrings per string.

2. Create a set of signatures.

Since the number of substrings is 2, only 2 tables need to be created:

3. We save the substrings in the appropriate tables while maintaining a link to the original source.

T1                  T2
1111 => 11111111    1111 => 11111111
1000 => 10000001    0001 => 10000001
0011 => 00111110    1110 => 00111110

4. Sort the tables. Each one separately.

T1
0011 => 00111110
1000 => 10000001
1111 => 11111111

T2
0001 => 10000001
1110 => 00111110
1111 => 11111111

This completes the pre-processing. And let's start the search.

5. We receive the signatures of the requested string.

The query string 10111110 is divided into signatures: 1011 and 1110, the first for the first table and the second for the second.

6. Generate all combinations that differ by one.
The number of options will be 5.

6.1 For the first substring 1011:
1011
0011
1111
1001
1010

6.2 For the second substring 1110:
1110
0110
1010
1100
1111

7. Binary search.

7.1 For all generated variants of the first substring 1011, we perform a binary search in the first table for a complete match.

1011
0011 == 0011 => 00111110
1111 == 1111 => 11111111
1001
1010

Two substrings found.

7.2 Now, for all variants of the second substring 1110, we perform a binary search in the second table.

1110 == 1110 => 00111110
0110
1010
1100
1111 == 1111 => 11111111

Two substrings found.

8. Combining the results into one list:
00111110
11111111

9. Linearly check the candidates and filter out those that do not satisfy the condition HD <= 2:

HD(10111110, 00111110) = 1
HD(10111110, 11111111) = 2

Both strings satisfy the condition that no more than two elements differ.

Although a linear search is performed at this stage, the list of candidate strings is expected to be quite small.
When the number of candidates is large, the recursive version of HEngine is suggested.
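Pulling steps 1-9 together, here is an end-to-end sketch (mine, reusing the rcut and expand_query helpers sketched earlier; HEngine itself is a compiled prototype, so this only illustrates the scheme):

from bisect import bisect_left

def preprocess(db: list, r: int) -> list:
    # One (signature, original string) table per substring position, each sorted
    tables = [[] for _ in range(r)]
    for s in db:
        for i, sig in enumerate(rcut(s, r)):
            tables[i].append((sig, s))
    return [sorted(t) for t in tables]

def search(tables: list, query: str, r: int, k: int) -> list:
    candidates = set()
    for i, sig in enumerate(rcut(query, r)):
        for variant in expand_query(sig):  # all distance <= 1 variants
            j = bisect_left(tables[i], (variant, ""))
            while j < len(tables[i]) and tables[i][j][0] == variant:
                candidates.add(tables[i][j][1])
                j += 1
    # Step 9: linear filter over the (small) candidate list
    return sorted(s for s in candidates
                  if sum(x != y for x, y in zip(s, query)) <= k)

tables = preprocess(["11111111", "10000001", "00111110"], r=2)
print(search(tables, "10111110", r=2, k=2))  # ['00111110', '11111111']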

Illustration

Figure 1 shows an example of how the search algorithm works.
For a string length of 64 and a distance limit of 4, the segmentation factor is 3, i.e. 3 substrings per string.
Here T1, T2 and T3 are the signature tables containing only the substrings B1, B2, B3 respectively.

The requested string is divided into substrings. Next, a range of signatures is generated for the corresponding substrings. Finally, a binary search is performed.

Fig 1. A simplified version of processing queries to signature tables.

Results

The average-case complexity of this algorithm is O(m(log n + 1)), where n is the total number of rows in the database, m is the number of binary searches performed, and log n + 1 is the cost of one binary search.
In extreme cases the complexity can exceed linear: for example, when q = 1 and all rows of all signature tables except the last one match the query, it becomes O((r − 1)mn(log n + 1)).

It is noted that this approach uses 4.65 times less memory and is 16% faster than the previous work cited by the authors.

Implementation

All this certainly sounds tempting, but until you actually try it, it is hard to judge the scale.
A HEngine prototype was created and tested on available real data.

Tests:

$ ./matches 7 data/db/table.txt data/query/face2.txt
Reading the dataset ........ done. 752420 db hashes and 343 query hashes.
Building with 7 hamming distance bound ....... done.
Building time: 12.964 seconds
Searching HEngine matches ....... found 100 total matches.
HEngine query time: 0.228 seconds
Searching linear matches ....... found 100 total matches.
Linear query time: 6.828 seconds

The results were encouraging: searching for 343 hashes in a database of 752,420 entries takes ~0.2 seconds, which is 30 times faster than a linear search.

It would seem that we could stop here. But I really wanted to try to use it somehow in a real project.

One step away from a real application

There is a database of image hashes and a backend in PHP.
The task was to connect the functionality of HEngine to PHP somehow.
It was decided to use FastCGI; the posts habrahabr.ru/post/154187/ and habrahabr.ru/post/61532/ helped me a lot with this.

From PHP it is simply a call:

$list = file_get_contents("http://fcgi.local/?" . $hashes);

It returns the result in about 0.5 seconds, whereas a linear search takes 9 seconds, and queries through MySQL at least 20 seconds.

Thanks to everyone who read this far.

The Hamming distance between two binary sequences (vectors) of equal length is the number of positions in which they differ; in this formulation the Hamming distance is included in the NIST Dictionary of Algorithms and Data Structures.

Thus, the Hamming distance between the vectors 00111 and 10101 is 2. The metric was subsequently generalized to q-ary sequences: for example, for the pair of equal-length strings "toned" and "roses" (a standard textbook pair) the Hamming distance is three.

In general, for two objects x and y with n coordinates, the Hamming distance is given by the function d(x, y) = |{i : x_i ≠ y_i}|, the number of indices at which the coordinates differ.

The Hamming distance has the properties of a metric: d(x, y) ≥ 0, with d(x, y) = 0 only for x = y; d(x, y) = d(y, x); and d(x, y) ≤ d(x, z) + d(z, y).



Hamming distance

The American mathematician Richard Hamming studied what determines whether a code will detect errors and when it can correct them. Intuitively, this depends on how far apart the codewords are from each other and on how many errors can appear in the transmitted word. We will now formalize this idea: when encoding, the number of possible errors in the transmitted codeword must be coordinated with the code so that a corrupted codeword remains closer to the original codeword than to any other codeword.

Definition 13.1. On the set of all binary words of length m over the alphabet B = {0,1}, consider the distance d(x, y) equal to the number of non-matching positions of these words. For example, for the words x = 011101, y = 101010 the distance is d(x, y) = 5. This distance is called the Hamming distance.

It can be shown that the Hamming distance satisfies the axioms of metric space:

1) d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y;

2) d(x, y) = d(y, x);

3) d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality).

Theorem 13.1 (on the detection code). A code detects all errors in the case when the transmitted word contains no more than k errors if and only if the smallest distance between codewords satisfies

d(b1, b2) ≥ k + 1.

Theorem 13.2 (on the correction code). A code corrects all errors in the case when the transmitted word contains no more than k errors if and only if the smallest distance between codewords satisfies

d(b1, b2) ≥ 2k + 1.

Proof. The proofs of these theorems are similar. Therefore, we will prove only the last theorem.

Sufficiency. Suppose that for any pair of codewords d(b1, b2) ≥ 2k + 1. If at most k errors occurred while transmitting the codeword b1, then for the received word c we have d(b1, c) ≤ k. But by the triangle inequality, for any other codeword b2 we have d(b1, c) + d(c, b2) ≥ d(b1, b2) ≥ 2k + 1. Therefore the distance from the received word to any other codeword is d(c, b2) ≥ k + 1, i.e. greater than the distance to b1. Hence, for the received word c we can uniquely find the nearest codeword b1 and then decode it.

Necessity. By contradiction. Assume that the minimum distance between codewords is less than 2k + 1. Then there are two codewords at distance d(b1, b2) ≤ 2k. Suppose that when the word b1 is transmitted, the received word c lies between b1 and b2 and contains exactly k errors. Then d(c, b1) = k and d(c, b2) = d(b1, b2) − d(c, b1) ≤ k. Thus, from the word c it is impossible to reconstruct unambiguously which codeword was transmitted, b1 or b2. This is a contradiction.

Example 13.3. Consider the following five-bit codes for words of length 2 over the alphabet B = {0,1}:

b1 = K(00) = 00000, b2 = K(01) = 01011,

b3 = K(10) = 10101, b4 = K(11) = 11110.

The minimum distance between distinct codewords is d(bi, bj) = 3. By the first theorem (on the detection code), this code can detect up to two errors in a word. By the second theorem, the code can correct at most one error in a word.
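The minimum distance claim is easy to verify mechanically (a sketch of mine):

from itertools import combinations

code = {"00": "00000", "01": "01011", "10": "10101", "11": "11110"}

def hd(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))

d_min = min(hd(u, v) for u, v in combinations(code.values(), 2))
assert d_min == 3  # detects d-1 = 2 errors, corrects (d-1)//2 = 1 error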

Group codes

Let us take a closer look at codes for words over the alphabet B = {0, 1}. If codewords of length n are used for words of length m, we call such codes (m, n)-codes. There are 2^m words of length m in total. To specify an (m, n)-code, one can list the codewords for all possible words of length m, as in the previous example. A more economical way to specify codewords is the matrix approach.

In this case a generator matrix G = ||g_ij|| of size m × n with entries 0 and 1 is specified. The codewords are determined for each word a = a_1 a_2 ... a_m by multiplying this word on the left, as a row vector, by the generator matrix: b = aG.

Here addition is defined modulo 2. In order for different words to correspond to different codewords, it is enough for the matrix G to contain an identity basis minor of order m, for example the leftmost one.

Example 13.4. Consider the generator matrix

G = | 1 0 0 1 |
    | 0 1 0 1 |
    | 0 0 1 1 |

This matrix defines a (3, 4)-code. The first three symbols of a codeword are information symbols, and the fourth is a check symbol: it equals 0 if the source word contains an even number of ones, and 1 if the number of ones is odd. For example, for the word a = 101 the codeword is b = aG = 1010. The minimum Hamming distance between codewords is d(bi, bj) = 2, so this code detects a single error.
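The check symbol here is an even-parity bit, so a received word can be tested with a one-line predicate (a sketch of mine):

def parity_ok(word: str) -> bool:
    # A valid codeword of this (3, 4)-code has an even number of ones
    return word.count("1") % 2 == 0

assert parity_ok("1010")       # b = aG for a = 101
assert not parity_ok("1110")   # any single flipped bit is detected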

Definition 13.2. A code is called a group code if the set of all codewords forms a group. The number of ones in a word a is called the weight of the word and is denoted w(a). If b is a codeword and c = b + e is the word received from the communication channel, then the word e is called the error vector.

Theorem 13.3. Let a group (m, n)-code be given. Then the commutative group K of all codewords is a subgroup of the commutative group C of all words of length n that can be received from the communication channel. The smallest distance between codewords equals the smallest weight of a non-zero codeword.

Consider the factor group C/K. The cosets here are determined by shifts e + b, where b runs over K.

As the representative of a coset we choose the element of least weight. Such elements are called coset leaders.

If the leaders are interpreted as error vectors, then each coset is the set of words distorted in the communication channel by a fixed error vector; in particular, for e = 0 we get the coset of undistorted words, i.e. the set of all codewords. Correcting and decoding a received word c consists in finding the coset to which c = e + b belongs: the error vector e determines the number and location of the errors, and the codeword b determines the correction of the received word.

To simplify the search for the coset, and thus for the error vector, Hamming proposed using group codes with special generator matrices.

Hamming codes

Let us consider the construction of a Hamming (m, n)-code with smallest codeword weight 3, i.e. a code correcting one error.

Let n = 2^r − 1, and let each codeword contain r check symbols and m = n − r = 2^r − 1 − r information symbols, with r ≥ 2; for example the (1, 3)-code, the (4, 7)-code, and so on. In each codeword b = b1 b2 ... bn, the symbols whose indices are powers of 2 are check symbols, and the rest are information symbols. For example, for the (4, 7)-code, in the codeword b = b1 b2 b3 b4 b5 b6 b7 the symbols b1, b2, b4 are check symbols, and b3, b5, b6, b7 are information symbols. To specify the generator matrix G of the Hamming (m, n)-code, consider an auxiliary matrix M of size r × n, where n = 2^r − 1, such that the j-th column of M contains the binary representation of the number j. For the (4, 7)-code, M is the 3 × 7 matrix

M = | 0 0 0 1 1 1 1 |
    | 0 1 1 0 0 1 1 |
    | 1 0 1 0 1 0 1 |
We define the set of all code words as the set of solutions to a homogeneous system of linear algebraic equations of the form

bM^T = 0.

For example, for the (4, 7)-code this system is:

b4 ⊕ b5 ⊕ b6 ⊕ b7 = 0
b2 ⊕ b3 ⊕ b6 ⊕ b7 = 0
b1 ⊕ b3 ⊕ b5 ⊕ b7 = 0

Let us choose a natural basis minor of the system bM^T = 0 in the columns whose numbers are powers of 2. This divides the variables into basic (check) and free (information) ones. Having assigned the free variables, it is easy to obtain the basic ones. Let us find a fundamental system of m = n − r solutions of this homogeneous system. Then any solution of the system is a linear combination of these m solutions. Therefore, writing the m solutions of the fundamental system row by row as a matrix G of size m × n, we obtain the generator matrix of the Hamming group (m, n)-code. For example, for the (4, 7)-code the fundamental system consists of 4 = 7 − 3 solutions of the homogeneous system:

g1 = 1110000, g2 = 1001100, g3 = 0101010, g4 = 1101001.

Any linear combination of these solutions will be a solution, i.e. a codeword. Composing the generator matrix from these fundamental solutions, we get

G = | 1 1 1 0 0 0 0 |
    | 1 0 0 1 1 0 0 |
    | 0 1 0 1 0 1 0 |
    | 1 1 0 1 0 0 1 |

Now, for any word a of length m = 4 it is easy to compute the codeword b of length n = 7 using the generator matrix: b = aG. The symbols b3, b5, b6, b7 are information symbols; they coincide with a1, a2, a3, a4. The symbols b1, b2, b4 are check symbols.

Conclusion. Hamming codes are convenient because the cosets are easily determined during decoding. Let the word received over the communication channel be c = e + b, where e is the error vector and b is a codeword. Multiply it by the auxiliary matrix: cM^T = (e + b)M^T = eM^T. If eM^T = 0, then the word c is a codeword and we assume there is no error. If eM^T ≠ 0, then the word e determines the error.

Recall that the constructed Hamming (m, n)-code corrects one error. Hence the error vector e contains a single one, in position i. Moreover, the position number i is obtained in binary representation as the result eM^T, which coincides with the i-th column of the matrix M. It remains to flip the i-th symbol of the word c received over the channel, cross out the check symbols, and write down the decoded word.

For example, let the received word be c = 1100011 for the (4, 7) Hamming code. Multiplying this word by the matrix M^T, we get

(1100011)M^T = (010).

Hence there is an error in the second symbol, and the codeword is b = 1000011. Crossing out the check symbols b1, b2, b4, the decoded word is a = 0011.
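A sketch of this syndrome decoding for the (4, 7) code (mine; it relies on the fact that the j-th column of M is the binary representation of j, so the syndrome is just the XOR of the positions of the one-bits):

def syndrome(c: list) -> int:
    # eM^T: XOR of the 1-based positions of all one-bits;
    # a nonzero value is the position of the single error
    s = 0
    for pos, bit in enumerate(c, start=1):
        if bit:
            s ^= pos
    return s

def decode(c: list) -> list:
    c = list(c)
    s = syndrome(c)
    if s:
        c[s - 1] ^= 1                            # correct the single error
    return [c[i - 1] for i in (3, 5, 6, 7)]      # drop check bits b1, b2, b4

# The example from the text: c = 1100011, syndrome 010 = 2, decoded word 0011
assert syndrome([1, 1, 0, 0, 0, 1, 1]) == 2
assert decode([1, 1, 0, 0, 0, 1, 1]) == [0, 0, 1, 1]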

Of course, if the error affected more than one symbol, this code will not correct it.