Interactive Programming Education Platform

Logic Practice

In this guide, we are going to explore the Knuth-Morris-Pratt (KMP) algorithm in C++ along with its code implementation. Among the various algorithms used for pattern matching, we have the Naive Algorithm and the Rabin-Karp Algorithm. When comparing these algorithms, the Naive approach and Rabin-Karp have a time complexity of O((n-m)*m); however, Rabin-Karp usually performs better than the Naive method. On the contrary, the KMP algorithm works with a time complexity of O(n) and requires O(m) additional space. In this scenario, the lengths of the pattern string and the text string are denoted as m and n, respectively.

The straightforward method of pattern matching, where n represents the text's length and m represents the pattern's length, operates in O(n.m) time, as indicated. This time complexity arises from the algorithm's inability to remember previously matched characters, resulting in the continuous comparison of individual characters with different pattern characters.

The clever utilization of the existing comparison information is the key to how the KMP technique functions. By avoiding redundant comparisons between text and pattern symbols that have already been matched, it achieves pattern recognition in linear O(n) time complexity. To analyze the pattern's composition, it relies on a partial match table, which requires O(m) time to construct. Consequently, the total time complexity of the KMP algorithm amounts to O(m + n).

Create a function named patternSearch(char pattern, char text) designed to identify and display all occurrences of a specified pattern within a given text. The function should take in two parameters: the pattern array (pattern) and the text array (text). It is assumed that the length of the text array is greater than the length of the pattern array.

Within the function, implement a search algorithm that scans the text array to locate all instances where the pattern appears. By iterating through the text array and comparing it with the pattern array, the function should be able to pinpoint and output each occurrence of the pattern within the text.

Ensure that the function is structured to effectively handle scenarios where the pattern may occur multiple times within the text. This involves systematically identifying and presenting each instance of the pattern without overlooking any occurrences within the given text data.

Examples:

Example


Enter "THIS IS A TEST TEXT" and "TEST" as the inputs.
Pattern discovered at index 10
Enter this value for txt[]: "AABAACAADAABAABA"
pat[] equals "AABA."

Output:

A pattern was identified at position 0, succeeded by patterns at position 9 and position 12.

An essential aspect of computer science involves pattern matching. Pattern-matching techniques reveal the search outcomes following a string query within various platforms like databases, browsers, or text editing applications.

In the previous discussion, we covered the Naive pattern-matching algorithm. The Naive algorithm has a worst-case time complexity of O(m(n-m+1)). In contrast, the worst-case time complexity of the KMP algorithm is O(n+m).

Pattern-Searching using KMP (Knuth Morris Pratt)

When multiple identical characters are succeeded by a different character, the Naive pattern-matching algorithm encounters difficulties in identifying patterns.

For example,

✅ text = "AAAAAAAAAAAAAB," and 2) pattern = "AAAAB."
✅ pattern = "ABABAC", text = "ABABABCABABABCABABABC" (This is not the most challenging case, but it is a suboptimal situation for the Naive algorithm)

The deteriorating characteristic of the pattern, which arises from the repetition of sub-patterns within it, is leveraged by the KMP matching algorithm to lower the worst-case complexity to O(n+m).

The core principle of the KMP algorithm lies in the assumption that we possess prior knowledge about certain characters within the text window following a mismatch. Leveraging this information enables us to skip comparing characters that we are certain will already match.

Comparison Overview of the text "AAAAABAAABA."

Pat is "AAAA"

We contrast the first text window with the pat

The starting point for the pattern "AAAA" within the text "AAAAABAAABA" is the index where the pattern first appears.

We find a match. This operates in a manner akin to Naive String Matching.

Next, we analyze the subsequent section of the text document against the specified pattern.

the text "AAAAABAAABA"

pat equals "AAAA" [Pattern shifted by one position]

KMP outperforms Naive in this scenario. To verify the pattern fit within the current text window, we specifically examine the fourth occurrence of A in the pattern against the fourth letter in the second window. The initial three characters were skipped as their match was already anticipated.

Preprocessing Required?

As previously stated, an important question arises: how do we calculate the number of characters to skip? To address this, we preprocess the pattern to calculate and store this information in an integer array named lps, which indicates the number of characters to skip.

Overview Of Preprocessing:

To bypass characters during comparison, the Knuth-Morris-Pratt algorithm prepares the pattern in advance and generates an auxiliary lps array of length m (equivalent to the size of the pattern).

The term "lps" represents the longest valid prefix that is also a suffix within a given string. A valid prefix encompasses the complete substring from the beginning of the string up to a certain point. In the case of "ABC," valid prefixes include "", "A", "AB," and "ABC," with the specific valid prefixes being "", "A", and "AB". On the other hand, the string possesses suffixes such as "", "C", "BC", and "ABC".

When exploring subpatterns that involve the lps concept, our focus is directed towards substrings within patterns that serve as both prefixes and suffixes.

In the process, we determine and store the length of the longest matching valid prefix, which also functions as a suffix for the sub-pattern pat[0..i], denoted as lps[i]. This calculation is carried out for each sub-pattern pat[0..i], where i ranges from 0 to m-1.

The maximum valid prefix of pat[0..i] that matches a suffix of pat[0..i] is denoted as lps[i].

It is important to highlight that lps[i] represents both the longest prefix and a valid suffix. It is crucial to apply this correctly in a specific context to prevent the entire substring from being included in the evaluation.

lps construction examples:

The lps array value corresponding to the pattern "AAAA" is [0, 1, 2, 3].

The lps array for the pattern "ABCDE" is [0, 0, 0, 0, 0].

The array Lps contains the values [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5] corresponding to the pattern "AABAACAABAA".

The array lps corresponding to the pattern "AAACAAAAAC" is [0, 1, 2, 0, 1, 2, 3, 3, 4, 5].

The lps array for the pattern "AAABAAA" is [0, 1, 2, 0, 1, 2, 3].

Preprocessing Algorithm:

During this phase, we calculate values in lps. To achieve that, we maintain track of the longest prefix suffix value length for the preceding index using the len variable.

✅ We set lps[0] and len to zero.
✅ If pat[len] and pat[i] are equal, len is increased by 1, and the resulting value is sent to lps[i].
✅ If len is not 0 and pat[i] and pat[len] are not equal, we update len to lps[len-1].
✅ For further information, see computeLPSArray in the code below.

Preprocessing (or lps Creation) Example:

len = 0, i = 0, pat = "AAACAAAA":

✅ Since lps[0] is always 0 and i = 1 follows len = 0, i = 1:
✅ Do len++, save it in lps[i], and do i++ since pat[len] and pat[i] match.
✅ If len = 1 and lps[1] = 1, then len = 1 and i = 2 is obtained:
✅ Do len++, save it in lps[i], and do i++ since pat[len] and pat[i] match.
✅ If len = 2 and lps[2] = 2 and i = 3, then len = 2 and i = 3:
✅ Since len > 0 and pat[len] and pat[i] are incompatible,
✅ Make len equal to lps[len-1] = lps[1] = 1. len = 1, i = 3:
✅ Since len > 0 and pat[len] and pat[i] do not match, len = lps[len-1] = lps[0] = 0 and therefore len = 0, i = 3:
✅ Given that len = 0 and that pat[len] and pat[i] are not equal, set lps[3] = 0 and i = 4 to obtain len = 0, i = 4:
✅ Do len++, save it in lps[i], and do i++ since pat[len] and pat[i] match.
✅ If len = 1 and lps[4] = 1, then i = 5 and len = 1 as well:
✅ Do len++, save it in lps[i], and do i++ since pat[len] and pat[i] match.
✅ Set len to 2, lps[5] to 2, and i to 6.
✅ If len = 2 and i = 6, perform the following steps: perform len++, save the result in lps[i], and perform i++.
✅ Len = 3, lps[6] = 3, and i = 7 such that Len = 3, i = 7
✅ Since len > 0 and pat[len] and pat[i] are not equal,
✅ If len = lps[len-1] = lps[2] = 2, then i = 7 and len = 2:
✅ Do len++, save it in lps[i], and do i++ since pat[len] and pat[i] match.
✅ Len = 3, lps [7] = 3, and i = 8
✅ We end here because the entire LP has been built.
✅ len = 1, i = 3:

Implementation Of KMP Algorithm:

Instead of shifting the pattern by one and comparing all characters at each shift, as done in the Naive method, we leverage a value stored in lps to identify the next characters that need to be matched. This approach aims to prevent unnecessary matching of characters that are highly likely to match.

How can I use lps to choose the next locations (or determine how many characters need to be skipped)?

✅ We begin comparing the pat[j] characters from the currently active text window, where j = 0.
✅ While pat[j] and txt[i] continue to match, we continue to match the characters txt[i] and pat[j] and increment i and j.
✅ Characters pat[0..j-1] match characters txt[i-j...i-1] when we notice a mismatch. (Remember that j increments its value only when there is a match; otherwise, it starts at 0).
✅ From the definition above, we also know that lps[j-1] is the number of letters in pat[0...j-1] that are both a valid prefix and a proper suffix.
✅ We may infer from the two considerations mentioned earlier that we do not need to match these lps[j-1] characters with txt[i-j...i-1] since we know they will already match. Let's use the example mentioned earlier to grasp this further.

The Method as Mentioned Earlier Is Shown In The Following Way:

Consider txt = "AAAAABAAABA", pat = "AAAA"

If the LPS building process described above is applied, the resulting LPS values are as follows: 0, 1, 2, 3.

✅ At i = 0, j = 0: txt[i] matches pat[j], increment i and j.
✅ At i = 1, j = 1: txt[i] matches pat[j], increment i and j.
✅ At i = 3, j = 3: txt[i] matches pat[j]. Since j = M, the pattern is found, and j is reset for further processing.

We do not match the first three characters of this window here, in contrast to the Naive algorithm. In the previous phase, the value of lps[j-1] provided us with the index of the next character to match.

✅ i = 4, j = 3: do i++, j++ when txt[i] and pat[j] match.
✅ i = 5, j = 4: Since j == M, the print pattern is detected and reset to j, which is then equal to lps[j-1] + lps[3] = 3.
✅ We do not match the first three characters of this window, once again in contrast to the Naive method. In the previous phase, the value of lps[j-1] provided us with the index of the next character to match. i = 5, j = 3: txt[i] and pat[j] are not equal, and j > 0; just j has to be changed. j = lps[j-1] = lps[2] = 2. j = lps[j-1] = lps[1] = 1 when i = 5, j = 2, and txt[i] and pat[j] do not match. j = lps[j-1] = lps[0] = 0 when i = 5 and j = 1 when txt[i] and pat[j] do not match. i = 5, j = 0: Since j is 0 and txt[i] and pat[j] do not match, we do i++. If i = 6 and j = 0, then i++ and j++ must be performed. If i = 7 and j = 1, then i++ and j++ must be performed.
✅ i = 5, j = 3: txt[i] and pat[j] are not equal, and j > 0; just j has to be changed. j = lps[j-1] = lps[2] = 2.
✅ j = lps[j-1] = lps[1] = 1 when i = 5, j = 2, and txt[i] and pat[j] do not match.
✅ j = lps[j-1] = lps[0] = 0 when i = 5 and j = 1 when txt[i] and pat[j] do not match.
✅ i = 5, j = 0: Since j is 0 and txt[i] and pat[j] do not match, we do i++.
✅ If i = 6 and j = 0, then i++ and j++ must be performed.
✅ If i = 7 and j = 1, then i++ and j++ must be performed.

This procedure is repeated until there are adequate characters in the text to be compared with the ones in the pattern.

The implementation of the tactic, as previously indicated, is demonstrated below:

Implementation of KMP Pattern Searching in C++

Example


// C++ program for implementation of KMP pattern searching
// algorithm

#include <bits/stdc++.h>

void computeLPSArray(char* pat, int M, int* lps);

// Prints occurrences of txt[] in pat[]
void KMPSearch(char* pat, char* txt)
{
	int M = strlen(pat);
	int N = strlen(txt);

	// create lps[] that will hold the longest prefix suffix
	// values for pattern
	int lps[M];

	// Preprocess the pattern (calculate lps[] array)
	computeLPSArray(pat, M, lps);

	int i = 0; // index for txt[]
	int j = 0; // index for pat[]
	while ((N - i) >= (M - j)) {
		if (pat[j] == txt[i]) {
			j++;
			i++;
		}

		if (j == M) {
			printf("Found pattern at index %d ", i - j);
			j = lps[j - 1];
		}

		// mismatch after j matches
		else if (i < N && pat[j] != txt[i]) {
			// Do not match lps[0..lps[j-1]] characters,
			//They will match anyway
			if (j != 0)
				j = lps[j - 1];
			else
				i = i + 1;
		}
	}
}

// Fills lps[] for given pattern pat[0..M-1]
void computeLPSArray(char* pat, int M, int* lps)
{
	// length of the previous longest prefix suffix
	int len = 0;

	lps[0] = 0; // lps[0] is always 0

	// the loop calculates lps[i] for i = 1 to M-1
	int i = 1;
	while (i < M) {
		if (pat[i] == pat[len]) {
			len++;
			lps[i] = len;
			i++;
		}
		else // (pat[i] != pat[len])
		{
			// This is tricky. Consider the example.
			// AAACAAAA and i = 7. The idea is similar
			// to search step.
			if (len != 0) {
				len = lps[len - 1];

				// Also, note that we do not increment
				// i here
			}
			else // if (len == 0)
			{
				lps[i] = 0;
				i++;
			}
		}
	}
}

// Driver code
int main()
{
	char txt[] = "ABABDABACDABABCABAB";
	char pat[] = "ABABCABAB";
	KMPSearch(pat, txt);
	return 0;
}

Output


Found pattern at index 10

Explanation

In the preceding code, the KMP algorithm was applied in the C++ programming language. The KMP algorithm serves the purpose of pattern matching. As a result, the pattern was identified at position 10 within the provided input, as demonstrated in the output.

Time Complexity:

The time complexity is O(N+M), where N represents the length of the given text, and M indicates the length of the pattern that needs to be identified.

Auxiliary Space:

Kmp Algorithm In C++