Boyer Moore Algorithm For Pattern Searching In C++

Pattern recognition is an important problem in the field of computer science. Pattern searching methods display the search results when we search for a string in a notepad/word document, browser, or database. The following is an example of a problem statement:

Given a string text[0..n-1] and a pattern[0..m-1] , write a function search(char pattern, char text) that outputs all instances of the pattern in text . You can assume that n > m . Before we get into how to search for or identify occurrences of a given pattern in a string, let's first go over what strings are.

A String represents an immutable data type that holds character sequences. Strings are the most often used data type in any computer language. A string is easily formed using quotes (either single or double quotes).

We can now apply the Boyer Moore pattern searching method (or B-M algorithm) to find the pattern within the string. The Boyer-Moore algorithm's concept is simple: two pointers are aligned at the 0th position of the text string and the corresponding character string. If a character differs from another, the Boyer-Moore algorithm moves the characters simultaneously in two ways. So, we may say that the Boyer-Moore algorithm is a hybrid of two techniques:

  • Bad Character Heuristic
  • Good Suffix Heuristic.
  • Method 1: The Bad Character Heuristic

If a match is found, the pattern's starting index is returned. Otherwise, there are two possibilities:

When the Pattern contains a mismatched character from the input text

In such instances, the character is referred to as a bad character . When a faulty character is identified, we shall move the pattern until it matches the mismatched text characters.

Consider the situation where the pattern string and the input text do not match directly. Let's say the pattern string is "WORLD" and the input text is "HELLOHELLO" . We will use the Boyer-Moore string searching technique to locate a pattern that matches the input text.

Example:

Pattern: WORLD

String: HELLOHELLO

Now, let us use the Boyer-Moore technique to compare the pattern to the input text:

Step 1: Examine the final two characters, D and O . They are incompatible.

In this example, there is no mismatched character in the pattern within the input text. Using a bad character rule, the computer changes the pattern to correspond with the mismatched character.

After the Shift:

Pattern: WORLD

Input Text: HELLOHELLO

Step 2: Match the final two characters, D and L . They are not compatible.

Change the pattern by length and run the bad character condition again.

After the Shift:

Pattern: WORLD

Input Text: HELLOHELLO

This process is repeated until the pattern is tested against the whole input text and no match is found, indicating the identified pattern cannot be found. In this example, the pattern "WORLD" does not correspond to the input string "HELLOHELLO" .

Filename: PattrernMatch.cpp

Example

#include <bits/stdc++.h>
using namespace std;
# define NO_OF_CHARS 256 
void badCharHeuristicMethod( string st, int len, int badch[NO_OF_CHARS])
{
	int i;
	// declaring with -1 for each value
	for (i = 0; i < NO_OF_CHARS; i++)
		badch[i] = -1;

	// filling up the exact value of the last occurrence
	for (i = 0; i < len; i++)
		badch[(int) st[i]] = i;
}

void search_pattern(string tx, string pattern)
{
	int m = pattern.size();
	int n = tx.size();
	int badch[NO_OF_CHARS];

	/* Fill the bad characters array with the supplied pattern by running the initial processing method badCharHeuristicMethod() */
	badCharHeuristicMethod(pattern, m, badch);

	int sl = 0; 
	while(sl <= (n - m))
	{
		int j = m - 1;
 /*As long as the characters in the pattern and the text match at this shift sl, continue to decrease the pattern's index j.*/
		while(j >= 0 && pattern[j] == tx[sl + j])
			j--;

		/*Index j will become -1, continuing the previously mentioned loop if the sequence appears at the current shift.*/
		if (j < 0)
		{
			cout << "The pattern present at the position = " << sl << endl;
			sl += (sl + m < n)? m-badch[tx[sl + m]] : 1;
		}
		else
			sl += max(1, j - badch[tx[sl + j]]);
	}
}
int main()
{
	string text= "HelloWorld";
	string pattern = "World";
	search_pattern(text, pattern);
	return 0;
}

Output:

Output

The pattern present at the position = 5

Method 2: Good Suffix Heuristic.

Let's look at an input string, "HelloWorld", with the pattern as "WORLD" To demonstrate how the Boyer-Moore method uses the Good Suffix Heuristic .

Pattern: WORLD

Input : HELLOWORLD

Now, the Good Suffix Heuristic method will be used to compare the pattern with the input text:

  • Compare the final two characters, D and D .
  • Compare the following two characters, L and L , by moving to the left.
  • Keep going until all the characters in the pattern have been matched.
  • Finally, we have found a match beginning at index 5 in the input string "HELLOWORLD" because every character in the pattern corresponds to a character in the text.

Filename: PatternMatch2.cpp

Example

#include <bits/stdc++.h>
using namespace std;
# define NO_OF_CHARS 256 

void preprocess(int *pos, int *bpos, char *pattern, int l)
{
	//l is the length of the pattern
	int i=l, j=l+1;
	bpos[i]=j;

	while(i>0)
	{
		/*if the letter at index i-1 is not the same as the character at the index j-1, then proceed to search for the boundary to the pattern's right.*/
		while(j<=l && pattern[i-1] != pattern[j-1])
		{
			
			if (pos[j]==0)
				pos[j] = j-i;

			// updating the position
			j = bpos[j];
		}
		
		i--;j--;
		bpos[i] = j;
	}
}

void preprocess2(int *pos, int *bpos,
					char *pattern, int l)
{
	int i, j;
	j = bpos[0];
	for(i=0; i<=l; i++)
	{
		/* assign the border position of the pattern's initial character to all indexes in array shift with pos[i] = 0 */
		if(pos[i]==0)
		pos[i] = j;

		/* If the suffix is smaller than bpos[0], consider the position of the next's largest border as a value of j */
		if (i==j)
			j = bpos[j];
	}
}

/*Use the Boyer Moore method with the Good suffix rule to find a pattern in a given text.*/
void searching(char *string, char *pattern)
{
	// s is shift of the pattern with respect to text
	int st=0, j;
	int l= strlen(pattern);
	int n = strlen(string);

	int bpos[l+1], pos[l+1];

	// declaring all initial positions as 0
	for(int i=0;i<l+1;i++) pos[i]=0;

	// processing methods
	preprocess(pos, bpos, pattern, l);
	preprocess2(pos, bpos, pattern, l);

	while(st <= n-l)
	{

		j = l-1;

		/* decreasing the index of j if the characters match*/
		while(j >= 0 && pattern[j] == string[st+j])
			j--;

		
			/* if the given pattern is at the current position, then the index j will be reduced to -1 */
		if (j<0)
		{
			printf("The pattern found atposition = %d\n", st);
			st += pos[0];
		}
		else

			st += pos[j+1];
	}

}

//Main section
int main()
{
	char string[] = "HelloWorldHello";
	char pattern[] = "Hello";
	searching(string, pattern);
	return 0;
}

Output:

Output

The pattern found at position = 0
The pattern found at position = 10

Input Required

This code uses input(). Please provide values below: