Pattern identification is a crucial challenge within the realm of computer science. Techniques for pattern retrieval exhibit the outcomes when querying a string within applications like a notepad, word document, web browser, or database. The subsequent excerpt illustrates a typical scenario:
Create a function named search(char pattern, char text) that identifies and returns all occurrences of a specified pattern within a text. It is assumed that the length of text is greater than the length of the pattern. Before delving into the methodology of locating or recognizing instances of a particular pattern within a string, let's begin by understanding the concept of strings.
A String denotes an unchangeable data type that stores sequences of characters. Strings are the most commonly utilized data type in any programming language. Creating a string is simple by enclosing the characters within single or double quotation marks.
We can employ the Boyer Moore pattern search technique (also known as B-M algorithm) to locate a specific pattern within a given string. The fundamental idea behind the Boyer-Moore algorithm is straightforward: it involves positioning two pointers at the beginning of the text string and the corresponding pattern string. In cases where there is a mismatch between characters, the Boyer-Moore algorithm shifts both pointers in a strategic manner. Therefore, we can describe the Boyer-Moore algorithm as a fusion of two strategies:
- Bad Character Heuristic
- Good Suffix Heuristic.
Method 1: The Bad Character Heuristic
If a match is discovered, the index where the pattern begins is returned. If not, there are two potential outcomes:
When the Pattern contains a mismatched character from the input text
In such cases, the character causing issues is known as a bad character. Once we detect a problematic character, we will shift the pattern until it aligns with the incorrect text characters.
In a scenario where the pattern string and input text do not align directly, for instance, with the pattern string being "WORLD" and the input text being "HELLOHELLO", we can employ the Boyer-Moore algorithm for string searching to identify a matching pattern within the input text.
Example:
Pattern: WORLD
String: HELLOHELLO
Now, we will apply the Boyer-Moore algorithm to match the pattern with the given input text:
Examine step one: Evaluate the last two characters, D and O, which are not compatible.
In this instance, there are no mismatched characters in the pattern present in the provided text. By utilizing a bad character rule, the system adjusts the pattern to align with the mismatched character.
After the Shift:
Pattern: WORLD
Input Text: HELLOHELLO
Match the last two characters, D and L, which are incompatible.
Modify the pattern based on its length and then reapply the bad character condition.
After the Shift:
Pattern: WORLD
Input Text: HELLOHELLO
This procedure is iterated until the pattern is compared against the entire input text without any matches, signifying that the specific pattern is not present. For instance, the pattern "WORLD" does not align with the input sequence "HELLOHELLO".
Filename: PattrernMatch.cpp
#include <bits/stdc++.h>
using namespace std;
# define NO_OF_CHARS 256
void badCharHeuristicMethod( string st, int len, int badch[NO_OF_CHARS])
{
int i;
// declaring with -1 for each value
for (i = 0; i < NO_OF_CHARS; i++)
badch[i] = -1;
// filling up the exact value of the last occurrence
for (i = 0; i < len; i++)
badch[(int) st[i]] = i;
}
void search_pattern(string tx, string pattern)
{
int m = pattern.size();
int n = tx.size();
int badch[NO_OF_CHARS];
/* Fill the bad characters array with the supplied pattern by running the initial processing method badCharHeuristicMethod() */
badCharHeuristicMethod(pattern, m, badch);
int sl = 0;
while(sl <= (n - m))
{
int j = m - 1;
/*As long as the characters in the pattern and the text match at this shift sl, continue to decrease the pattern's index j.*/
while(j >= 0 && pattern[j] == tx[sl + j])
j--;
/*Index j will become -1, continuing the previously mentioned loop if the sequence appears at the current shift.*/
if (j < 0)
{
cout << "The pattern present at the position = " << sl << endl;
sl += (sl + m < n)? m-badch[tx[sl + m]] : 1;
}
else
sl += max(1, j - badch[tx[sl + j]]);
}
}
int main()
{
string text= "HelloWorld";
string pattern = "World";
search_pattern(text, pattern);
return 0;
}
Output:
The pattern present at the position = 5
Method 2: Good Suffix Heuristic.
To illustrate how the Boyer-Moore algorithm leverages the Good Suffix Heuristic, let's examine an input string "HelloWorld" with the pattern "WORLD".
Pattern: WORLD
Input : HELLOWORLD
Now, the Good Suffix Heuristic method will be used to compare the pattern with the input text:
- Compare the final two characters, D and D .
- Compare the following two characters, L and L , by moving to the left.
- Keep going until all the characters in the pattern have been matched.
- Finally, we have found a match beginning at index 5 in the input string "HELLOWORLD" because every character in the pattern corresponds to a character in the text.
Filename: PatternMatch2.cpp
#include <bits/stdc++.h>
using namespace std;
# define NO_OF_CHARS 256
void preprocess(int *pos, int *bpos, char *pattern, int l)
{
//l is the length of the pattern
int i=l, j=l+1;
bpos[i]=j;
while(i>0)
{
/*if the letter at index i-1 is not the same as the character at the index j-1, then proceed to search for the boundary to the pattern's right.*/
while(j<=l && pattern[i-1] != pattern[j-1])
{
if (pos[j]==0)
pos[j] = j-i;
// updating the position
j = bpos[j];
}
i--;j--;
bpos[i] = j;
}
}
void preprocess2(int *pos, int *bpos,
char *pattern, int l)
{
int i, j;
j = bpos[0];
for(i=0; i<=l; i++)
{
/* assign the border position of the pattern's initial character to all indexes in array shift with pos[i] = 0 */
if(pos[i]==0)
pos[i] = j;
/* If the suffix is smaller than bpos[0], consider the position of the next's largest border as a value of j */
if (i==j)
j = bpos[j];
}
}
/*Use the Boyer Moore method with the Good suffix rule to find a pattern in a given text.*/
void searching(char *string, char *pattern)
{
// s is shift of the pattern with respect to text
int st=0, j;
int l= strlen(pattern);
int n = strlen(string);
int bpos[l+1], pos[l+1];
// declaring all initial positions as 0
for(int i=0;i<l+1;i++) pos[i]=0;
// processing methods
preprocess(pos, bpos, pattern, l);
preprocess2(pos, bpos, pattern, l);
while(st <= n-l)
{
j = l-1;
/* decreasing the index of j if the characters match*/
while(j >= 0 && pattern[j] == string[st+j])
j--;
/* if the given pattern is at the current position, then the index j will be reduced to -1 */
if (j<0)
{
printf("The pattern found atposition = %d\n", st);
st += pos[0];
}
else
st += pos[j+1];
}
}
//Main section
int main()
{
char string[] = "HelloWorldHello";
char pattern[] = "Hello";
searching(string, pattern);
return 0;
}
Output:
The pattern found at position = 0
The pattern found at position = 10