In this guide, we will explore McCreight's algorithm in C++ along with its background, coding implementation, and various other aspects.
McCreight's Algorithm Introduction:
McCreight's technique for generating suffix trees is a significant approach in computer science. This particular data structure plays a crucial role in tasks related to manipulating strings and identifying patterns within them. Developed by Edward M. McCreight in 1976, this algorithm efficiently constructs a suffix tree for a given string. Consequently, it enables rapid searches for substrings and various string-related operations.
Problem Statement:
McCreight's algorithm addresses the challenge of constructing a suffix tree for a specified string S. This tree illustrates all potential suffixes of S, enabling the identification of substrings, longest common substrings, and various string-related tasks with ease. The objective is to execute these operations proficiently to ensure minimal time consumption.
History of McCreight's Algorithm:
In 1976, Edward M. McCreight introduced his method in a paper titled "A Space-Economical Suffix Tree Construction Algorithm". The primary importance of this strategy lies in its capacity to generate a suffix tree in linear O(n) time complexity, with n denoting the size of the input string S. This advancement was notable as earlier techniques were comparatively less effective, often requiring O(n2) time.
The contributions of Weiner (1973) and Ukkonen (1995) laid the groundwork for developing suffix trees, which were further advanced by McCreight. To maintain a linear runtime and ensure space usage remains proportional to the size of the input string (n), they drew upon these concepts to devise an algorithm that operates in O(n) time.
What are Suffix Trees?
A suffix tree is a tree-shaped data structure that illustrates all the suffixes of a specified string S with a length of n. Each distinct route from the initial node to any terminal node corresponds to a single suffix.
Suffix trees are capable of efficiently addressing various challenges related to strings, including searching for substrings, identifying the longest common substrings, and matching patterns. However, the rapid construction of these trees for lengthy strings posed a significant challenge until the introduction of McCreight's innovation.
McCreight's Algorithm in Action:
McCreight's algorithm is employed to construct a suffix tree efficiently within linear time complexity. Initially, the algorithm utilizes a straightforward representation and subsequently incrementally incorporates suffixes while maintaining the essential properties of the tree.
Here's a high-level overview of McCreight's algorithm:
- Phase 1: First, create a tree from the first two characters of string S. The suffix tree is built on this.
- Phase 2: Add each next character of S to the tree, one by one. Update the tree at every stage so that it represents all suffixes of the processed substring.
- Updating Tree Structure: Active point and end point concepts are employed to efficiently add new suffixes while preserving properties of a suffix tree. This step entails updating and traversing the structure of a tree with respect to the current processed suffix.
- Time Complexity O(n): McCreight's algorithm runs in linear time i.e., O(n) where n represents length of input string S. To ensure such efficiency, unnecessary re-traversals are avoided using implicit suffix links as well as active points.
C++ Implementation of McCreight's Algorithm:
To implement McCreight's algorithm in C++, we should define necessary data structures for representing a suffix tree and then follow steps stated above. Below is an overview:
- Data Structures: It consists structures for nodes, edges and the suffix tree itself.
- Tree Construction: Initial construction of trees should be implemented including extension phases.
- Updating Tree Structure: Dynamically update functions for adding new suffixes while maintaining its properties as discussed.
- Efficiency Considerations: Ensure that the algorithm operates efficiently in linear time relative to the input size ? .
- Efficiency Considerations: Ensure that the algorithm operates efficiently in linear time relative to the input size ? .
Example:
Let's consider a scenario to demonstrate McCreight's Algorithm in the C++ programming language.
#include <iostream>
#include <string>
#define MAX_CHAR 5 //MAX_CHAR is alphabet size, which includes "$"
#define MARKER '#' //when serializing a tree, put "#" behind node all children of which have already been stored in txt file
using namespace std;
struct treenode {
struct treenode *children[MAX_CHAR];
int start;
int end;
struct treenode *parent;
struct treenode *suffixlink;
treenode() { //constructor
for (int s = 0; s < MAX_CHAR; ++s) {
children[s] = NULL;
}
parent = NULL;
suffixlink = NULL;
start = -1;
end = -1;
}
~treenode() {} //destructor
};
struct node_point {
int pstart;
int pend;
node_point(struct treenode *node) {
pstart = node->start;
pend = node->end;
}
~node_point() {}
};
// Function to jumpdown, return the node where jumpdown ends at
treenode *jumpdown(treenode *node, node_point beta, const string &T) {
treenode *current = node;
int i = 0;
while (beta.pstart <= beta.pend) {
i = 0;
while (T[current->children[i]->start] != T[beta.pstart]) { // choosing the right direction to jumpdown
++i;
}
current = current->children[i];
beta.pstart = beta.pstart + (current->end - current->start) + 1; // update beta.pstart such that beta.pstart is the position where next turn of choosing right direction begins at
}
if (beta.pstart == beta.pend + 1) { // jumpdown stops at node "current"
return current;
} else { // jumpdown stops at an edge, need split the edge and insert new node "w"
treenode *w = new treenode();
w->start = current->start;
w->end = current->end - (beta.pstart - beta.pend) + 1;
w->parent = current->parent;
current->parent->children[i] = w;
current->parent = w;
current->start = w->end + 1;
w->children[0] = current;
return w;
}
}
// Function to walkdown, return the node where walkdown ends at
treenode *walkdown(treenode *node, node_point &gamma, int &p, const string &T) { // p records that the leafnode which is going to be inserted is the p-th children of the node which walkdown ends at
treenode *current = node;
treenode *return_node;
while (gamma.pstart <= gamma.pend) {
int i = 0;
while (current->children[i] != NULL && T[current->children[i]->start] != T[gamma.pstart]) { // choosing the right direction to walkdown
++i;
}
if (!current->children[i]) { // current->children[i] == NULL, indicates walkdown stops at node current
p = i;
return_node = current;
break;
} else {
int intermediate = gamma.pstart + (current->children[i]->end - current->children[i]->start);
int dist = 0;
while (T[current->children[i]->start + dist] == T[gamma.pstart] && gamma.pstart <= min(intermediate, gamma.pend)) {
++dist;
++(gamma.pstart);
}
if (gamma.pstart > min(intermediate, gamma.pend)) { // this only happens when matching till node current->children[i], so continue walkdown
current = current->children[i];
} else { // this only happens when mismatch happens in the middle of an edge, split this edge and add a new node "new_node_v"
treenode * new_node_v = new treenode();
new_node_v->start = current->children[i]->start;
new_node_v->end = current->children[i]->start + dist - 1;
current->children[i]->start = current->children[i]->start + dist;
current->children[i]->parent = new_node_v;
new_node_v->children[0] = current->children[i];
new_node_v->parent = current;
p = 1;
current->children[i] = new_node_v;
return_node = new_node_v;
break;
}
}
}
return return_node;
}
// Function to build the suffix tree
treenode *build_suffix_tree(const string &T, int len) {
// initialize root and leafnode0 for the suffix tree
treenode *root = new treenode();
root->parent = root;
root->suffixlink = root;
treenode *leafnode0 = new treenode();
root->children[0] = leafnode0;
leafnode0->parent = root;
leafnode0->start = 0;
leafnode0->end = len;
//initialize node_v_j in the McCreight lecture
treenode *node_v_j;
node_v_j = root;
treenode *w;
node_point beta = node_point(root);
node_point gamma = node_point(leafnode0);
for (int j = 1; j <= len; ++j) { // now we're going to insert suffix j
treenode *current_leafnode = new treenode(); // "current_leafnode" is the node we are going to insert in this turn
current_leafnode->start = j;
current_leafnode->end = len;
int I = 1; // indicator of whether need jumpdown
if (node_v_j == root) {// no need to jumpdown
++gamma.pstart;
I = 0;
} else if (node_v_j->parent == root) {
if (beta.pstart == beta.pend) {// no need to jumpdown
I = 0;
} else {// need to jumpdown
++beta.pstart;
}
}
if (I == 0) {
// No jumpdown, only need walkdown from the root;
int p = 0;
treenode * new_node_v;
new_node_v = walkdown (root, gamma, p, T);
new_node_v->children[p] = current_leafnode;
current_leafnode->parent = new_node_v;
current_leafnode->start = gamma.pstart;
} else { // need jumpdown;
w = jumpdown(node_v_j->parent->suffixlink, beta, T);
node_v_j->suffixlink = w;
if (!w->children[1]) { // w->children[1] = NULL; w is the split node when jumpdown is finished, so no need for walkdown, inserting leafnode directly
w->children[1] = current_leafnode;
current_leafnode->start = gamma.pstart;
current_leafnode->parent = w;
} else { // jumpdown stops at an existed node w, need walkdown from node "w"
int p = 0;
treenode * new_node_v;
new_node_v = walkdown (w, gamma, p, T);
new_node_v->children[p] = current_leafnode;
current_leafnode->parent = new_node_v;
current_leafnode->start = gamma.pstart;
}
}
//update gamma, beta and node_v_j
gamma = node_point(current_leafnode);
beta = node_point(current_leafnode->parent);
node_v_j = current_leafnode->parent;
}
return root;
}
void printSuffixTree(treenode *node, const string &T, int level = 0) {
if (node == nullptr) return;
// Print the current node
for (int i = 0; i < level; ++i) cout << " ";
if (node->start == -1) cout << "ROOT" << endl;
else cout << "[" << node->start << ", " << node->end << "]" << endl;
// Recursively print children
for (int i = 0; i < MAX_CHAR; ++i) {
if (node->children[i] != nullptr) {
printSuffixTree(node->children[i], T, level + 1);
}
}
}
int main() {
const string T = "banana"; // Example input string
int len = T.length();
treenode *root = build_suffix_tree(T, len);
cout << "Suffix tree construction completed." << endl;
// Print the suffix tree
cout << "Suffix Tree:" << endl;
printSuffixTree(root, T);
return 0;
}
Output:
Suffix tree construction completed.
Suffix Tree:
ROOT
[0, 6]
[1, 1]
[2, 3]
[4, 6]
[6, 6]
[6, 6]
[2, 3]
[4, 6]
[6, 6]
[6, 6]
Explanation:
- Data Structures: treenode: a node in the suffix tree, having pointers to its children, parent and suffix link also indices that represent start and end positions of substring it represents. node_point: point in the suffix tree represented by the start and end indices of a node.
- Jumpdown Function (jumpdown): Takes as input a starting node, a point beta and the input string T and determines where the jump-down operation ends It moves down the tree by taking appropriate child depending on characters in T.
- Walkdown Function (walkdown): Similar to jumpdown but can involve splitting edges if required. Matches characters from input string T against edge s of tree while walking down If mismatch occurs at some point within an edge, then splits that edge into two parts.
- Suffix Tree Construction (buildsuffixtree): It constructs a suffix tree from input string T Root and leaf nodes are initialized during this process. Jump-downs/walk-downs are used to insert suffixes into the tree.
- Print Suffix Tree (printSuffixTree): Recursively prints out all nodes of the given suffix-tree It starts with root node going through each child until reaches leaf-node and then backtracks again printing start/end indices.
- Main Function: It initializes the input string T. It constructs the suffix tree using buildsuffixtree. It prints the constructed suffix tree using printSuffixTree.
- treenode: a node in the suffix tree, having pointers to its children, parent and suffix link also indices that represent start and end positions of substring it represents.
- node_point: point in the suffix tree represented by the start and end indices of a node.
- Takes as input a starting node, a point beta and the input string T and determines where the jump-down operation ends
- It moves down the tree by taking appropriate child depending on characters in T.
- Similar to jumpdown but can involve splitting edges if required.
- Matches characters from input string T against edge s of tree while walking down
- If mismatch occurs at some point within an edge, then splits that edge into two parts.
- It constructs a suffix tree from input string T
- Root and leaf nodes are initialized during this process.
- Jump-downs/walk-downs are used to insert suffixes into the tree.
- Recursively prints out all nodes of the given suffix-tree
- It starts with root node going through each child until reaches leaf-node and then backtracks again printing start/end indices.
- It initializes the input string T.
- It constructs the suffix tree using buildsuffixtree.
- It prints the constructed suffix tree using printSuffixTree.
- McCreight's Algorithm is an algorithm designed to create a suffix tree for a string of length n in O(n) time. This is made possible through linear time complexity by processing each character of the string in an efficient manner, and updating suffix tree structure dynamically with the help of active point and end point ideas.
- Space complexity for McCreight's Algorithm is generally considered as O(n), where n represents the input size. In terms of input string length, this means that it uses space proportional to it; which is needed to keep record about what has been done so far during construction process (suffix representation).
Time and Space Complexities:
Conclusion:
In brief, McCreight's method enhanced the construction of suffix trees by offering a linear-time resolution that was previously unavailable. It enhanced the capabilities of processing strings and made intricate text-based algorithms more manageable. Consequently, this approach plays a crucial role in numerous real-world scenarios as it forms the foundation for different strategies employed in pattern matching within string or file contexts, thereby enabling efficient resolution of computational challenges across diverse domains.