In this article, we will discuss McCreight's algorithm in C++ with its history, implementation, and many others.
McCreight's Algorithm Introduction:
The method McCreight's for constructing suffix trees is an important one. A data structure used in string processing and pattern matching. It was created by Edward M. McCreight in 1976. The algorithm builds a suffix tree quickly for any given string, allowing fast substring searches and other operations on strings.
Problem Statement:
The problem that McCreight's algorithm solves is how to build a suffix tree for a given string S. A suffix tree shows all the possible suffixes of S and can be used to find substrings, longest common substrings, and other things related to strings easily. The task is to do this efficiently so that these operations take minimal time.
History of McCreight's Algorithm:
In the year 1976, Edward M. McCreight released his technique in an article called "A Space-Economical Suffix Tree Construction Algorithm". The main significance of this approach is its ability to construct a suffix tree within linear O(n) time complexity where n refers to the length of input string S which was considered groundbreaking because previous methods were less efficient usually taking time O(n2).
The work of Weiner (1973) and Ukkonen (1995) served as the foundation for the construction of suffix trees, which was built upon by McCreight. In order to still achieve a linear runtime while keeping space usage proportional to n, where n is the size of the input string, they used these ideas as inspiration for an algorithm that works in O(n) time.
What are Suffix Trees?
A suffix tree is a tree-like data structure that represents all the suffixes of a given string S with length n. Each unique path from root node down to any leaf node corresponds to one such suffix.
Suffix trees can solve different problems on strings effectively such as substring searching, longest common substrings or pattern matching but constructing them quickly over large strings was hard before McCreight's invention.
McCreight's Algorithm in Action:
An algorithm called McCreight's is used to build a suffix tree for a lineal time. The algorithm uses simple representation at first and then adds suffixes step-by-step while keeping properties of the tree.
Here's a high-level overview of McCreight's algorithm:
- Phase 1: First, create a tree from the first two characters of string S. The suffix tree is built on this.
- Phase 2: Add each next character of S to the tree, one by one. Update the tree at every stage so that it represents all suffixes of the processed substring.
- Updating Tree Structure: Active point and end point concepts are employed to efficiently add new suffixes while preserving properties of a suffix tree. This step entails updating and traversing the structure of a tree with respect to the current processed suffix.
- Time Complexity O(n): McCreight's algorithm runs in linear time i.e., O(n) where n represents length of input string S. To ensure such efficiency, unnecessary re-traversals are avoided using implicit suffix links as well as active points.
C++ Implementation of McCreight's Algorithm:
To implement McCreight's algorithm in C++, we should define necessary data structures for representing a suffix tree and then follow steps stated above. Below is an overview:
- Data Structures: It consists structures for nodes, edges and the suffix tree itself.
- Tree Construction: Initial construction of trees should be implemented including extension phases.
- Updating Tree Structure: Dynamically update functions for adding new suffixes while maintaining its properties as discussed.
- Efficiency Considerations: Ensure that the algorithm operates efficiently in linear time relative to the input size ? .
- Efficiency Considerations: Ensure that the algorithm operates efficiently in linear time relative to the input size ? .
Example:
Let us take an example to illustrate the McCreight's Algorithm in C++.
#include <iostream>
#include <string>
#define MAX_CHAR 5 //MAX_CHAR is alphabet size, which includes "$"
#define MARKER '#' //when serializing a tree, put "#" behind node all children of which have already been stored in txt file
using namespace std;
struct treenode {
struct treenode *children[MAX_CHAR];
int start;
int end;
struct treenode *parent;
struct treenode *suffixlink;
treenode() { //constructor
for (int s = 0; s < MAX_CHAR; ++s) {
children[s] = NULL;
}
parent = NULL;
suffixlink = NULL;
start = -1;
end = -1;
}
~treenode() {} //destructor
};
struct node_point {
int pstart;
int pend;
node_point(struct treenode *node) {
pstart = node->start;
pend = node->end;
}
~node_point() {}
};
// Function to jumpdown, return the node where jumpdown ends at
treenode *jumpdown(treenode *node, node_point beta, const string &T) {
treenode *current = node;
int i = 0;
while (beta.pstart <= beta.pend) {
i = 0;
while (T[current->children[i]->start] != T[beta.pstart]) { // choosing the right direction to jumpdown
++i;
}
current = current->children[i];
beta.pstart = beta.pstart + (current->end - current->start) + 1; // update beta.pstart such that beta.pstart is the position where next turn of choosing right direction begins at
}
if (beta.pstart == beta.pend + 1) { // jumpdown stops at node "current"
return current;
} else { // jumpdown stops at an edge, need split the edge and insert new node "w"
treenode *w = new treenode();
w->start = current->start;
w->end = current->end - (beta.pstart - beta.pend) + 1;
w->parent = current->parent;
current->parent->children[i] = w;
current->parent = w;
current->start = w->end + 1;
w->children[0] = current;
return w;
}
}
// Function to walkdown, return the node where walkdown ends at
treenode *walkdown(treenode *node, node_point &gamma, int &p, const string &T) { // p records that the leafnode which is going to be inserted is the p-th children of the node which walkdown ends at
treenode *current = node;
treenode *return_node;
while (gamma.pstart <= gamma.pend) {
int i = 0;
while (current->children[i] != NULL && T[current->children[i]->start] != T[gamma.pstart]) { // choosing the right direction to walkdown
++i;
}
if (!current->children[i]) { // current->children[i] == NULL, indicates walkdown stops at node current
p = i;
return_node = current;
break;
} else {
int intermediate = gamma.pstart + (current->children[i]->end - current->children[i]->start);
int dist = 0;
while (T[current->children[i]->start + dist] == T[gamma.pstart] && gamma.pstart <= min(intermediate, gamma.pend)) {
++dist;
++(gamma.pstart);
}
if (gamma.pstart > min(intermediate, gamma.pend)) { // this only happens when matching till node current->children[i], so continue walkdown
current = current->children[i];
} else { // this only happens when mismatch happens in the middle of an edge, split this edge and add a new node "new_node_v"
treenode * new_node_v = new treenode();
new_node_v->start = current->children[i]->start;
new_node_v->end = current->children[i]->start + dist - 1;
current->children[i]->start = current->children[i]->start + dist;
current->children[i]->parent = new_node_v;
new_node_v->children[0] = current->children[i];
new_node_v->parent = current;
p = 1;
current->children[i] = new_node_v;
return_node = new_node_v;
break;
}
}
}
return return_node;
}
// Function to build the suffix tree
treenode *build_suffix_tree(const string &T, int len) {
// initialize root and leafnode0 for the suffix tree
treenode *root = new treenode();
root->parent = root;
root->suffixlink = root;
treenode *leafnode0 = new treenode();
root->children[0] = leafnode0;
leafnode0->parent = root;
leafnode0->start = 0;
leafnode0->end = len;
//initialize node_v_j in the McCreight lecture
treenode *node_v_j;
node_v_j = root;
treenode *w;
node_point beta = node_point(root);
node_point gamma = node_point(leafnode0);
for (int j = 1; j <= len; ++j) { // now we're going to insert suffix j
treenode *current_leafnode = new treenode(); // "current_leafnode" is the node we are going to insert in this turn
current_leafnode->start = j;
current_leafnode->end = len;
int I = 1; // indicator of whether need jumpdown
if (node_v_j == root) {// no need to jumpdown
++gamma.pstart;
I = 0;
} else if (node_v_j->parent == root) {
if (beta.pstart == beta.pend) {// no need to jumpdown
I = 0;
} else {// need to jumpdown
++beta.pstart;
}
}
if (I == 0) {
// No jumpdown, only need walkdown from the root;
int p = 0;
treenode * new_node_v;
new_node_v = walkdown (root, gamma, p, T);
new_node_v->children[p] = current_leafnode;
current_leafnode->parent = new_node_v;
current_leafnode->start = gamma.pstart;
} else { // need jumpdown;
w = jumpdown(node_v_j->parent->suffixlink, beta, T);
node_v_j->suffixlink = w;
if (!w->children[1]) { // w->children[1] = NULL; w is the split node when jumpdown is finished, so no need for walkdown, inserting leafnode directly
w->children[1] = current_leafnode;
current_leafnode->start = gamma.pstart;
current_leafnode->parent = w;
} else { // jumpdown stops at an existed node w, need walkdown from node "w"
int p = 0;
treenode * new_node_v;
new_node_v = walkdown (w, gamma, p, T);
new_node_v->children[p] = current_leafnode;
current_leafnode->parent = new_node_v;
current_leafnode->start = gamma.pstart;
}
}
//update gamma, beta and node_v_j
gamma = node_point(current_leafnode);
beta = node_point(current_leafnode->parent);
node_v_j = current_leafnode->parent;
}
return root;
}
void printSuffixTree(treenode *node, const string &T, int level = 0) {
if (node == nullptr) return;
// Print the current node
for (int i = 0; i < level; ++i) cout << " ";
if (node->start == -1) cout << "ROOT" << endl;
else cout << "[" << node->start << ", " << node->end << "]" << endl;
// Recursively print children
for (int i = 0; i < MAX_CHAR; ++i) {
if (node->children[i] != nullptr) {
printSuffixTree(node->children[i], T, level + 1);
}
}
}
int main() {
const string T = "banana"; // Example input string
int len = T.length();
treenode *root = build_suffix_tree(T, len);
cout << "Suffix tree construction completed." << endl;
// Print the suffix tree
cout << "Suffix Tree:" << endl;
printSuffixTree(root, T);
return 0;
}
Output:
Suffix tree construction completed.
Suffix Tree:
ROOT
[0, 6]
[1, 1]
[2, 3]
[4, 6]
[6, 6]
[6, 6]
[2, 3]
[4, 6]
[6, 6]
[6, 6]
Explanation:
- Data Structures: treenode: a node in the suffix tree, having pointers to its children, parent and suffix link also indices that represent start and end positions of substring it represents. node_point: point in the suffix tree represented by the start and end indices of a node.
- Jumpdown Function (jumpdown): Takes as input a starting node, a point beta and the input string T and determines where the jump-down operation ends It moves down the tree by taking appropriate child depending on characters in T.
- Walkdown Function (walkdown): Similar to jumpdown but can involve splitting edges if required. Matches characters from input string T against edge s of tree while walking down If mismatch occurs at some point within an edge, then splits that edge into two parts.
- Suffix Tree Construction (buildsuffixtree): It constructs a suffix tree from input string T Root and leaf nodes are initialized during this process. Jump-downs/walk-downs are used to insert suffixes into the tree.
- Print Suffix Tree (printSuffixTree): Recursively prints out all nodes of the given suffix-tree It starts with root node going through each child until reaches leaf-node and then backtracks again printing start/end indices.
- Main Function: It initializes the input string T. It constructs the suffix tree using buildsuffixtree. It prints the constructed suffix tree using printSuffixTree.
- treenode: a node in the suffix tree, having pointers to its children, parent and suffix link also indices that represent start and end positions of substring it represents.
- node_point: point in the suffix tree represented by the start and end indices of a node.
- Takes as input a starting node, a point beta and the input string T and determines where the jump-down operation ends
- It moves down the tree by taking appropriate child depending on characters in T.
- Similar to jumpdown but can involve splitting edges if required.
- Matches characters from input string T against edge s of tree while walking down
- If mismatch occurs at some point within an edge, then splits that edge into two parts.
- It constructs a suffix tree from input string T
- Root and leaf nodes are initialized during this process.
- Jump-downs/walk-downs are used to insert suffixes into the tree.
- Recursively prints out all nodes of the given suffix-tree
- It starts with root node going through each child until reaches leaf-node and then backtracks again printing start/end indices.
- It initializes the input string T.
- It constructs the suffix tree using buildsuffixtree.
- It prints the constructed suffix tree using printSuffixTree.
- McCreight's Algorithm is an algorithm designed to create a suffix tree for a string of length n in O(n) time. This is made possible through linear time complexity by processing each character of the string in an efficient manner, and updating suffix tree structure dynamically with the help of active point and end point ideas.
- Space complexity for McCreight's Algorithm is generally considered as O(n), where n represents the input size. In terms of input string length, this means that it uses space proportional to it; which is needed to keep record about what has been done so far during construction process (suffix representation).
Time and Space Complexities:
Conclusion:
In summary, McCreight's algorithm made significant improvements upon how suffix trees were constructed by providing a linear-time solution where none existed before. It also greatly improved string-processing abilities while making advanced text-based algorithms less difficult. As such, this method is key in many practical applications since it underlies various techniques used for pattern matching when dealing with strings or files containing them; thus enabling us solve computational problems efficiently over wide areas.