Interactive Programming Education Platform

Logic Practice

Introduction:

A Burkhard-Keller tree, commonly known as a BK tree, serves as a specialized data structure intended for effective approximate string comparison. This data structure proves valuable in various scenarios like spell checking, predictive text input, and genetic sequence analysis, focusing on identifying words or sequences that closely resemble a provided query. By enabling rapid retrieval of elements based on a defined edit distance, the BK tree excels in scenarios involving extensive datasets where conventional exact matching methods may not suffice.

Problem Statement:

The main issue addressed by BK trees is the effective identification of all items in a collection that fall within a specified edit distance from a search term. Typically, the edit distance, often represented by the Levenshtein distance, quantifies the number of individual character modifications (such as insertions, deletions, or substitutions) required to convert one string into another. This concept plays a vital role in various domains including information retrieval, bioinformatics, and error detection and correction.

History:

The BK tree was first presented by Walter Burkhard and Robert Keller in their 1973 publication named "Some Approaches to Best Match File Searching". This data structure takes advantage of metric space characteristics, particularly the triangle inequality, to minimize the quantity of comparisons required for locating close matches.

How does BK Tree Works?

Insertion:

✅ The tree starts with a root node containing the first element.
✅ In order to insert a new element, compute the edit distance to the root.
✅ Traverse the tree: At each node, compute the distance to the node's value, and then follow the child node corresponding to this distance.
✅ After that, insert the new element at the appropriate position once a leaf node is reached or the proper child does not exist.

Search:

✅ Given a query and a maximum allowable distance, start from the root.
✅ At each node, compute the distance to the query.
✅ Traverse children nodes whose distance to the node's value falls within the allowable range (current distance minus max distance to current distance plus max distance).
✅ Collect all nodes within the allowable range.

For Instance:

Consider constructing a BK tree using the words "book", "back", and "books". Begin by inserting "book" as the initial node. Then, proceed to insert "back" as a child node, which is at a distance of 2 from "book". Lastly, insert "books" as another child node, which is at a distance of 1 from "book".

To search for strings within a distance of 1 from "boon":

✅ Start at "book" (distance 1).
✅ Move to "books" (distance 2, not within range).
✅ No other relevant children exist, so the result is "book".

Example 1:

Here is a basic implementation of a Burkhard-Keller tree in C++ leveraging the Levenshtein distance algorithm:

Example


#include <iostream>
#include <unordered_map>
#include <vector>
#include <string>
#include <algorithm>

class BKTree {
private:
    struct Node {
        std::string word;
        std::unordered_map<int, Node*> children;
        Node(const std::string& w) : word(w) {}
    };

    Node* root;

    int levenshteinDistance(const std::string& s1, const std::string& s2) {
        int m = s1.size();
        int n = s2.size();
        std::vector<std::vector<int>> dp(m + 1, std::vector<int>(n + 1));
        for (int i = 0; i <= m; ++i) dp[i][0] = i;
        for (int j = 0; j <= n; ++j) dp[0][j] = j;

        for (int i = 1; i <= m; ++i) {
            for (int j = 1; j <= n; ++j) {
                if (s1[i - 1] == s2[j - 1]) {
                    dp[i][j] = dp[i - 1][j - 1];
                } else {
                    dp[i][j] = 1 + std::min({dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]});
                }
            }
        }
        return dp[m][n];
    }

    void insert(Node* node, const std::string& word) {
        int distance = levenshteinDistance(node->word, word);
        auto it = node->children.find(distance);
        if (it != node->children.end()) {
            insert(it->second, word);
        } else {
            node->children[distance] = new Node(word);
        }
    }

    void search(Node* node, const std::string& query, int maxDistance, std::vector<std::string>& results) {
        int distance = levenshteinDistance(node->word, query);
        if (distance <= maxDistance) {
            results.push_back(node->word);
        }
        for (auto& [dist, child] : node->children) {
            if (dist >= distance - maxDistance && dist <= distance + maxDistance) {
                search(child, query, maxDistance, results);
            }
        }
    }

public:
    BKTree() : root(nullptr) {}

    void insert(const std::string& word) {
        if (!root) {
            root = new Node(word);
        } else {
            insert(root, word);
        }
    }

    std::vector<std::string> search(const std::string& query, int maxDistance) {
        std::vector<std::string> results;
        if (root) {
            search(root, query, maxDistance, results);
        }
        return results;
    }
};

int main() {
    BKTree tree;
    tree.insert("book");
    tree.insert("back");
    tree.insert("books");

    std::string query = "boon";
    int maxDistance = 1;
    auto results = tree.search(query, maxDistance);

    std::cout << "Words within distance " << maxDistance << " from \"" << query << "\":" << std::endl;
    for (const auto& word : results) {
        std::cout << word << std::endl;
    }

    return 0;
}

Output:

Output


Words within distance 1 from "boon":
book

Explanation:

✅ Node Structure: In this example, each node contains a word and a map of child nodes indexed by their distance from the current node's word.
✅ Levenshtein Distance: The levenshteinDistance function calculates the edit distance between two strings.
✅ Insert Function: The insert function adds new words to the tree by finding the appropriate position based on the edit distance.
✅ Search Function: The search function looks for all words within a specified distance from the query, recursively checking nodes that fall within the distance range.
✅ The BK tree's structure ensures efficient insertion and searching, leveraging the properties of metric spaces to minimize unnecessary comparisons.

Insertion Complexity

✅ Time Complexity: Inserting a word into a BK tree involves comparing the new word to words already in the tree. In the worst case, each insertion requires traversing the tree from the root to a leaf node. Let d be the average distance between words and n be the number of words in the tree. The worst-case time complexity for a single insertion is O(d⋅n) , where d is the time to compute the edit distance between two words. For Levenshtein distance, d is O(m^2) for two words of length m, making the overall worst-case insertion time O(m2⋅n) .
✅ Space Complexity: Each node stores a word and a map of child nodes indexed by distance. Assuming average branching factor b per node, the space complexity for storing the tree is O(n) for n words. Additional space is required for the distance maps, but this is generally proportional to the number of nodes, hence O(n) .
✅ Inserting a word into a BK tree involves comparing the new word to words already in the tree. In the worst case, each insertion requires traversing the tree from the root to a leaf node.
✅ Let d be the average distance between words and n be the number of words in the tree.
✅ The worst-case time complexity for a single insertion is O(d⋅n) , where d is the time to compute the edit distance between two words.
✅ For Levenshtein distance, d is O(m^2) for two words of length m, making the overall worst-case insertion time O(m2⋅n) .
✅ Each node stores a word and a map of child nodes indexed by distance.
✅ Assuming average branching factor b per node, the space complexity for storing the tree is O(n) for n words.
✅ Additional space is required for the distance maps, but this is generally proportional to the number of nodes, hence O(n) .

Search Complexity

✅ Time Complexity: Searching involves traversing parts of the tree where the edit distance falls within the specified range. Let k be the number of words found within the specified edit distance r. The search time complexity depends on the branching factor and depth of the tree. On average, it can be approximated by O(br⋅d⋅n) . For a tree of depth log nlogn and branching factor b, the average search time complexity O(br⋅m2⋅logn) (assuming Levenshtein distance).
✅ Searching involves traversing parts of the tree where the edit distance falls within the specified range.
✅ Let k be the number of words found within the specified edit distance r.
✅ The search time complexity depends on the branching factor and depth of the tree. On average, it can be approximated by O(br⋅d⋅n) .
✅ For a tree of depth log nlogn and branching factor b, the average search time complexity O(br⋅m2⋅logn) (assuming Levenshtein distance).

Overall Complexity

✅ Insertion: Time: O(m2⋅n) in the worst case. Space: O(n) for storing the words and tree structure.
✅ Search: Time: O (br⋅m2⋅logn) on average, where b is the branching factor, r is the edit distance range, and mmm is the length of the words.
✅ Time: O(m2⋅n) in the worst case.
✅ Space: O(n) for storing the words and tree structure.
✅ Time: O (br⋅m2⋅logn) on average, where b is the branching factor, r is the edit distance range, and mmm is the length of the words.

Example 2:

Let's consider a scenario to demonstrate how the BK-Tree functions in C++.

Example


#include "bits/stdc++.h"
using namespace std;

// maximum number of words in wordList[]
#define MAX_WORDS 100

// defines the tolerance value
#define TOLERANCE 2

// defines maximum length of a word
#define MAX_LEN 10

struct TrieNode
{
    // stores the word of the current TrieNode
    string word;

    // links to other TrieNode in the trie
    int next[2*MAX_LEN];

    // constructors
    TrieNode(string x):word(x)
    {
        // initializing next[i] = 0
        for(int i=0; i<2*MAX_LEN; i++)
            next[i] = 0;
    }
    TrieNode() {}
};

// stores the root TrieNode
TrieNode root;

// stores every TrieNode of the trie
TrieNode trie[MAX_WORDS];

// index for current TrieNode of trie
int nodeIndex;

int minimum(int a, int b, int c)
{
    return min(a, min(b, c));
}

// Edit Distance
// Dynamic-Approach O(m*n)
int calculateEditDistance(string& a, string& b)
{
    int m = a.length(), n = b.length();
    int dp[m+1][n+1];

    // filling base cases
    for (int i=0; i<=m; i++)
        dp[i][0] = i;
    for (int j=0; j<=n; j++)
        dp[0][j] = j;

    // populating matrix using dp-approach
    for (int i=1; i<=m; i++)
    {
        for (int j=1; j<=n; j++)
        {
            if (a[i-1] != b[j-1])
            {
                dp[i][j] = minimum( 1 + dp[i-1][j], 
                                    1 + dp[i][j-1], 
                                    1 + dp[i-1][j-1] 
                                  );
            }
            else
                dp[i][j] = dp[i-1][j-1];
        }
    }
    return dp[m][n];
}

// adds current TrieNode to the trie
void addToTrie(TrieNode& rootNode, TrieNode& currentNode)
{
    if (rootNode.word == "")
    {
        // if it is the first TrieNode
        // then make it the root TrieNode
        rootNode = currentNode;
        return;
    }

    // get its calculateEditDistance from the root TrieNode
    int distance = calculateEditDistance(currentNode.word, rootNode.word);

    if (trie[rootNode.next[distance]].word == "")
    {
        /* if no TrieNode exists at this distance from root
        * make it child of root TrieNode*/

        // incrementing the pointer for current TrieNode
        nodeIndex++;

        // adding current TrieNode to the trie
        trie[nodeIndex] = currentNode;

        // currentNode as child of root TrieNode
        rootNode.next[distance] = nodeIndex;
    }
    else
    {
        // recursively find the parent for current TrieNode
        addToTrie(trie[rootNode.next[distance]], currentNode);
    }
}

vector<string> findSimilarWords(TrieNode& rootNode, string& s)
{
    vector<string> result;
    if (rootNode.word == "")
        return result;

    // calculating calculateEditDistance of s from rootNode
    int distance = calculateEditDistance(rootNode.word, s);

    // if distance is less than tolerance value
    // add it to similar words
    if (distance <= TOLERANCE) result.push_back(rootNode.word);

    // iterate over the string having tolerance
    // in range (distance-TOLERANCE , distance+TOLERANCE)
    int start = distance - TOLERANCE;
    if (start < 0)
        start = 1;

    while (start <= distance + TOLERANCE)
    {
        vector<string> temp =
            findSimilarWords(trie[rootNode.next[start]], s);
        for (auto i : temp)
            result.push_back(i);
        start++;
    }
    return result;
}

// driver program to run above functions
int main(int argc, char const *argv[])
{
    // word list
    string wordList[] = {"hell", "help", "shell", "smell",
                         "fell", "felt", "oops", "pop", "oouch", "halt"
                        };
    nodeIndex = 0;
    int listSize = sizeof(wordList)/sizeof(string);

    // adding wordList[] words on to trie
    for(int i=0; i<listSize; i++)
    {
        TrieNode temp = TrieNode(wordList[i]);
        addToTrie(root, temp);
    }

    string word1 = "ops";
    string word2 = "helt";
    vector<string> matches = findSimilarWords(root, word1);
    cout << "Similar words in dictionary for: " << word1 << ":\n";
    for (auto x : matches)
        cout << x << endl;

    matches = findSimilarWords(root, word2);
    cout << "Correct words in dictionary for " << word2 << ":\n";
    for (auto x : matches)
        cout << x << endl;

    return 0;
}

Output:

Output


similar words in dictionary for : ops:
oops
pop
Correct words in dictionary for helt:
hell
help
shell
fell
felt
halt

Explanation:

✅ Libraries and Macros #include "bits/stdc++.h": It includes a set of commonly used standard libraries in competitive programming. #define MAXN 100: It defines the maximum number of words in the dictionary. #define TOL 2: It defines the tolerance value, which specifies the maximum allowed edit distance between a given word and a word in the dictionary for them to be considered similar. #define LEN 10: It defines the maximum length of a word.
✅ Data Structures struct Node: It represents a node in the trie. It stores the word (string word) associated with the node, and an array next to store indices of child nodes in the trie. Node RT: It represents the root node of the trie. Node tree[MAXN]: An array to store all nodes of the trie. int ptr: Pointer to keep track of the current node in the trie.
✅ Functions int min(int a, int b, int c): It returns the minimum of three integers. int editDistance(string& a,string& b): It computes the edit distance between two strings using dynamic programming. void add(Node& root, Node& curr): It adds a node to the trie. It calculates the edit distance between the current word and the word associated with the root, and adds the current node accordingly. vector <string> getSimilarWords(Node& root,string& s): It retrieves similar words to a given string within the tolerance limit specified by TOL. It recursively explores the trie to find similar words. main: The main driver function. It initializes the trie with some sample words, calculates and prints similar words for two test cases.
✅ #include "bits/stdc++.h": It includes a set of commonly used standard libraries in competitive programming.
✅ #define MAXN 100: It defines the maximum number of words in the dictionary.
✅ #define TOL 2: It defines the tolerance value, which specifies the maximum allowed edit distance between a given word and a word in the dictionary for them to be considered similar.
✅ #define LEN 10: It defines the maximum length of a word.
✅ struct Node: It represents a node in the trie. It stores the word (string word) associated with the node, and an array next to store indices of child nodes in the trie.
✅ Node RT: It represents the root node of the trie.
✅ Node tree[MAXN]: An array to store all nodes of the trie.
✅ int ptr: Pointer to keep track of the current node in the trie.
✅ int min(int a, int b, int c): It returns the minimum of three integers.
✅ int editDistance(string& a,string& b): It computes the edit distance between two strings using dynamic programming.
✅ void add(Node& root, Node& curr): It adds a node to the trie. It calculates the edit distance between the current word and the word associated with the root, and adds the current node accordingly.
✅ vector <string> getSimilarWords(Node& root,string& s): It retrieves similar words to a given string within the tolerance limit specified by TOL. It recursively explores the trie to find similar words.
✅ main: The main driver function. It initializes the trie with some sample words, calculates and prints similar words for two test cases.

Time Complexity:

It is clear that the time complexity is significantly influenced by the tolerance threshold. For the purpose of this discussion, we will assume a tolerance threshold of 2. Broadly estimating, the depth of the BK Tree is expected to be logarithmic with respect to n, where n represents the size of the dictionary. Hence, the Time Complexity can be expressed as O(L1L2log n), with L1 denoting the average word length in the dictionary and L2 representing the length of the misspelled word. Typically, both L1 and L2 are of limited size.

Auxiliary Space:

The space efficiency of the preceding implementation amounts to O(L1*L2), with L1 representing the quantity of words in the dictionary and L2 denoting the word's maximum length.

Conclusion:

In summary, BK trees, named in honor of Burkhard and Keller, represent a sophisticated and effective data organization method crafted to address the challenge of close string matching. They stand out in scenarios that demand fetching strings falling within a designated edit distance, like spell checking, predictive text, and genetic analysis.

Key Points to Keep in Mind:

✅ Effective Approximate Matching: BK trees offer rapid retrieval of words within a defined edit distance from a given query, making them ideal for handling extensive datasets where exact matches fall short.
✅ Levenshtein Distance: Commonly paired with BK trees, the Levenshtein distance calculates the count of single-character modifications required to convert one string to another. This metric is leveraged by the tree arrangement to efficiently structure and retrieve data.

Inserting and Retrieving:

✅ Adding a new word requires navigating the tree to determine the appropriate location based on its proximity to other words already in place.
✅ Retrieval operations leverage the triangle inequality principle to restrict the number of comparisons, guaranteeing swift access to words falling within the specified edit distance.

Time Complexity: The efficiency of insertion and search operations is typically high, affected by factors such as the average word length and the structure of the tree.

Space Complexity: BK trees consume space in proportion to the quantity of words, along with extra space for handling child nodes and distance maps.

Bk Tree In C++