Interactive Programming Education Platform

Logic Practice

Introduction

HITS serves as a link analysis technique widely utilized in the realm of web search; it originated from the research efforts of Jon Kleinberg and is referred to as Hyperlink-Induced Topic Search. In contrast to PageRank, which focuses on overall popularity, HITS distinguishes between two categories of pages: hubs and authorities. An authority page typically contains valuable information, evident from the citations it receives, whereas a hub page links to authoritative pages of significance. Each webpage on the site is assigned two ranking scores: one for being a hub and the other for being an authority. These scores are then adjusted based on the connections within the web graph. A multitude of hubs direct towards authorities, and vice versa, illustrating reciprocal relationships. The HITS algorithm is particularly beneficial for identifying cohesive sets of content in densely interconnected environments, like article references or topic-specific directories, though it may be susceptible to distortions caused by link spam or graph loops.

Mathematical Concept behind HITS Algorithm

The HITS algorithm is based on a mathematical structure that incorporates concepts from graph theory and linear algebra. It represents the internet as a directed graph, with individual web pages serving as nodes and the hyperlinks as directional edges. Within this framework, each webpage receives two distinct values: an authority score and a hub score. These numerical values are computed through a series of iterative calculations that continue until reaching a state of convergence.

1. Adjacency Matrix

The graph structure denoted as A is symbolized by an adjacency matrix A, where A_ij equals 1 if page i contains a hyperlink to page j, and 0 if not. The information within the rows and columns of A is employed in computing hub and authority scores, focusing on inbound and outbound links correspondingly.

2. Authority and Hub Scores

Let x and y represent the authority and hub ratings of individual pages, respectively.

✅ The authority ratings xi for pages i are calculated based on the combined hub scores of all pages that have links pointing to them: x i =∑ j A ji y j.
✅ Similarly, the hub rating y i for page i is influenced by the collective authority ratings of the pages it is linked to: y i =∑ j A ji y j.

3. Iterative Computation

After setting the initial scores, the matrices x and y undergo modifications through the matrix operations below.

✅ Revise authority vector: x=A T y
✅ Revise hub vector: y=Ax

4. Normalization and Convergence

To avoid score deviation, the values of x and y are adjusted following each full update. This iteration continues until an equilibrium in the scores is reached, signifying the ranking of both the hubs and the authorities.

The iterative approach in this scenario reflects the concept of mutual strengthening: pages with elevated scores boost those with lower scores and reciprocally, establishing a significant ranking mechanism for the pages.

Key Components of the HITS Algorithm

The HITS (Hyperlink-Induced Topic Search) algorithm comprises various essential elements that delineate its framework and operational characteristics:

1. Hub and Authority Scores:

According to the HITS algorithm for web ranking, each webpage receives two scores: a hub score and an authority score. A hub is essentially a page that links to numerous authoritative pages, while an authority is a page that is linked to by many hub pages. This bipartite scoring mechanism helps pinpoint the most pertinent pages in terms of content credibility (authority) and page endorsements (hub).

2. Graph Representation:

A website's structure can be visually depicted through a direct graph, with web pages represented as nodes and hyperlinks as edges. These connections signify one page's endorsement of another, playing a crucial role in defining hub and authority dynamics within the site.

3. Adjacency Matrix:

An examination of HITS involves utilizing an adjacency matrix that indicates the connections between different websites. Whenever a link exists from page i to page j, the value Aij in the adjacency matrix is set to 1; otherwise, it is set to 0. This matrix plays a crucial role in updating authority and hub scores during each iteration.

4. Mutual Reinforcement Relationship:

The algorithm relies on the symbiotic connection between hubs and authorities. When a hub with a high score links to a page, it boosts the authority score of that page. Similarly, when a page links to high-scoring authorities, it enhances its own hub score. This interdependence of scores requires them to be calculated successively and iteratively.

5. Iterative Computation:

HITS employs an iterative process to update the Hub and authority scores based on their link structure. The calculation commences with the original scores (typically set to one) for each page, and these scores are iteratively recalculated for all linked pages until convergence is achieved.

6. Normalization and Convergence:

Each final score within a sequence divides all scores in a way that ensures the dominant score remains stable across multiple iterations. This normalization technique is crucial for preventing scores from drifting apart. The process of mean shifting continues until a point is reached where the authority and hub scores stabilize, indicating that they have effectively established connections to understand their respective values.

7. Output:

The end result of the algorithm comprises pages along with their corresponding authority and hub scores, presented in a list sorted by significance as authorities and pertinence as hubs. This outcome plays a crucial role in identifying content clusters and determining the most pertinent pages within a specific website. By integrating these key elements, the HITS algorithm can analyze the linking patterns and accurately prioritize pages based on their relevance and authority.

The HITS algorithm in its steps

HITS, also known as Hyperlink-Induced Topic Search, represents an algorithm that executes various steps to determine the authority and hub scores of web pages based on their link structures. Here is the process it entails:

1. Initialization of Scores:

Launch every page with initial prestige and hub values, typically setting 1 as the default. As a result, modifications will be applied to this value in the upcoming iterations.

2. Looping Through:

Authority Update: Enhancing the authoritative standing of each page involves adjusting its authority score, denoted as A. This adjustment is achieved by aggregating the hub scores (H) for each page N. The process involves manually calculating the authority score using the formula: xi = Σ j->i hj, where hj represents the frequency of H measuring page j.

Hub Update: When considering each page's average number of linked pages in ICPP tutorials, the hub score H is recalculated based on the sum of x's over j, indicating authority. This can be expressed mathematically as: yi = Øj->i xj, where xj represents the frequency of i pointing to authoritative pages j.

3. Normalization:

Mass authority or hub values can be altered without any deviation. This process is more biological in nature, as it typically involves dividing a specific value by the total sum of all values within the same category.

4. Check for Convergence:

The iterative computation and standardization process can be repeated until the authority and hub rankings stabilize. This convergence usually occurs when adjustments to a score from one iteration to the next are below a specified threshold value.

5. Interpret Results:

Once the algorithm has reached convergence, it will generate pages displaying their ultimate authority and hub scores. Pages with a high authority score are recognized as experts on the topic, whereas pages with a high hub score are valuable sources for authoritative pages.

6. Rank And Analyze:

Arrange the pages based on their hub and authority scores. Pages that exhibit high authority and hub scores are deemed significant and trustworthy for search engine rankings and dispensing knowledge, respectively.

This sequential method of reciprocal strengthening between hub and authority scores enables the HITS algorithm to identify not just highly ranked pages worldwide, but also proficient hub nodes present on these pages within a web graph.

Implementing the HITS Algorithm in C++

It is a basic application of the HITS Algorithm in C++ programming language. The program employs an adjacency matrix to represent a web graph, where pages are interconnected, and the HITS algorithm computes both the authority and hub scores through iterative updates.

Example


#include <iostream>
#include <vector>
#include <cmath>

using namespace std;

// Function to normalize a vector
void normalize(vector<double>& scores) {
    double sum = 0.0;
    for (double score : scores) {
        sum += score * score;
    }
    sum = sqrt(sum);
    for (double& score : scores) {
        score /= sum;
    }
}

// HITS Algorithm Function
void hitsAlgorithm(vector<vector<int>>& adjMatrix, int maxIterations = 100, double threshold = 1e-6) {
    int n = adjMatrix.size();
    vector<double> authorityScores(n, 1.0);  // Initial authority scores
    vector<double> hubScores(n, 1.0);        // Initial hub scores

    for (int iter = 0; iter < maxIterations; ++iter) {
        vector<double> newAuthorityScores(n, 0.0);
        vector<double> newHubScores(n, 0.0);

        // Update authority scores
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                if (adjMatrix[j][i] == 1) { // if there's a link from j to i
                    newAuthorityScores[i] += hubScores[j];
                }
            }
        }

        // Update hub scores
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                if (adjMatrix[i][j] == 1) { // if there's a link from i to j
                    newHubScores[i] += authorityScores[j];
                }
            }
        }

        // Normalize the new scores
        normalize(newAuthorityScores);
        normalize(newHubScores);

        // Check for convergence
        double error = 0.0;
        for (int i = 0; i < n; ++i) {
            error += fabs(newAuthorityScores[i] - authorityScores[i]) + fabs(newHubScores[i] - hubScores[i]);
        }
        if (error < threshold) {
            break;
        }

        // Update scores for the next iteration
        authorityScores = newAuthorityScores;
        hubScores = newHubScores;
    }

    // Output the authority and hub scores
    cout << "Final Authority Scores:" << endl;
    for (int i = 0; i < n; ++i) {
        cout << "Page " << i << ": " << authorityScores[i] << endl;
    }

    cout << "Final Hub Scores:" << endl;
    for (int i = 0; i < n; ++i) {
        cout << "Page " << i << ": " << hubScores[i] << endl;
    }
}

int main() {
    // Example adjacency matrix for a web graph with 4 pages
    vector<vector<int>> adjMatrix = {
        {0, 1, 1, 0},
        {0, 0, 1, 1},
        {0, 0, 0, 1},
        {1, 0, 0, 0}
    };

    hitsAlgorithm(adjMatrix);

    return 0;
}

Output:

Output


Final Authority Scores:
Page 0: 2.08245e-10
Page 1: 0.327985
Page 2: 0.736976
Page 3: 0.591009
Final Hub Scores:
Page 0: 0.591009
Page 1: 0.736976
Page 2: 0.327985
Page 3: 2.08245e-10

Explanation:

✅ A Representational Approach to Genetic Programming: The web graph is represented as an adjacency matrix (adjMatrix), with adjMatrixi = 1 representing there is a link from page i to page j, and 0 otherwise.
✅ Initialization: Authority and hub scores for each page are initialized to 1.0.
✅ Iterative updates: For each iteration, calculate new authority scores for each page as the sum of hub scores from all pages linking to it. Calculate new hub scores for each page as the sum of authority scores of all pages it links to. The scores are then normalized so that they fall on the same scale.
✅ Convergence Check: The algorithm will terminate if authority and hub score changes between iterations are beneath a threshold to make sure that the scores have actually converged.
✅ Output: The final authority and hub scores for each page are printed, indicating the pages' overall most significant authorities and hubs.

Applications of the HITS Algorithm

The HITS (Hyperlink-Induced Topic Search) algorithm has a range of applications, primarily where link structures are used to identify authoritative and hub resources within a network. Here are some of the key applications:

✅ Web Search and Information Retrieval: In search engines, HITS is used to rank web pages based on the web link structure. analysing the link structure of the web. Hub pages are directories or aggregators that link to authority pages, making search results more relevant; and authority pages are considered trustworthy sources on particular topics. The HITS algorithm is used to identify influential nodes in social networks.to rank web pages by analysing the link structure of the web. Authority pages are considered trustworthy sources on specific topics, and hub pages are directories or aggregators that link to these authority pages, enhancing the relevance of search results.
✅ Social Network Analysis: In social networks, the HITS algorithm helps identify influential nodes. Hubs are users or accounts that connect to these influential entities and authorities can be key individuals or organizations that are frequently cited. It helps to understand the information flow and influence within the network.
✅ Academic Citation Analysis: HITS can separate important research papers (authorities) from literature reviews or survey papers (hubs) that link to influential research in academic citation networks sources on specific topics, and hub pages are directories or aggregators that link to these authority pages, enhancing the relevance of search results.
✅ Product Recommendation Systems and E-commerce : The HITS algorithm can be used to recommend products by identifying high-activity hub users or curated lists that frequently link or reference authoritative products. On specific topics, and hub pages are directories or aggregators that link to these authority pages, enhancing the relevance of search results.
✅ Biological and Medical Networks: In biology, HITS can be applied to analyse gene and protein interaction networks, identifying key proteins (authorities) that interact with numerous other proteins (hubs). By pin pointing target proteins or genes that are key to understanding disease, this helps.
✅ Document and Content Recommendation System: HITS can be used by content platforms, like news websites or streaming services, to recommend documents or media by finding trending or high-quality content (authorities) and curated lists or playlists that aggregate this content (hubs).
✅ Fraud Detection in Financial Networks: HITS can be used in financial networks to find out which accounts are authorities or hubs in the suspicious network structures of the web. Authority pages are considered trustworthy sources on specific topics, and hub pages are directories or aggregators that link to these authority pages, enhancing the relevance of search results.
✅ Topic-Specific Search and Content Clustering: Specific Search by Topics and Distribution of Related Articles. HITS is extensible for topic-sensitive search applications to develop a set of clusters of objects the topic of interest to create clusters of related content. HITS used in a topic domain can identify groups of related pages and can be effective in digital libraries and educational resources.

Limitations of the HITS Algorithm

These applications make use of the mutual reinforcement principle of HITS that allows one to easily build of important nodes or content in a large network.

✅ Query-Dependence: HITS is designed to work on a small subset of web pages related to a specific search query rather than the entire web. This limits its scalability, as it must be run separately for each query, which is computationally intensive.
✅ Susceptibility to Link Spam: Since HITS relies on link structure to determine authority and hub scores, it is highly susceptible to link manipulation. For example, spam website pages that contain unimportant and irrelevant information can establish links between the pages to make it challenging to distinguish between spam authority pages. In practice, it might deviate from the set topic of search since HITS only uses hyperlink information to create clusters of related content. By analysing page links within a topic domain, HITS helps find clusters of relevant pages, which can be useful in digital libraries or educational platforms.
✅ Tendency to Drift: In practice, HITS may drift away from the original query topic due to its reliance on link analysis. If irrelevant or highly linked pages are included in the subset, a large number of results may flood in and the final set returns highly unwanted results due to a skewed algorithm.
✅ Difficulty with PageRank-like Global Authority: Unlike PageRank, which can calculate a global ranking of web pages, HITS focuses on query-specific subsets. This makes it less suitable for providing a complete ranking solution since it only ranks hubs and authorities with respect to the limited context, reducing its application for general purposes.
✅ Not Effective for Dynamic or Real Time-Based Updates: The computation of scores using HITS is iterative and hence costs a lot of time in environments such as the dynamic. Real-time score recalculation is very difficult and is less suitable for fast- moving content.
✅ Favouritism of Well-Liked Nodal Points and Authorities: On the one hand, HITS seems to select highly popular hub pages which link to many pages and well-linked authorities. It is computationally expensive and inefficient for dynamic web environments where link structures frequently change. Re-calculating scores in real-time is challenging, making it less practical for rapidly changing content.
✅ Lacks of Consideration for Content Quality: HITS also do not take into account textual information contained in the linked WWW pages.to compute scores, it is computationally expensive and inefficient for dynamic web environments where link structures frequently change. Re-calculating scores in real-time is challenging, making it less practical for rapidly changing content.

Conclusion:

In summary, the equally crucial Hyperlink-Induced Topic Search (HITS) algorithm plays a vital role in determining the significance of hub and authority pages in a given network. By deriving authority and hub scores based on the linking pattern, HITS is particularly effective in contexts like scholarly citation evaluation, mapping social network influence, and conducting topic-oriented searches, facilitating the clustering of interconnected documents.

Nonetheless, there are evident limitations of HITS: the algorithm is responsive to link spam and relies on specific query-driven subsets; moreover, this version is unreliable in terms of convergence within dense networks of links. It lacks content quality assessment and is prone to manipulation in an unrestricted setting. Despite these drawbacks, HITS remains valuable for retrieving information and recommending content in well-defined domains. In scenarios requiring comprehensive ranking, alternative algorithms like PageRank can supplement or substitute HITS to deliver enhanced, potentially more pertinent results in extensive, ever-changing environments.

Hits Algorithm In C++