Weakly Connected Components Execution Time vs. Graph Size for DataWalk and Other Vendors. The vertical lines indicate the graph sizes at which computations for each vendor started failing to complete.

Blog article by
Łukasz Łaszczuk,
R&D Engineer at DataWalk

Weakly Connected Components in Large-Scale Knowledge Graphs

Benchmark Study

DataWalk excels in handling large graphs that do not fit into RAM and workflows that include incremental data loading.

See How It Works!

Introduction
What are Weakly Connected Components?
Business Applications of Connected Components
Weakly Connected Components in DataWalk Platform
Benchmark Methodology
Multi-Node Environment Performance
Conclusion

Introduction

DataWalk is an enterprise graph analytics platform designed to reveal patterns, relationships, and anomalies in large-scale, multi-source data. One of the key graph algorithms supported by DataWalk is weakly connected components, which efficiently analyzes and understands networks by breaking them into smaller, manageable groups, ensuring connectivity within each group.

DataWalk enhances data integration across diverse datasets, unifying structured and unstructured data for comprehensive analysis and decision-making. DataWalk's capabilities are particularly useful for large-scale networks in various business applications, including fraud detection in transactional data, social network analysis, and entity resolution. For instance, in fraud detection, DataWalk can identify groups of connected accounts or transactions that may indicate fraudulent activity, such as money laundering. An enterprise knowledge graph plays a vital role in consolidating organizational knowledge, enabling efficient data management, and supporting advanced AI applications through a structured representation of data and its relationships.

To evaluate its performance, DataWalk conducted benchmarks of the weakly connected components algorithm, comparing it with two leading graph database tools in real-world scenarios using blockchain Bitcoin transactional data. Benchmark results indicate DataWalk as the only vendor able to complete all of the experiments, while other vendors failed to do so for cases where the size of data exceeded available RAM.

What are Weakly Connected Components?

The weakly connected components algorithm is a crucial tool for pre-processing graphs, offering an effective means to understand the underlying data structure of large graph datasets. This algorithm automatically discovers network segments that are interconnected within their own community but disconnected from other communities. These segments are formed by various data points establishing relationships within the network.

Extracting connected components has several applications in further analytics, including:

Comprehensive Network View: Providing a detailed view of the network structure through analysis of component size distribution and examining the proportion of small and large communities.
Intermediate Analytical Step: Serving as a preliminary step for executing other algorithms on the detected components.
Community Analysis: Enabling a deeper understanding of individual communities and their properties, which can have practical applications, such as detecting fraud in financial or insurance networks by analyzing community statistics like the number of rings, diameter, and maximum ring perimeter.
Machine Learning Features: Utilizing component statistics or community labels (e.g., “Is the customer part of a community containing fraudulent transactions?”) as graph features for machine learning models, enhancing their performance.

It's important to highlight that the DataWalk graph analytical platform supports each of these applications effectively.

An example of two communities extracted using the weakly connected components algorithm is presented below:

An example of two weakly connected communities. On the left, we can see a community with a closed ring, and on the right - an open ring.

Business Applications of Connected Components

Connected Components in Blockchain Transactions

Analyzing blockchain transactions using connected components can help identify clusters of addresses and transactions that may indicate fraudulent activities or other significant patterns. This analysis is particularly useful in fraud detection and AML (Anti-Money Laundering) efforts.

Entity Resolution

Connected components are also crucial for entity resolution, which involves identifying and linking records that refer to the same entity across different data sources. By identifying connected components within the data, businesses can:

Consolidate Client Information: Merge duplicate records to create a single view of each client.
Enhance Data Quality: Improve the accuracy and consistency of data by resolving duplicate entries and linking related records.
Streamline Operations: Optimize processes such as customer onboarding, compliance checks, and support services by ensuring accurate and comprehensive information.

Weakly Connected Components in DataWalk Platform

In DataWalk, weakly connected components can be run on different types of graphs, including bipartite graphs with multiple types of connections as well as graphs consisting of multiple entity types. Bipartite graphs consist of two distinct entity types connected by various attributes. Unlike traditional data management systems, which often rely on a single schema, graph analytical platforms allow for more flexible and efficient handling of complex relationships. Although this article focuses on the application of weakly connected components in bipartite graphs, it's important to note that DataWalk is not limited in this aspect and can efficiently compute clusters on any type of graph, providing versatility in handling diverse datasets.

Advantages of DataWalk

No-Code Configuration

DataWalk allows users to configure weakly connected components without needing to learn any programming language, unlike some graph databases. This no-code approach makes it accessible to a broader range of users.

Automatic Result Integration

Once the algorithm finishes its execution, the results are seamlessly integrated into the DataWalk knowledge graph, providing an efficient and effective way to expand your analytics.

Comprehensive Community Metrics

DataWalk provides extensive graph metrics for analyzing communities, such as the number of rings, diameter, and maximum ring perimeter. These metrics help in understanding the properties and behaviors of different communities within the network.

Adding Context to Communities

Users can easily add business attributes and context to the detected communities, enhancing the richness and usability of the data for further analysis or machine learning. Example features added on top of extracted communities could include the number of fraudulent transactions inside the community, average value of transactions, number of suspicious accounts, etc. All of this can be easily created from DataWalk’s UI.

Quick Visualization of Communities

DataWalk enables visualization of communities, allowing users to quickly understand the structure and relationships within the extracted components. This capability is essential for identifying patterns and anomalies effectively. Users can select the communities for investigation based on the graph community metrics and additional business attributes described in previous sections.

Linking Results to Source Objects

DataWalk links the results to source objects, allowing you to easily identify which transactions and addresses belong to which components automatically.

Benchmark Methodology

Datasets and Experimental Setup

To test the weakly connected components algorithm, the benchmark used a graph representation created from three datasets related to the Bitcoin blockchain. These datasets were:

Transaction Inputs: Containing information about transaction senders from each block.
Transactions Aggregated: Aggregating transactions from each block.
Addresses: Including all relevant addresses involved in the transactions.

Graphical representation of the knowledge graph created in DataWalk for computing weakly connected components.

The dataset comprised Bitcoin transactions and addresses from 2009 to 2021, spanning over 13 years of transaction history. To compare performance across different data volumes, the dataset was divided into multiple sub-datasets, each covering a single year of transactions and newly active addresses for that year.

We can think of addresses (BTC Address) and transactions aggregated (BTC TX) as nodes, and transaction inputs (BTC TX IN) as edges. This structure implies that two addresses may be linked through a shared transaction if they both initiated Bitcoin transfers within the same block:

Example instance data - two Bitcoin addresses performing a transaction in the block numbered 27,525.

Description of the Datasets

We conducted 13 experiments to evaluate the connected components algorithm. The first experiment was run on data from 2009, and the second experiment included data from 2009 and 2010, this pattern continued until the final experiment, which used the entire dataset from 2009 to 2021. The final dataset consisted of approximately 1.5 billion nodes and edges.

This approach provided a comprehensive evaluation of the connected components algorithm across different data scales. By following this methodology, we tested scenarios where all the data fits into available RAM memory and scenarios where it does not. For each experiment, the connected components algorithm was run five times to ensure the consistency of the results. The algorithm was tested on single-node instances for all vendors. Additionally, we evaluated how DataWalk’s performance scales on three-node and six-node clusters.

Graph and Data size for iterative experiment scenario.

Hardware and Software Configurations

To ensure a fair comparison between DataWalk and graph database vendors, the experiments were performed on standardized hardware, specifically Amazon EC2 instances with comparable specifications. Amazon R6i instances with 128GB of RAM (r6i.4xlarge) were used for all tests.

Execution Times of Weakly Connected Components Algorithm

The following figure presents the average execution time for extracting connected components. The results are averaged over five runs of the algorithm every year for DataWalk and the other graph database tools.

DataWalk is the only platform that successfully completed all experiments on the r6i.4xlarge instance. The vertical lines in the picture highlight each vendor's failure points. The Vendor 2 was able to complete only 7 out of 13 experiments, and the Vendor 1 - 11 out of 13 experiments. The results show that the computation time for DataWalk increases linearly with the graph size, highlighting the platform's scalability and efficiency. This capability is particularly beneficial for scenarios where data cannot fit into RAM, demonstrating DataWalk's superior performance in handling large-scale datasets.

Multi-Node Environment Performance

DataWalk’s Performance Scaling with Additional Nodes

Previously, we tested the performance of DataWalk only in single-machine environments. In this section, we examine how DataWalk’s performance scales as the data is distributed across a cluster of machines. Specifically, we look at how increasing the number of database compute servers affects the load time and the speed of extracting connected components.

The following chart presents data load times in seconds for DataWalk’s 1, 3, and 6-node environments (each node being r6i.4xlarge instance).

The figure below shows the average execution time for extracting connected components using DataWalk’s 1, 3, and 6-node environments, averaged over five iterations of the experiment.

Weakly Connected Components Execution Time vs. Graph Size for DataWalk with different number of nodes.

From the graph, we can conclude that adding computation nodes reduces the time of weakly connected components computation for graphs larger than 40M of nodes and edges.

Conclusion

Summary of Findings

DataWalk excels in handling large graphs that do not fit into RAM and workflows that include incremental data loading.

Benefits of Using DataWalk for Large-Scale Blockchain Graph Analysis

There are several advantages to using DataWalk:

Low Resource Consumption: DataWalk is an optimal choice when additional tasks need to be performed in the environment alongside running the connected components algorithm.
Robust Software: DataWalk supports a wide range of data formats and offers ease of configuration.
Performance and Stability: DataWalk is highly performant and stable when loading large amounts of data incrementally, often outperforming other tools in terms of speed, especially in large-scale data scenarios.
Complete Experimentation: DataWalk is the only tool that completes all the considered experiments while providing faster computation times compared to other tools when dealing with large-scale data.

Additionally, performance improvements are observed when using multiple database servers with DataWalk, both in terms of data load times and algorithm execution runtimes. This demonstrates that the connected components algorithm, as well as other algorithms, can be parallelized and scaled horizontally using DataWalk.

Unlock the Power of DataWalk

See How It Works!

Weakly Connected Components in Large-Scale Knowledge Graphs

DataWalk excels in handling large graphs that do not fit into RAM and workflows that include incremental data loading.

Table of contents

Introduction

What are Weakly Connected Components?

Business Applications of Connected Components

Connected Components in Blockchain Transactions

Entity Resolution

Weakly Connected Components in DataWalk Platform

Advantages of DataWalk

No-Code Configuration

Automatic Result Integration

Comprehensive Community Metrics

Adding Context to Communities

Quick Visualization of Communities

Linking Results to Source Objects

Benchmark Methodology

Datasets and Experimental Setup

Description of the Datasets

Hardware and Software Configurations

Execution Times of Weakly Connected Components Algorithm

Multi-Node Environment Performance

DataWalk’s Performance Scaling with Additional Nodes

Conclusion

Summary of Findings

Benefits of Using DataWalk for Large-Scale Blockchain Graph Analysis

Unlock the Power of DataWalk