DataWalk is an enterprise graph analytics platform designed to reveal patterns, relationships, and anomalies in large-scale, multi-source data. One of the key graph algorithms DataWalk supports is weakly connected components, which makes networks easier to analyze and understand by breaking them into smaller, manageable groups, each of which is internally connected.
DataWalk enhances data integration across diverse datasets, unifying structured and unstructured data for comprehensive analysis and decision-making. DataWalk's capabilities are particularly useful for large-scale networks in various business applications, including fraud detection in transactional data, social network analysis, and entity resolution. For instance, in fraud detection, DataWalk can identify groups of connected accounts or transactions that may indicate fraudulent activity, such as money laundering. An enterprise knowledge graph plays a vital role in consolidating organizational knowledge, enabling efficient data management, and supporting advanced AI applications through a structured representation of data and its relationships.
To evaluate its performance, DataWalk benchmarked the weakly connected components algorithm against two leading graph database tools in a real-world scenario using Bitcoin blockchain transaction data. In the benchmark, DataWalk was the only vendor able to complete all of the experiments; the other vendors failed in cases where the size of the data exceeded the available RAM.
The weakly connected components algorithm is a crucial tool for pre-processing graphs, offering an effective means to understand the underlying data structure of large graph datasets. This algorithm automatically discovers network segments that are interconnected within their own community but disconnected from other communities. These segments are formed by various data points establishing relationships within the network.
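To make the idea concrete, here is a minimal sketch of weakly connected components computed with a union-find (disjoint-set) structure. It is an illustration only, not DataWalk's implementation; the function name and the edge-list input format are assumptions made for this example.

```python
from collections import defaultdict


def weakly_connected_components(edges):
    """Group nodes into weakly connected components (edge direction is ignored).

    edges: iterable of (source, target) pairs. Hypothetical input format.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root.
        parent.setdefault(node, node)
        # Walk up to the root of this node's tree.
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path directly at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    for source, target in edges:
        root_a, root_b = find(source), find(target)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    components = defaultdict(set)
    for node in parent:
        components[find(node)].add(node)
    return list(components.values())


example_edges = [("a", "b"), ("c", "b"), ("d", "e")]
example_components = weakly_connected_components(example_edges)
```

In this toy input, `a`, `b`, and `c` end up in one component and `d` and `e` in another, because no edge connects the two groups.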
Extracting connected components has several applications in further analytics, including:
It's important to highlight that the DataWalk graph analytical platform supports each of these applications effectively.
An example of two communities extracted using the weakly connected components algorithm is presented below:
Analyzing blockchain transactions using connected components can help identify clusters of addresses and transactions that may indicate fraudulent activities or other significant patterns. This analysis is particularly useful in fraud detection and AML (Anti-Money Laundering) efforts.
Connected components are also crucial for entity resolution, which involves identifying and linking records that refer to the same entity across different data sources. By identifying connected components within the data, businesses can:
In DataWalk, weakly connected components can be run on different types of graphs, including bipartite graphs with multiple types of connections as well as graphs consisting of multiple entity types. Bipartite graphs consist of two distinct entity types connected by various attributes. Unlike traditional data management systems, which often rely on a single schema, graph analytical platforms allow for more flexible and efficient handling of complex relationships. Although this article focuses on the application of weakly connected components in bipartite graphs, it's important to note that DataWalk is not limited in this aspect and can efficiently compute clusters on any type of graph, providing versatility in handling diverse datasets.
DataWalk allows users to configure weakly connected components without needing to learn any programming language, unlike some graph databases. This no-code approach makes it accessible to a broader range of users.
Once the algorithm finishes its execution, the results are seamlessly integrated into the DataWalk knowledge graph, providing an efficient and effective way to expand your analytics.
DataWalk provides extensive graph metrics for analyzing communities, such as the number of rings, diameter, and maximum ring perimeter. These metrics help in understanding the properties and behaviors of different communities within the network.
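As an illustration of one such metric, the sketch below computes the diameter of a single community (the longest shortest path between any two of its nodes) with a breadth-first search from every node. This is a generic textbook approach, not DataWalk's implementation, and the adjacency-dictionary input is an assumption for the example. It runs a full search per node, so it is only practical for small communities.

```python
from collections import deque


def component_diameter(adjacency):
    """Diameter of ONE connected, undirected component.

    adjacency: dict mapping each node to the set of its neighbours
               (hypothetical input format for this sketch).
    """

    def eccentricity(start):
        # Breadth-first search gives shortest hop counts from `start`.
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        return max(distance.values())

    # The diameter is the largest eccentricity over all nodes.
    return max(eccentricity(node) for node in adjacency)


# A simple chain a - b - c - d.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
chain_diameter = component_diameter(chain)
```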
Users can easily add business attributes and context to the detected communities, enhancing the richness and usability of the data for further analysis or machine learning. Example features added on top of extracted communities could include the number of fraudulent transactions inside the community, average value of transactions, number of suspicious accounts, etc. All of this can be easily created from DataWalk’s UI.
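In DataWalk these features are configured through the UI. For readers who want to see the underlying logic, the following sketch shows how such per-community features could be aggregated in plain code. The record keys (`tx_id`, `value`, `is_fraud`) and the `component_of` mapping are hypothetical names chosen for this example.

```python
from collections import defaultdict


def community_features(transactions, component_of):
    """Aggregate simple business features for each community.

    transactions: iterable of dicts with hypothetical keys
                  "tx_id", "value" and "is_fraud".
    component_of: dict mapping each tx_id to its community id.
    """
    totals = defaultdict(lambda: {"tx_count": 0, "fraud_count": 0, "value_sum": 0.0})
    for tx in transactions:
        stats = totals[component_of[tx["tx_id"]]]
        stats["tx_count"] += 1
        stats["fraud_count"] += int(tx["is_fraud"])
        stats["value_sum"] += tx["value"]

    # Turn the running sums into the final feature set.
    return {
        community: {
            "tx_count": s["tx_count"],
            "fraud_count": s["fraud_count"],
            "avg_value": s["value_sum"] / s["tx_count"],
        }
        for community, s in totals.items()
    }


sample_transactions = [
    {"tx_id": "t1", "value": 1.0, "is_fraud": True},
    {"tx_id": "t2", "value": 3.0, "is_fraud": False},
    {"tx_id": "t3", "value": 5.0, "is_fraud": False},
]
sample_membership = {"t1": 0, "t2": 0, "t3": 1}
sample_features = community_features(sample_transactions, sample_membership)
```

Features of this kind can then be used to rank communities for investigation or as inputs to a machine learning model.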
DataWalk enables visualization of communities, allowing users to quickly understand the structure and relationships within the extracted components. This capability is essential for identifying patterns and anomalies effectively. Users can select the communities for investigation based on the graph community metrics and additional business attributes described in previous sections.
DataWalk links the results to source objects, allowing you to easily identify which transactions and addresses belong to which components automatically.
To test the weakly connected components algorithm, the benchmark used a graph representation created from three datasets related to the Bitcoin blockchain. These datasets were:
The dataset comprised Bitcoin transactions and addresses from 2009 to 2021, spanning over 13 years of transaction history. To compare performance across different data volumes, the dataset was divided into multiple sub-datasets, each covering a single year of transactions and newly active addresses for that year.
We can think of addresses (BTC Address) and aggregated transactions (BTC TX) as nodes, and transaction inputs (BTC TX IN) as edges. This structure implies that two addresses may be linked through a shared transaction if they both initiated Bitcoin transfers within the same transaction.
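The graph model described above can be sketched in a few lines. The example assumes that each transaction-input row can be reduced to an `(address, tx_id)` pair; that layout and the function name are illustrative assumptions, not the actual schema used in the benchmark.

```python
from collections import defaultdict


def addresses_linked_by_transaction(tx_inputs):
    """Find transactions that link two or more addresses.

    tx_inputs: iterable of (address, tx_id) pairs, one per transaction
               input row (hypothetical layout). Each pair is one edge
               between an address node and a transaction node.
    """
    inputs_by_tx = defaultdict(set)
    for address, tx_id in tx_inputs:
        inputs_by_tx[tx_id].add(address)

    # A transaction node with several address neighbours links those addresses.
    return {tx_id: addrs for tx_id, addrs in inputs_by_tx.items() if len(addrs) > 1}


sample_inputs = [
    ("addr1", "tx1"),
    ("addr2", "tx1"),  # addr1 and addr2 both fund tx1, so they are linked
    ("addr3", "tx2"),  # tx2 has a single input address
]
linked = addresses_linked_by_transaction(sample_inputs)
```

Because both addresses and transactions are nodes, running connected components on this graph places `addr1`, `addr2`, and `tx1` in one component, while `addr3` and `tx2` form another.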
We conducted 13 experiments to evaluate the connected components algorithm. The first experiment was run on data from 2009, and the second experiment included data from 2009 and 2010. This pattern continued until the final experiment, which used the entire dataset from 2009 to 2021. The final dataset consisted of approximately 1.5 billion nodes and edges.
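The cumulative design of the experiments can be expressed as a short sketch. The function name is hypothetical; only the year range comes from the benchmark description.

```python
def cumulative_experiments(first_year=2009, last_year=2021):
    """Return the list of years covered by each experiment.

    Experiment k covers every year from first_year up to and
    including the k-th year, so each dataset contains the previous one.
    """
    return [
        list(range(first_year, end_year + 1))
        for end_year in range(first_year, last_year + 1)
    ]


experiments = cumulative_experiments()
```

Here `experiments[0]` contains only 2009, `experiments[1]` contains 2009 and 2010, and the last entry covers all years through 2021.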
This approach provided a comprehensive evaluation of the connected components algorithm across different data scales. By following this methodology, we tested scenarios where all the data fits into available RAM and scenarios where it does not. For each experiment, the connected components algorithm was run five times to ensure the consistency of the results. The algorithm was tested on single-node instances for all vendors. Additionally, we evaluated how DataWalk’s performance scales on three-node and six-node clusters.
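A repeated-run measurement of this kind can be sketched as follows. This is a generic timing harness, not the tooling used in the benchmark; `run_algorithm` is a hypothetical callable that stands in for one execution of the algorithm on a given platform.

```python
import statistics
import time


def average_runtime(run_algorithm, repetitions=5):
    """Run a callable several times and summarize its wall-clock time.

    run_algorithm: zero-argument callable (hypothetical placeholder
                   for one execution of connected components).
    Returns (mean_seconds, standard_deviation_seconds).
    """
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_algorithm()
        durations.append(time.perf_counter() - start)
    # stdev needs at least two samples, which holds for the default of five.
    return statistics.mean(durations), statistics.stdev(durations)


mean_seconds, stdev_seconds = average_runtime(lambda: sum(range(10_000)))
```

Reporting the standard deviation alongside the mean is one simple way to check that the five runs are consistent with each other.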
To ensure a fair comparison between DataWalk and graph database vendors, the experiments were performed on standardized hardware, specifically Amazon EC2 instances with comparable specifications. Amazon R6i instances with 128GB of RAM (r6i.4xlarge) were used for all tests.
The following figure presents the average execution time for extracting connected components. For DataWalk and the other graph database tools, the results for each yearly dataset are averaged over five runs of the algorithm.
DataWalk is the only platform that successfully completed all experiments on the r6i.4xlarge instance. The vertical lines in the figure highlight each vendor's failure point. Vendor 2 was able to complete only 7 out of 13 experiments, and Vendor 1 completed 11 out of 13. The results show that the computation time for DataWalk increases linearly with the graph size, highlighting the platform's scalability and efficiency. This capability is particularly beneficial for scenarios where data cannot fit into RAM, demonstrating DataWalk's superior performance in handling large-scale datasets.
Previously, we tested the performance of DataWalk only in single-machine environments. In this section, we examine how DataWalk’s performance scales as the data is distributed across a cluster of machines. Specifically, we look at how increasing the number of database compute servers affects the load time and the speed of extracting connected components.
The following chart presents data load times in seconds for DataWalk’s 1-, 3-, and 6-node environments (each node being an r6i.4xlarge instance).
The figure below shows the average execution time for extracting connected components using DataWalk’s 1-, 3-, and 6-node environments, averaged over five iterations of the experiment.
From the graph, we can conclude that adding computation nodes reduces the time needed to compute weakly connected components for graphs larger than 40 million nodes and edges.
DataWalk excels in handling large graphs that do not fit into RAM and workflows that include incremental data loading.
There are several advantages to using DataWalk:
Additionally, performance improvements are observed when using multiple database servers with DataWalk, both in terms of data load times and algorithm execution runtimes. This demonstrates that the connected components algorithm, as well as other algorithms, can be parallelized and scaled horizontally using DataWalk.