Entity resolution is essential for integrating data across multiple databases and disparate data sources. Its primary function is to link records that represent the same real-world entity into a single object within a database. This procedure, also known as deduplication or record linkage, is essential for maintaining data integrity and building a complete view of an organization's information, whether about customers, organizations, or products. Since entity resolution encompasses all types of data points, it affects every stage of analytical projects. This ranges from fundamental tasks, such as identifying and accurately profiling customers, to more complex activities, like detecting and preventing fraud, or recommending the right products for customers in e-commerce channels.
Entity resolution methods span a variety of techniques. These range from deterministic matching, which depends on exact data matches and rules specific to an organization, through matching based on phonetic algorithms, to probabilistic matching, which employs statistical models to infer connections between data points. Graph analytics and machine learning approaches enriched with AI capabilities such as Large Language Models (LLMs) can improve accuracy by identifying linkages that rules alone do not discover. Entity resolution that uses multiple techniques can be implemented via a knowledge graph, which provides a unified view of the data and enhances the understanding of linked records, revealing less obvious connections between data.
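To make the difference between deterministic and phonetic matching concrete, here is a minimal Python sketch. It is an illustration only, not the implementation used by either product: the record fields, the sample data, and the helper names (`deterministic_key`, `phonetic_key`, `group_by`) are assumptions made for the example, and the phonetic key uses a plain American Soundex encoding.

```python
from collections import defaultdict

# Soundex digit for each consonant group; vowels, H, W and Y have no digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (first + "".join(encoded) + "000")[:4]


# Hypothetical sample records, invented for this example.
records = [
    {"id": 1, "name": "Robert Smith", "email": "R.Smith@example.com "},
    {"id": 2, "name": "Rupert Smyth", "email": "rupert@example.com"},
    {"id": 3, "name": "Bob Smith", "email": "r.smith@example.com"},
]


def deterministic_key(record):
    # Exact match on a normalized attribute.
    return record["email"].strip().lower()


def phonetic_key(record):
    # Names that sound alike share a key even when spelled differently.
    return " ".join(soundex(part) for part in record["name"].split())


def group_by(items, key_fn):
    """Return lists of record ids that share the same key."""
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


print(group_by(records, deterministic_key))  # records 1 and 3 share an email
print(group_by(records, phonetic_key))       # records 1 and 2 sound alike
```

The two keys group the same records differently, which is why the techniques are usually combined rather than used in isolation.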
Knowledge graphs provide a dynamic, structured representation of organizational data through graph-based models that incorporate a semantic layer into the data model. This semantic layer enhances the understanding of data by defining the types, properties, and relationships of entities within the graph, making the data model more expressive and aligned with real-world concepts. Entity resolution, when combined with a knowledge graph that leverages a semantic layer, significantly improves data quality and the outcomes achievable from analytics. It ensures that links between entities are accurate and that records describing the same entity are represented as a single object, thereby making the network of data more meaningful and navigable.
There are many tools available for creating an entity-resolved knowledge graph, but the real challenge is in finding a solution that strikes the right balance between technical demands and the user experience for business professionals. Achieving this balance is crucial for optimizing the end-to-end entity resolution process, which involves both technical and business stakeholders. The former play a crucial role in applying the techniques and overseeing quality, while the latter, who work directly with the data, can enhance entity definitions by identifying nuances that algorithms may overlook.
In this comparison, we will explore how Neo4j's entity resolution strategy caters primarily to technical users, as opposed to DataWalk's approach, which aims to meet the needs of both technical and business users within its knowledge graph and analytical framework.
Unified database
Since entity resolution involves all data within an organization and is an ongoing process, it typically requires processing vast amounts of data repeatedly and in a scalable manner.
Entity resolution with analytics in Neo4j relies on the Graph Data Science Library (Neo4j's library of graph algorithms, which can also be used from Python). This approach requires projecting the data, meaning that specific subsets of the database are temporarily structured to facilitate the execution of graph algorithms for entity resolution. This step adds complexity to the workflow, as users must define and manage these projections to target their analysis effectively. With large data volumes this can be extremely challenging, because users need to decide the scope of the data before running the entity resolution analysis.
On the other hand, DataWalk addresses the entity resolution process by offering access to matching algorithms and the possibility of incorporating more advanced algorithms, all within one unified database. This approach allows users to perform entity resolution directly on the database without any preliminary data selection. Upon defining the parameters, users can immediately observe their impact on specific objects and their clusters, highlighting connections and uncovering new information. DataWalk is designed to efficiently manage these tasks on a large scale, processing the entire database to not only analyze specific cases but also to provide comprehensive results across all data.
External and ad-hoc data integration
While operating on a single database is vital for entity resolution, the capability to cross-reference with external data sources is equally important. DataWalk, as a graph analytical platform, facilitates entity resolution with external data sources, including both structured and unstructured data, through a user-friendly feature that lets you drag and drop data sets. In contrast, Neo4j demands coding skills to import data into the graph database and then resolve entities using Cypher (Neo4j's query language) or the Graph Data Science Library.
When business users need to enhance their analysis with new information in Excel files, or have ad-hoc tasks that are not necessarily reflected in the data model, they can just load data into DataWalk via drag-and-drop. A good example is the Pandora Papers articles: with most tools, integrating all of those articles with an organization's databases was a challenging task. In contrast, with DataWalk, you can just use the framework to upload the data, extract entities, and then link them with nodes already present in the data model.
From the end user's perspective, DataWalk offers greater flexibility: users can run entity resolution and examine the results in multiple ways without developer involvement, which can sometimes take weeks. It also shortens reaction time for ad-hoc tasks that require quick decisions.
No-code entity resolution
It is important to offer business users increased flexibility by enabling them to directly explore the data. Neo4j is a graph database with its own query language, and its entity resolution requires coding expertise. Because entity resolution usually belongs to the business domain, this makes it hard for business users to define and test their hypotheses on how entities should match. DataWalk addresses that need with a visual no-code interface that allows approaches for matching entities to be prototyped with a few simple clicks. DataWalk incorporates a range of entity resolution techniques into its platform, such as phonetic hashing algorithms like Soundex, Double Metaphone, and Eudex, as well as similarity functions like date range matching, Levenshtein distance, Jaccard distance, and others. Furthermore, power users can integrate their own functions directly within the database, including hashing, matching, and other essential operations. This enables them to customize the process of matching objects in a scalable manner.
All of these methods can be combined to achieve greater accuracy. Users can quickly prototype a suitable approach for the entire organization, as well as for their own use cases.
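As a rough illustration of how two of the similarity functions named above behave and how they might be combined, the following Python sketch implements Levenshtein distance and Jaccard similarity (Jaccard distance is one minus this value) from scratch. It does not reflect how DataWalk implements them; the `names_match` rule and its threshold parameters are hypothetical values chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale edit distance to a 0..1 similarity."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


def names_match(a: str, b: str,
                min_edit_similarity: float = 0.8,
                min_token_overlap: float = 0.5) -> bool:
    """Hypothetical combined rule: close spelling OR strong token overlap."""
    a, b = a.lower(), b.lower()
    return (levenshtein_similarity(a, b) >= min_edit_similarity
            or jaccard_similarity(set(a.split()), set(b.split())) >= min_token_overlap)


print(names_match("Jon Smith", "John Smith"))              # one-letter typo
print(names_match("Acme Trading Ltd", "Ltd Acme Trading"))  # reordered words
print(names_match("Acme Corp", "Globex Inc"))               # unrelated names
```

Edit distance catches typos while token overlap catches reordered words, so the combined rule accepts pairs that either function alone would miss.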
Entity resolution workflow
Because business users combine different techniques and end-user feedback has to be incorporated, entity resolution demands a well-structured workflow. Neo4j, as a graph database, provides the necessary tools for performing entity resolution, whereas the DataWalk platform offers a holistic view of the entire entity resolution process. Graph visualization in DataWalk is essential for comprehensively understanding the connections within entity-resolved knowledge graphs, showing how records link within the structured data. In the interface, users can check the records that make up a merged entity, propose merges or unmerges as needed, or tailor actions for their specific use cases. Each action is trackable and auditable through the user interface. Neo4j also offers a visual representation of graphs, but it lacks the same level of flexibility and transparency of actions, as all actions require coding.
Improving entity resolution with advanced analytics
In addition to comparing data between nodes through rules and matching algorithms, machine learning algorithms can enhance the accuracy of entity resolution. The Graph Data Science Library in Neo4j allows this, as does DataWalk’s integration with Jupyter Notebook. With Jupyter Notebook in DataWalk, users can execute graph algorithms, utilize graph embeddings, integrate with LLMs to find similarities between entities based on text descriptions of nodes and relationships, or implement their own ideas and extensions.
Leveraging graph algorithms in entity resolution enhances the utilization of knowledge graphs and graph data structures. This approach enables the identification of different entities as a single entity by analyzing their relationships through graph embeddings and converting these into node similarity scores. For example, people connected to the same address, phone number and company could indicate the same person based on a graph structure.
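The address, phone number, and company example can be sketched with a much simpler stand-in for graph embeddings: scoring two person nodes by the overlap of their neighbors. The Python below is a minimal sketch under that assumption; the edge list and node labels are invented for illustration, and a real deployment would typically rely on a graph algorithm library rather than this hand-rolled version.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical edges linking person nodes to attribute nodes.
edges = [
    ("person:1", "address:10 Main St"),
    ("person:1", "phone:555-0100"),
    ("person:1", "company:Acme"),
    ("person:2", "address:10 Main St"),
    ("person:2", "phone:555-0100"),
    ("person:2", "company:Acme"),
    ("person:3", "company:Acme"),
    ("person:3", "address:7 Oak Ave"),
]

# Collect the set of attribute nodes attached to each person.
neighbors = defaultdict(set)
for person, attribute in edges:
    neighbors[person].add(attribute)


def node_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two nodes' neighbor sets."""
    shared = neighbors[a] & neighbors[b]
    combined = neighbors[a] | neighbors[b]
    return len(shared) / len(combined) if combined else 0.0


# Score every pair of persons.
scores = {
    (a, b): node_similarity(a, b)
    for a, b in combinations(sorted(neighbors), 2)
}

for pair, score in scores.items():
    print(pair, round(score, 2))
```

Here persons 1 and 2 share all three neighbors and score 1.0, while person 3 shares only the company with each of them and scores 0.25, so only the first pair would be a merge candidate.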
Among machine learning techniques, text embeddings also contribute significantly to entity resolution. By generating text embeddings from descriptions of products or people, it's possible to discover similarities between nodes that go beyond their explicit attributes. DataWalk employs a locally hosted embeddings model (open source or provided by customer organizations), which does not require internet access, to compute embeddings for every item in a selected set.
DataWalk also allows the use of Large Language Models (LLMs), which can provide deeper context to text, uncovering meanings that may not be readily apparent.
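Once descriptions are turned into vectors, comparing them usually comes down to a similarity measure such as cosine similarity. The sketch below shows only that comparison step. The `embed` function is a toy word-count stand-in, labeled as an assumption, because a real embedding model (which is what captures meaning beyond shared words) cannot be reproduced in a few lines; the product descriptions are invented.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse word-count vector.

    A real model would return a dense vector that also places
    differently worded but related descriptions close together.
    """
    return Counter(text.lower().split())


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical product descriptions.
a = embed("red cotton shirt")
b = embed("cotton shirt red large")
c = embed("stainless steel kettle")

print(round(cosine_similarity(a, b), 3))  # high: mostly the same words
print(round(cosine_similarity(a, c), 3))  # zero: no words in common
```

Swapping `embed` for a call to an actual embedding model would leave `cosine_similarity` unchanged apart from the vector type, which is the point of separating the two steps.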
Each of the methods mentioned can produce a similarity score, and these scores can be combined to increase accuracy. However, it's crucial to set a proper threshold on the combined score for matching entities. DataWalk's graphical interface allows users to examine the distribution of results and their impact on the knowledge graph, enabling them to establish an appropriate threshold. This threshold can be easily adjusted within the interface, making the process straightforward and user-friendly. This approach provides control over the process without sacrificing accuracy, thanks to the application of more advanced techniques for entity resolution. On the other hand, Neo4j demands analytical and coding expertise to properly adjust thresholds, which could be viewed as a drawback from a business perspective.
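The effect of the threshold can be illustrated with a small Python sketch that combines per-method scores into one weighted score and then merges every pair at or above the threshold into clusters. This is an assumption-laden illustration, not either vendor's algorithm: the method names (`name`, `graph`, `text`), the weights, the pair scores, and the union-find clustering are all chosen for the example.

```python
from collections import defaultdict


def combine(scores: dict, weights: dict) -> float:
    """Weighted average of per-method similarity scores (missing = 0)."""
    total = sum(weights.values())
    return sum(w * scores.get(method, 0.0) for method, w in weights.items()) / total


def cluster(pair_scores: dict, threshold: float) -> list:
    """Merge records whose combined score reaches the threshold (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        root_a, root_b = find(a), find(b)
        if score >= threshold and root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical weights and per-method scores for three candidate pairs.
weights = {"name": 0.5, "graph": 0.3, "text": 0.2}
raw = {
    ("A", "B"): {"name": 0.90, "graph": 1.00, "text": 0.80},
    ("B", "C"): {"name": 0.70, "graph": 0.25, "text": 0.60},
    ("C", "D"): {"name": 0.95, "graph": 0.90, "text": 0.85},
}
combined = {pair: combine(scores, weights) for pair, scores in raw.items()}

print(cluster(combined, threshold=0.8))  # strict: two separate entities
print(cluster(combined, threshold=0.5))  # loose: everything collapses into one
```

Lowering the threshold from 0.8 to 0.5 lets the weak B-C link join two otherwise separate clusters, which is the kind of impact on the graph that is worth inspecting before a threshold is fixed.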
In this article, we highlighted critical differences between Neo4j and DataWalk regarding entity resolution. Our primary aim was to evaluate the entire entity resolution process, considering it as an ongoing task within organizations where data sources and techniques for enhancing the process are constantly evolving. Therefore, it is essential to select a solution that offers flexibility and transparency and ensures accuracy by integrating various techniques. DataWalk stands out by offering a knowledge graph and a user-friendly interface that incorporates embedded algorithms and allows entity resolution without scripting, thus catering to the needs of users without coding expertise. On the other hand, Neo4j is more suited to technical users, emphasizing coding and graph database manipulation, and thus requires more specialized knowledge for effective entity resolution.