Search Results
Fast, accurate and scalable probabilistic data linkage. Splink is a Python package for probabilistic record linkage (entity resolution) that allows you to deduplicate and link records from datasets without unique identifiers.
To get a basic Splink model up and running, use the following code. It demonstrates how to: estimate the parameters of a deduplication model; use the parameter estimates to identify duplicate records; and use clustering to generate an estimated unique person ID. Simple Splink Model Example. import splink.comparison_library as cl from splink import ...
When building any linkage model in Splink, there are 3 key things which need to be defined: what type of linkage you want (defined by the link type); what pairs of records to consider (defined by blocking rules); and what features to consider, and how they should be compared (defined by comparisons).
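As an illustration of the second item, blocking rules restrict scoring to pairs of records that agree on some key, so that not every possible pair has to be compared. The sketch below shows that idea in plain Python. It is not Splink's API: the function `candidate_pairs`, the field names, and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_fields):
    """Return the pairs of record ids that agree on every field in key_fields."""
    blocks = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if None in key:
            continue  # a record with a missing key value joins no block
        blocks[key].append(record["id"])

    pairs = []
    for ids in blocks.values():
        # Only records inside the same block are paired with each other.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Hypothetical sample data, for illustration only.
records = [
    {"id": 1, "first_name": "ann", "dob": "1990-01-01"},
    {"id": 2, "first_name": "ann", "dob": "1990-01-01"},
    {"id": 3, "first_name": "bob", "dob": "1985-06-30"},
    {"id": 4, "first_name": "ann", "dob": "1991-02-03"},
]

loose = candidate_pairs(records, ["first_name"])
tight = candidate_pairs(records, ["first_name", "dob"])
```

A looser key (first name only) yields more candidate pairs than a tighter one (first name and date of birth), which is the trade-off between recall and the number of comparisons.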
Bases: ComparisonCreator. Represents a comparison of the data in col_name with multiple levels based on absolute time differences: Exact match in col_name. Absolute time difference levels at specified thresholds. ... Anything else. For example, with metrics = ['day', 'month'] and thresholds = [1, 3] the levels are: Exact match in col_name.
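The level structure described in this snippet can be sketched in plain Python as a function that assigns a pair of dates to the first level it satisfies. This is a conceptual sketch rather than the library's implementation: the function name, the level labels, and the approximate day counts per metric are assumptions made for the example.

```python
from datetime import date

# Approximate length of each metric in days (an assumption of this sketch).
DAYS_PER_METRIC = {"day": 1.0, "month": 30.4375, "year": 365.25}


def time_difference_level(left, right, metrics=("day", "month"), thresholds=(1, 3)):
    """Return the label of the first comparison level the two dates satisfy."""
    if left == right:
        return "exact match"
    difference_in_days = abs((left - right).days)
    # Levels are checked in order, from the tightest threshold to the loosest.
    for metric, threshold in zip(metrics, thresholds):
        if difference_in_days <= threshold * DAYS_PER_METRIC[metric]:
            return f"within {threshold} {metric}"
    return "anything else"


same = time_difference_level(date(2000, 1, 1), date(2000, 1, 1))
one_day = time_difference_level(date(2000, 1, 1), date(2000, 1, 2))
two_months = time_difference_level(date(2000, 1, 1), date(2000, 3, 1))
one_year = time_difference_level(date(2000, 1, 1), date(2001, 1, 1))
```

With metrics = ['day', 'month'] and thresholds = [1, 3], the four calls fall into the exact match, 1 day, 3 month, and anything else levels respectively.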
24 Jul 2024 · Splink 4 code that uses the same settings will produce the same results (predictions) as Splink 3. That said, there have been significant changes to the syntax and a reorganisation of functions. For users wishing to familiarise themselves with Splink 4, we recommend the easiest way is to compare and contrast the new examples with their Splink 3 equivalents.
The result of linker.predict() is a list of pairwise record comparisons and their associated scores. For instance, if we have input records A, B, C, D and E, it could be represented conceptually as: A -> B with score 0.9; B -> C with score 0.95; C -> D with score 0.1; D -> E with score 0.99.
splink.clustering. Clusters the pairwise match predictions into groups of connected records using the connected components graph clustering algorithm. Records with an estimated match probability at or above threshold_match_probability are considered to be a match (i.e. they represent the same entity). If no match probability column is provided ...
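To make the thresholding and connected components step concrete, here is a minimal plain-Python sketch using a union-find structure. It is not Splink's implementation, and the function name `cluster_pairs` and the threshold of 0.5 are assumptions chosen for the example. The input reuses the conceptual pairwise scores A -> B 0.9, B -> C 0.95, C -> D 0.1, D -> E 0.99.

```python
def cluster_pairs(pairs, threshold):
    """Group records linked by any chain of pairs scoring at or above threshold."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right, probability in pairs:
        root_left, root_right = find(left), find(right)
        if probability >= threshold and root_left != root_right:
            parent[root_right] = root_left  # merge the two components

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


pairs = [("A", "B", 0.9), ("B", "C", 0.95), ("C", "D", 0.1), ("D", "E", 0.99)]
clusters = cluster_pairs(pairs, threshold=0.5)
```

At this threshold the C -> D edge (0.1) is dropped, so the records split into two clusters: A, B, C and D, E. Raising the threshold above 0.9 would also separate A from B.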
Defining and customising how record comparisons are made. A key feature of Splink is the ability to customise how record comparisons are made - that is, how similarity is defined for different data types. For example, the definition of similarity that is appropriate for a date of birth field is different than for a first name field.
Given the variety of potential use cases of Splink, there is no perfect, universal model, just models that can be tuned to produce useful outputs for a given application. The tools within Splink are intended to help identify areas where your model may not be performing as expected. In future releases we hope to automatically flag where ...
The waterfall_chart shows the amount of evidence of a match that is provided by each comparison for a pair of records. Each bar represents a comparison and the corresponding amount of evidence (i.e. match weight) of a match for the pair of values displayed above the bar. What the chart tooltip shows. The tooltip contains information based on ...
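The arithmetic behind such a chart can be sketched as follows, assuming the usual Fellegi-Sunter formulation in which each comparison's match weight is log2(m / u), the weights are added to a prior weight, and the total is converted back to a probability. The m and u values, the prior, and the function names below are hypothetical and chosen only for illustration.

```python
import math


def match_weight(m, u):
    """Evidence from one comparison: log2 of the Bayes factor m / u."""
    return math.log2(m / u)


def weight_to_probability(weight):
    """Convert a total match weight back into a match probability."""
    bayes_factor = 2.0 ** weight
    return bayes_factor / (1.0 + bayes_factor)


def waterfall(prior_probability, comparisons):
    """Return one bar per step (the prior, then each comparison), the total weight,
    and the final match probability."""
    prior_weight = math.log2(prior_probability / (1.0 - prior_probability))
    bars = [("prior", prior_weight)]
    total = prior_weight
    for name, m, u in comparisons:
        weight = match_weight(m, u)
        bars.append((name, weight))
        total += weight
    return bars, total, weight_to_probability(total)


# Hypothetical (name, m, u) values for the observed level of each comparison.
comparisons = [("first_name", 0.8, 0.01), ("dob", 0.9, 0.001), ("city", 0.1, 0.6)]
bars, total_weight, probability = waterfall(0.001, comparisons)
```

Positive bars (here first_name and dob) are evidence for a match, and negative bars (here city, where m is smaller than u) are evidence against one.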