fuzzystring provides fast, flexible fuzzy string
joins for data.frame and data.table objects
using approximate string matching. Built on top of
data.table and stringdist, it uses compiled
C++ result assembly plus adaptive candidate planning to reduce
unnecessary distance evaluations in single-column joins.
You can install fuzzystring from CRAN:
You can also install the development version from GitHub:
Here’s a simple example matching diamond cuts with slight misspellings:
library(fuzzystring)
# Your messy data
x <- data.frame(
  name = c("Idea", "Premiom", "Very Good"),
  id = 1:3
)
# Reference data
y <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood"),
  grp = c("A", "B", "C")
)
# Fuzzy join with max distance of 2 edits
fuzzystring_inner_join(
  x, y,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
#>        name id approx_name grp distance
#> 1      Idea  1       Ideal   A        1
#> 2   Premiom  2     Premium   B        1
#> 3 Very Good  3    VeryGood   C        1
fuzzystring supports all standard join types. Below is a small, reusable example dataset so you can compare the behavior of each join family.
x_join <- data.frame(
  name = c("Idea", "Premiom", "Very Good", "Gooood"),
  id = 1:4
)
y_join <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood", "Good"),
  grp = c("A", "B", "C", "D")
)
fuzzystring_inner_join(): Only matching rows.
fuzzystring_left_join(): All rows from x, matching rows from y.
fuzzystring_right_join(): All rows from y, matching rows from x.
fuzzystring_full_join(): All rows from both tables.
fuzzystring_semi_join(): Rows from x that have a match in y.
fuzzystring_anti_join(): Rows from x that don’t have a match in y.
If you prefer a single entry point, you can use fuzzystring_join() directly by specifying mode.
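The join families above differ only in which unmatched rows they keep. As a rough illustration, the sketch below reproduces inner, semi, anti, and left semantics on the same example data using only base R. It is not how fuzzystring works internally: it relies on utils::adist() (plain Levenshtein distance, whereas the default method shown in this README is OSA), and the threshold of 1 is chosen here only so that one row of x stays unmatched.

```r
# Same example tables as above.
x_join <- data.frame(
  name = c("Idea", "Premiom", "Very Good", "Gooood"),
  id = 1:4
)
y_join <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood", "Good"),
  grp = c("A", "B", "C", "D")
)

# Pairwise Levenshtein distances (rows = x, columns = y), base R only.
d <- utils::adist(x_join$name, y_join$approx_name)

# Illustrative threshold: "Gooood" vs "Good" needs 2 edits, so it will not match.
max_dist <- 1
hits <- which(d <= max_dist, arr.ind = TRUE)

# Inner: one output row per matching (x, y) pair.
inner <- cbind(
  x_join[hits[, "row"], , drop = FALSE],
  y_join[hits[, "col"], , drop = FALSE]
)

# Semi: x rows with at least one match. Anti: x rows with none.
has_match <- rowSums(d <= max_dist) > 0
semi <- x_join[has_match, , drop = FALSE]
anti <- x_join[!has_match, , drop = FALSE]

# Left: inner rows plus unmatched x rows, with the y columns filled with NA.
pad <- y_join[rep(NA_integer_, nrow(anti)), , drop = FALSE]
left <- rbind(inner, cbind(anti, pad))

print(inner)
print(anti)
print(left)
```

Right and full joins follow the same pattern with the roles of x and y swapped or combined.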
You can choose from various distance metrics provided by the
stringdist package:
# Optimal String Alignment (default)
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "osa")
# Damerau-Levenshtein
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "dl")
# Jaro-Winkler (good for names)
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "jw")
# Soundex (phonetic matching)
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "soundex")
You can match on multiple string columns at once. The same distance method and threshold are applied to each mapped column.
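To make the per-column rule concrete, the sketch below calls stringdist directly on two small pairs of name vectors. Two points are assumptions rather than documented behavior: that a pair is kept only when every mapped column is within the threshold, and that computing one distance matrix per column is a fair stand-in for what the package does. All variable names are hypothetical. The sketch also shows why the choice of method matters: "Maira" vs "Maria" is one edit under OSA but two under plain Levenshtein.

```r
# Runs only if the stringdist package is installed.
if (requireNamespace("stringdist", quietly = TRUE)) {
  first_x <- c("Jon", "Maira")
  first_y <- c("John", "Maria")
  last_x <- c("Smyth", "Gonzales")
  last_y <- c("Smith", "Gonzalez")
  max_dist <- 1

  # One distance matrix per mapped column (rows = x, columns = y).
  d_first <- stringdist::stringdistmatrix(first_x, first_y, method = "osa")
  d_last <- stringdist::stringdistmatrix(last_x, last_y, method = "osa")

  # ASSUMPTION: a pair matches only if every column is within max_dist.
  pairs_osa <- which(d_first <= max_dist & d_last <= max_dist, arr.ind = TRUE)

  # Same check with plain Levenshtein ("lv"): the adjacent swap costs 2 edits.
  d_first_lv <- stringdist::stringdistmatrix(first_x, first_y, method = "lv")
  d_last_lv <- stringdist::stringdistmatrix(last_x, last_y, method = "lv")
  pairs_lv <- which(d_first_lv <= max_dist & d_last_lv <= max_dist, arr.ind = TRUE)

  print(pairs_osa)
  print(pairs_lv)
}
```

With method = "jw" or method = "soundex" the scale changes: stringdist returns values between 0 and 1 for the former and either 0 or 1 for the latter, so a threshold tuned for edit counts does not carry over.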
x_multi <- data.frame(
  first = c("Jon", "Maira"),
  last = c("Smyth", "Gonzales")
)
y_multi <- data.frame(
  first_ref = c("John", "Maria"),
  last_ref = c("Smith", "Gonzalez"),
  customer_id = 1:2
)
fuzzystring_inner_join(
  x_multi, y_multi,
  by = c(first = "first_ref", last = "last_ref"),
  method = "osa",
  max_dist = 1
)
fuzzystring now keeps more of the join execution on
a compiled C++ path while using data.table to orchestrate
candidate generation. In practice this means compiled row expansion and
binding across join modes, better preservation of typed columns, and
adaptive candidate planning that helps both duplicate-heavy and
low-duplication workloads.
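The general idea of planning candidates before computing distances can be illustrated with two generic tricks: comparing each distinct string only once, and skipping pairs whose lengths differ by more than the threshold (an edit distance can never be smaller than the difference in lengths). The base-R sketch below illustrates that idea only; it is not a description of fuzzystring's actual planner, and every variable name in it is hypothetical.

```r
x <- c("Idea", "Idea", "Premiom", "Idea", "Good")   # duplicate-heavy input
y <- c("Ideal", "Premium", "Very Good", "Good")
max_dist <- 1

# 1. Deduplicate: each distinct x value is compared only once.
ux <- unique(x)

# 2. Length filter: pairs whose lengths differ by more than max_dist
#    cannot be within max_dist edits, so they are never evaluated.
len_ok <- abs(outer(nchar(ux), nchar(y), "-")) <= max_dist
cand <- which(len_ok, arr.ind = TRUE)

# 3. Evaluate distances only for the surviving candidate pairs.
d <- mapply(
  function(i, j) utils::adist(ux[i], y[j])[1, 1],
  cand[, "row"], cand[, "col"]
)
matched <- cand[d <= max_dist, , drop = FALSE]

# 4. Expand the unique-level matches back to the original rows of x.
pairs <- merge(
  data.frame(x_row = seq_along(x), u = match(x, ux)),
  data.frame(u = matched[, "row"], y_row = matched[, "col"]),
  by = "u"
)
pairs <- pairs[order(pairs$x_row), c("x_row", "y_row")]

cat(nrow(cand), "distance evaluations instead of", length(x) * length(y), "\n")
print(pairs)
```

The length filter is only valid for edit-count metrics such as "osa", "lv", or "dl"; it would not be a safe shortcut for "jw" or "soundex".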
For a dedicated comparison against
fuzzyjoin::stringdist_join(), see the benchmark article
bundled with the package.