---
title: "Tier Matching"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Tier Matching}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup, include = F}
library(fedmatch)
library(data.table)
data("corp_data1", package = "fedmatch")
data("corp_data2", package = "fedmatch")
data.table::setDT(corp_data1)
data.table::setDT(corp_data2)
```
# Overview
`tier_match` is the ultimate wrapper function in `fedmatch.` `tier_match` puts together all of the pieces from the package into one function, letting the user perform many matches in one call. The function is excellent both as an exploratory tool, while the user is still figuring out how they want to execute their matches, and as a final matching tool that can be used in production code.
'tiers' of a match are useful because there are hierarchies of matches. An exact name match between two companies is a higher-quality match than a fuzzy match, and fuzzy matches with various levels of cleaning can be different levels of quality.
# Syntax
The syntax of `tier_match` is providing a core list of arguments to the function itself, and then passing a named list to the tier match. Each element in this list is itself a list, each of which is a tier to match on, and it contains all of the arguments necessary for that tier. All of these arguments will be passed to 'merge_plus' in sequence, and each of the matches from each tier are saved and combined.
```{r}
tier_list <- list(
a = build_tier(match_type = "exact"),
b = build_tier(match_type = "fuzzy"),
c = build_tier(match_type = "multivar", multivar_settings = build_multivar_settings(
logit = NULL, missing = FALSE, wgts = 1,
compare_type = "stringdist", blocks = NULL, blocks.x = NULL, blocks.y = NULL,
top = 1, threshold = NULL
))
)
# tier_list
```
This list will perform three matches: 'a', an exact match; 'b', a fuzzy match, and 'c', a multivar match. We can get a bit fancier and add more settings to each, if we'd like. Remember that each element of each tier has to be an argument for `merge_plus`.
```{r}
tier_list_v2 <- list(
a = build_tier(match_type = "exact", clean = TRUE),
b = build_tier(match_type = "fuzzy", clean = TRUE,
fuzzy_settings = build_fuzzy_settings(method = "wgt_jaccard",
maxDist = .7,
nthread = 1),
clean_settings = build_clean_settings(remove_words = TRUE)),
c = build_tier(match_type = "multivar",
multivar_settings = build_multivar_settings(
logit = NULL, missing = FALSE, wgts = 1,
compare_type = "stringdist", blocks = NULL, blocks.x = NULL, blocks.y = NULL,
top = 1, threshold = NULL
))
)
```
Let's take a look at the rest of the syntax for `tier_match`:
```{r}
result <- tier_match(corp_data1, corp_data2,
by.x = "Company", by.y = "Name",
unique_key_1 = "unique_key_1", unique_key_2 = "unique_key_2",
tiers = tier_list_v2, takeout = "neither", verbose = TRUE,
score_settings = build_score_settings(score_var_x = "Company",
score_var_y = "Name",
wgts = 1,
score_type = "stringdist")
)
```
There are two types of arguments for `tier_match`: those that can be passed to `merge_plus`, and those that are unique to `tier_match`. If anything of the `merge_plus` arguments are listed in `tier_match` directly (rather than in `tier_list`), those arguments are used in every tier. In this example, we are always matching on 'Company' and 'Name,' so those are placed in the arguments for tier_match directly. The arguments unique to `tier_match` and their defaults are:
- `tiers` is the tier list create by iterations of `build_tier()`. Required, no default.
- `takeout` is a character vector, either "neither", "both", "data1", or "data2". These settings describe whether or not to take out matches in between each tier, and if so, what dataset to remove the matches for.
- `verbose` is a boolean. If `TRUE`, prints tier names and time taken to match each tier.
The other arguments are all present in `merge_plus`, see documentation there for details.
The result for tier_match is a list with 4 items: the matched dataset, the unmatched data, and a match evaluation.
Here's what the matches look like:
```{r}
result$matches[1:5]
```
As you can see, the matches dataset has a column called 'tier' that indicates which tier the match was from. It also adds any additional columns added by the matching process. In this example, we see 'Company_score', created from the from the post-hoc scoring; 'wgt_jaccard_sim', the Weighted Jaccard similarity, created when using the 'wgt_jaccard' setting of `fuzzy_match` (see the 'Fuzzy-matching' vignette for more details); and 'Company_compare', created from the multivar matching tier.
We also have a match evaluation, now filled out with more details broken down by tier:
```{r}
result$match_evaluation
```
We can use this evaluation to figure out which tiers did the 'best' job matching, getting the most unique matches.