Tier Matching

Overview

tier_match is the top-level wrapper function in fedmatch. It combines the other pieces of the package into a single function, letting the user perform many matches in one call. The function is useful both as an exploratory tool, while the user is still deciding how to execute their matches, and as a final matching tool in production code.

‘Tiers’ of a match are useful because matches form a hierarchy of quality. An exact name match between two companies is a higher-quality match than a fuzzy match, and fuzzy matches with different levels of cleaning can themselves differ in quality.

Syntax

To call tier_match, provide a core set of arguments to the function itself, and pass a named list as the tiers argument. Each element of this list is itself a list that defines one tier and contains all of the arguments needed for that tier. These arguments are passed to ‘merge_plus’ one tier at a time, in sequence, and the matches from every tier are saved and combined.

library(fedmatch)

# Three tiers, run in order: exact, fuzzy, and multivar matching
tier_list <- list(
  a = build_tier(match_type = "exact"),
  b = build_tier(match_type = "fuzzy"),
  c = build_tier(match_type = "multivar", multivar_settings = build_multivar_settings(
    logit = NULL, missing = FALSE, wgts = 1,
    compare_type = "stringdist", blocks = NULL, blocks.x = NULL, blocks.y = NULL,
    top = 1, threshold = NULL
  ))
)

This list will perform three matches: ‘a’, an exact match; ‘b’, a fuzzy match; and ‘c’, a multivar match. We can add more settings to each tier if we’d like. Remember that each element of each tier has to be an argument for merge_plus.

tier_list_v2 <- list(
  a = build_tier(match_type = "exact", clean = TRUE),
  b = build_tier(
    match_type = "fuzzy", clean = TRUE,
    fuzzy_settings = build_fuzzy_settings(
      method = "wgt_jaccard",
      maxDist = .7,
      nthread = 1
    ),
    clean_settings = build_clean_settings(remove_words = TRUE)
  ),
  c = build_tier(
    match_type = "multivar",
    multivar_settings = build_multivar_settings(
      logit = NULL, missing = FALSE, wgts = 1,
      compare_type = "stringdist", blocks = NULL, blocks.x = NULL, blocks.y = NULL,
      top = 1, threshold = NULL
    )
  )
)
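
The idea behind a tier list, a named list in which every element holds the arguments for one tier, can be sketched in base R without fedmatch. This is only an illustration: describe_match is a hypothetical stand-in for merge_plus, and the plain lists in toy_tiers stand in for what build_tier() returns. Neither reflects the package's actual internals.

```r
# Hypothetical stand-in for merge_plus: it only reports the arguments it receives.
describe_match <- function(match_type = "exact", clean = FALSE) {
  sprintf("%s match (clean = %s)", match_type, clean)
}

# Each named element holds the arguments for one tier (illustrative values).
toy_tiers <- list(
  a = list(match_type = "exact"),
  b = list(match_type = "fuzzy", clean = TRUE)
)

# Call the stand-in once per tier, in list order, keeping results by tier name.
tier_calls <- vapply(
  toy_tiers,
  function(tier_args) do.call(describe_match, tier_args),
  character(1)
)
print(tier_calls)
```

Because the arguments are forwarded with do.call, a misspelled or unsupported argument name in a tier fails when that tier is called, which is one reason every element of a tier needs to be a valid argument of the function it is passed to.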

Let’s take a look at the rest of the syntax for tier_match:

result <- tier_match(corp_data1, corp_data2,
  by.x = "Company", by.y = "Name",
  unique_key_1 = "unique_key_1", unique_key_2 = "unique_key_2",
  tiers = tier_list_v2, takeout = "neither", verbose = TRUE,
  score_settings = build_score_settings(score_var_x = "Company",
                                        score_var_y = "Name",
                                        wgts = 1,
                                        score_type = "stringdist")
)
#> Matching tier 'a'...
#> Time elapsed: 0.01 secs.
#> Matching tier 'b'...
#> Time elapsed: 0.01 secs.
#> Matching tier 'c'...
#> Time elapsed: 0.04 secs.

There are two types of arguments for tier_match: those that can be passed to merge_plus, and those that are unique to tier_match. If any of the merge_plus arguments are given to tier_match directly (rather than inside the tier list), those arguments are used in every tier. In this example, we are always matching on ‘Company’ and ‘Name’, so those are placed in the arguments for tier_match directly. The arguments unique to tier_match and their defaults are:

  • tiers is the tier list created by repeated calls to build_tier(). Required, no default.
  • takeout is a character vector, either “neither”, “both”, “data1”, or “data2”. These settings describe whether or not to take out matches in between each tier, and if so, which dataset to remove the matches from.
  • verbose is a logical. If TRUE, prints tier names and the time taken to match each tier.

The other arguments are all present in merge_plus; see the documentation there for details.
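
To make the takeout options concrete, here is a minimal base R sketch of the idea. It is not fedmatch's implementation: run_tiers, exact_tier, lower_tier, and the two toy data frames are hypothetical names introduced for illustration. After each tier, matched rows are removed from one dataset, both, or neither before the next tier runs.

```r
# Toy data: one exact match (walmart) and one match that needs lower-casing.
x <- data.frame(key_1 = 1:2, name = c("walmart", "Apple"), stringsAsFactors = FALSE)
y <- data.frame(key_2 = 1:2, name = c("walmart", "apple"), stringsAsFactors = FALSE)

# Each toy tier returns the matched key pairs.
exact_tier <- function(x, y) merge(x, y, by = "name")[, c("key_1", "key_2")]
lower_tier <- function(x, y) {
  x$name <- tolower(x$name)
  y$name <- tolower(y$name)
  merge(x, y, by = "name")[, c("key_1", "key_2")]
}

run_tiers <- function(x, y, tiers, takeout = c("neither", "both", "data1", "data2")) {
  takeout <- match.arg(takeout)
  out <- list()
  for (tier_name in names(tiers)) {
    m <- tiers[[tier_name]](x, y)
    if (nrow(m) > 0) {
      m$tier <- tier_name
      out[[tier_name]] <- m
    }
    # Drop matched rows from the chosen dataset(s) before the next tier.
    if (takeout %in% c("both", "data1")) x <- x[!x$key_1 %in% m$key_1, , drop = FALSE]
    if (takeout %in% c("both", "data2")) y <- y[!y$key_2 %in% m$key_2, , drop = FALSE]
  }
  do.call(rbind, out)
}

toy_tiers <- list(a = exact_tier, b = lower_tier)
neither_result <- run_tiers(x, y, toy_tiers, takeout = "neither")
both_result <- run_tiers(x, y, toy_tiers, takeout = "both")
```

With “neither”, the walmart pair is found again by the second tier, so the same pair appears once per tier. With “both”, it is removed after the first tier and each pair is matched only once.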

The result of tier_match is a list with 4 items: the matched dataset, the unmatched data from each of the two input datasets, and a match evaluation. Here’s what the matches look like:

result$matches[1:5]
#> Key: <unique_key_2>
#>             Company Country  State   SIC Revenue unique_key_1 country
#>              <char>  <char> <char> <num>   <num>        <int>  <char>
#> 1:          walmart     USA     OH  3300     485            1     USA
#> 2:          walmart     USA     OH  3300     485            1     USA
#> 3:          Walmart     USA     OH  3300     485            1     USA
#> 4: Bershire Hataway     USA         2222     223            2     USA
#> 5:            apple     USA     CA  3384     215            3     USA
#>    state_code SIC_code earnings unique_key_2              Name matchscore
#>        <char>    <num>   <char>        <int>            <char>      <num>
#> 1:         OH     3380  490,000            1           walmart  1.0000000
#> 2:         OH     3380  490,000            1           walmart  1.0000000
#> 3:         OH     3380  490,000            1           Walmart  1.0000000
#> 4:         NE     2220  220,000            2 Bershire Hathaway  0.9882353
#> 5:         CA       NA  220,000            3    apple computer  0.8714286
#>    Company_score   tier Company_compare multivar_score
#>            <num> <fctr>           <num>          <num>
#> 1:     1.0000000      a              NA             NA
#> 2:     1.0000000      b              NA             NA
#> 3:     1.0000000      c       1.0000000      1.0000000
#> 4:     0.9882353      c       0.9882353      0.9882353
#> 5:     0.8714286      b              NA             NA

As you can see, the matches dataset has a column called ‘tier’ that indicates which tier each match came from. It also includes any additional columns added by the matching process. In this example, we see ‘Company_score’, created by the post-hoc scoring; ‘wgt_jaccard_sim’, the Weighted Jaccard similarity, created when using the ‘wgt_jaccard’ setting of fuzzy_match (see the ‘Fuzzy-matching’ vignette for more details); and ‘Company_compare’, created by the multivar matching tier.
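
As a small illustration of how the ‘tier’ and score columns might be used, the sketch below rebuilds a few columns of the five rows printed above as a plain data frame (values copied from that output) and then counts matches per tier and flags low-scoring matches. The name matches_head and the 0.9 cutoff are assumptions chosen for this example, not package defaults.

```r
# Selected columns of the five rows printed above, copied by hand.
matches_head <- data.frame(
  Company = c("walmart", "walmart", "Walmart", "Bershire Hataway", "apple"),
  Name = c("walmart", "walmart", "Walmart", "Bershire Hathaway", "apple computer"),
  Company_score = c(1, 1, 1, 0.9882353, 0.8714286),
  tier = c("a", "b", "c", "c", "b"),
  stringsAsFactors = FALSE
)

# How many of these matches came from each tier?
per_tier <- table(matches_head$tier)
print(per_tier)

# Flag matches whose post-hoc score falls below an illustrative cutoff of 0.9.
needs_review <- matches_head[matches_head$Company_score < 0.9, ]
print(needs_review)
```

The same subsetting could be applied to the full result$matches table to pull out the matches that deserve a manual check.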

We also have a match evaluation, now filled out with more details broken down by tier:

result$match_evaluation
#>      tier matches in_tier_unique_1 in_tier_unique_2 pct_matched_1 pct_matched_2
#>    <fctr>   <int>            <int>            <int>         <num>         <num>
#> 1:      a       2                2                2           0.2           0.2
#> 2:      b       7                7                7           0.7           0.7
#> 3:      c      10               10                9           1.0           0.9
#> 4:    all      19               10                9           1.0           0.9
#>    new_unique_1 new_unique_2
#>           <int>        <int>
#> 1:            2            2
#> 2:            5            5
#> 3:            3            2
#> 4:           NA           NA

We can use this evaluation to see which tiers did the ‘best’ job of matching, in the sense of contributing the most unique matches.
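
One way to read the evaluation programmatically is to rank the tiers by the number of new unique matches they contribute. The sketch below copies a few columns of the table printed above into a plain data frame; the names match_eval, by_tier, and best_tier are introduced here for illustration, and ranking by new_unique_1 is just one possible choice of criterion.

```r
# Selected columns of the evaluation printed above, copied by hand.
match_eval <- data.frame(
  tier = c("a", "b", "c", "all"),
  matches = c(2L, 7L, 10L, 19L),
  new_unique_1 = c(2L, 5L, 3L, NA),
  new_unique_2 = c(2L, 5L, 2L, NA),
  stringsAsFactors = FALSE
)

# Drop the summary row, then sort tiers by new unique matches in the first dataset.
by_tier <- match_eval[match_eval$tier != "all", ]
by_tier <- by_tier[order(by_tier$new_unique_1, decreasing = TRUE), ]
print(by_tier)

best_tier <- by_tier$tier[1]
```

In the printed evaluation, tier ‘b’ adds the most new unique matches for the first dataset (5), so it sorts to the top under this criterion.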