Using reweight.names in fastLink() returns only completely NA rows #62
Comments
Hi, your code looks OK. Do you happen to have a reproducible example you could share with us? More than happy to take a look. All my best, Ted
I wasn't able to create a reproducible scaled-down example, so I took a simple random sample (SRS) of 10% of each of the two datasets I'm working with and tried again. This time I got 18 rows back, of which 8 were NA rows and 10 were match rows. Could the issue be linked to the size of the datasets? (dfA has about 1k rows, dfB about 220k.)
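A note on sampling both files independently: if true matches are roughly one-to-one, a pair survives only when both of its records are drawn, so 10% samples of each file are expected to keep about 1% (0.1 × 0.1) of the true pairs. The base-R sketch below illustrates this on hypothetical toy data in which a shared `id` column marks the true pairs; `sample_rows` is a helper defined here, not a fastLink function, and the real dfA/dfB need not behave this way.

```r
set.seed(62)  # arbitrary seed, for reproducibility only

# Hypothetical toy files: every record in dfA has exactly one true match in dfB (same id).
dfA <- data.frame(id = 1:1000)
dfB <- data.frame(id = 1:5000)

# Helper defined here (not part of fastLink): simple random sample of a fraction of rows.
sample_rows <- function(df, frac) {
  n <- max(1, round(nrow(df) * frac))
  df[sample.int(nrow(df), n), , drop = FALSE]
}

subA <- sample_rows(dfA, 0.10)
subB <- sample_rows(dfB, 0.10)

# True pairs that survive when both files are sampled independently.
both_sampled <- length(intersect(subA$id, subB$id))

# True pairs that survive when dfA is kept whole and only dfB is sampled.
only_b_sampled <- length(intersect(dfA$id, subB$id))

cat("pairs kept, both sampled:", both_sampled, "\n")
cat("pairs kept, only dfB sampled:", only_b_sampled, "\n")
```

If a scaled-down test is needed, keeping the smaller file whole and sampling only the larger one preserves more of the overlap, which may make the scaled-down run more comparable to the full one.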
Hi, are there NAs in the name variable? All my best, Ted
Ted, I did the check: no NAs, but there were 2 blank strings (""). After filtering those out for testing, I reran fastLink and got the same result as I described above. Appreciate your help. I'm going to keep looking into this in my spare time to see whether any other data anomalies that might trigger this issue catch my attention.
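For anyone repeating this check, here is a minimal base-R sketch that counts NA and blank entries in each linkage field and recodes blanks to NA. The data frame is a hypothetical stand-in for dfA (only the column names come from the call in this thread), and `count_missing` is a helper defined here, not part of fastLink. Whether fastLink treats blank strings the same way as NA is not confirmed in this thread, so the recoding step is a precaution rather than a known fix.

```r
# Hypothetical stand-in for dfA; column names follow the fastLink call in this thread.
dfA <- data.frame(
  first   = c("Ann", "", NA, "Bob"),
  last    = c("Lee", "Kim", "Roy", "  "),
  company = c("Acme", "Acme", "", "Initech"),
  stringsAsFactors = FALSE
)

fields <- c("first", "last", "company")

# Helper defined here: count NA and blank (empty or whitespace-only) entries per field.
count_missing <- function(df, fields) {
  t(sapply(df[fields], function(x) {
    c(n_na    = sum(is.na(x)),
      n_blank = sum(!is.na(x) & trimws(x) == ""))
  }))
}

missing_counts <- count_missing(dfA, fields)
print(missing_counts)

# Recode blanks to NA so they are handled as missing values, not as a name value.
dfA_clean <- dfA
dfA_clean[fields] <- lapply(dfA_clean[fields], function(x) {
  x[!is.na(x) & trimws(x) == ""] <- NA
  x
})
```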
Disclaimer: I am a regular fastLink user, not a fastLink developer. Is the scaled-down dfA about 1K rows or about 100 rows? Do the datasets look fine once read in? Approximately how much missingness is there? How many exact matches are there? Can you show the linkage patterns for the 18 returned rows? Little or no overlap could be the cause... Anders
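The exact-match and overlap questions can be answered without fastLink. A rough base-R sketch, assuming both files share the three linkage fields used in this thread; the toy rows are hypothetical:

```r
# Hypothetical toy files sharing the linkage fields from this thread.
dfA <- data.frame(
  first   = c("Ann", "Bob", "Cy", "Dan"),
  last    = c("Lee", "Kim", "Roy", "Diaz"),
  company = c("Acme", "Initech", "Acme", "Globex"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  first   = c("Ann", "Bobby", "Cy", "Dee"),
  last    = c("Lee", "Kim", "Roy", "Diaz"),
  company = c("Acme", "Initech", "Acme", "Umbrella"),
  stringsAsFactors = FALSE
)

fields <- c("first", "last", "company")

# Rows that agree exactly on all three fields (an inner join on the linkage fields).
exact   <- merge(dfA[fields], dfB[fields], by = fields)
n_exact <- nrow(exact)

# Per-field overlap: share of dfA values that appear anywhere in dfB.
field_overlap <- sapply(fields, function(f) mean(dfA[[f]] %in% dfB[[f]]))

cat("exact matches on all fields:", n_exact, "\n")
print(field_overlap)
```

A very low exact-match count or per-field overlap would be consistent with the "little or no overlap" explanation, though it would not by itself explain the NA rows.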
I've run the fastLink function both with and without the reweight.names option to confirm that the data otherwise match without issue.
Code:
fastLink(
  dfA = dfA, dfB = dfB,
  varnames = c("first", "last", "company"),
  stringdist.match = c("first", "last", "company"),
  stringdist.method = "lv",
  return.df = TRUE,
  reweight.names = TRUE,
  firstname.field = "first",
  dedupe.matches = FALSE,
  verbose = TRUE
)
The matched data output includes NA cases; each field for each case is "NA":
Any idea what's gone wrong here? Thank you for looking into this.