Using reweight.names in fastlink() returns only completely NA rows #62

Open
brittlh opened this issue Jul 20, 2022 · 5 comments

@brittlh

brittlh commented Jul 20, 2022

I've run fastLink() both with and without the reweight.names option. Without it, the data are matched without issue.

Code:

fastLink(
  dfA = dfA, dfB = dfB,
  varnames = c("first", "last", "company"),
  stringdist.match = c("first", "last", "company"),
  stringdist.method = "lv",
  return.df = TRUE,
  reweight.names = TRUE,
  firstname.field = "first",
  dedupe.matches = FALSE,
  verbose = TRUE
)

With reweight.names = TRUE, the matched output contains rows in which every field is NA:

image

Any idea what's gone wrong here? Thank you for looking into this.

@tedenamorado
Collaborator

Hi,

Your code looks OK. Do you happen to have a reproducible example you could share with us? More than happy to take a look.
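
For reference, a reproducible example could look roughly like the sketch below. It rests on several assumptions: that the `samplematch` data bundled with fastLink provides `dfA` and `dfB` with `firstname`, `lastname`, and `city` columns, and that with `return.df = TRUE` the result carries `dfA.match` and `dfB.match` data frames. `count_all_na_rows()` and `run_repro()` are hypothetical helpers, not part of fastLink.

```r
# Hypothetical helper (not part of fastLink): count rows where every field is NA.
count_all_na_rows <- function(df) {
  sum(rowSums(!is.na(df)) == 0)
}

# Hypothetical wrapper: rerun the call from the report on fastLink's sample data.
run_repro <- function() {
  if (!requireNamespace("fastLink", quietly = TRUE)) {
    stop("fastLink is not installed")
  }
  # Assumption: this loads dfA and dfB into the function's environment.
  data("samplematch", package = "fastLink", envir = environment())
  out <- fastLink::fastLink(
    dfA = dfA, dfB = dfB,
    varnames = c("firstname", "lastname", "city"),
    stringdist.match = c("firstname", "lastname", "city"),
    stringdist.method = "lv",
    return.df = TRUE,
    reweight.names = TRUE,
    firstname.field = "firstname",
    dedupe.matches = FALSE
  )
  # Assumption: return.df = TRUE yields dfA.match / dfB.match data frames.
  c(
    all_na_rows_A = count_all_na_rows(out$dfA.match),
    all_na_rows_B = count_all_na_rows(out$dfB.match)
  )
}
```

Calling `run_repro()` with and without `reweight.names = TRUE` would show whether the all-NA rows also appear on the packaged sample data or only on the original datasets.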

All my best,

Ted

@brittlh
Author

brittlh commented Jul 21, 2022

I wasn't able to create a scaled-down reproducible example, so I took a simple random sample (SRS) of 10% of each of the two datasets I'm working with and tried again. This time I got 18 rows back, of which 8 were all-NA rows and 10 were matched rows. Is it possible the issue is linked to the size of the datasets? (dfA has about 1k rows, dfB about 220k.)
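
A minimal sketch of the sampling step, in base R. The helper name `sample_rows()`, the seed, and the toy data frame are all made up for illustration; the real dfA and dfB would take the place of `toy`.

```r
# Hypothetical helper: simple random sample (without replacement) of a
# fraction of the rows of a data frame.
sample_rows <- function(df, frac = 0.10) {
  n <- round(frac * nrow(df))
  df[sample.int(nrow(df), size = n), , drop = FALSE]
}

set.seed(62)  # arbitrary seed so the sample can be reproduced

# Toy stand-in for one of the real datasets.
toy <- data.frame(
  id = 1:200,
  first = rep(c("ann", "bob"), 100),
  stringsAsFactors = FALSE
)
toy_srs <- sample_rows(toy, frac = 0.10)
nrow(toy_srs)  # 20 rows, i.e. 10% of 200
```

One caveat with this approach: if each dataset is sampled independently at 10%, both records of a given one-to-one true match survive with probability of only about 0.1 × 0.1 = 1%, so the sampled datasets may overlap very little.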

@tedenamorado
Collaborator

Hi,

Are there NAs in the name variable?

All my best,

Ted

@brittlh
Author

brittlh commented Aug 12, 2022

Ted,

I did the check: there are no NAs, but there were 2 blank strings (""). After filtering those out for testing, I reran fastLink and got the same result as I described above.
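
For anyone wanting to repeat the check, it can be sketched in base R as below. The toy data frame is invented, with column names mirroring the call in the original report. Recoding blanks to NA is one possible way to handle them; I have not confirmed how fastLink itself treats empty strings.

```r
# Toy stand-in for the real data.
toy <- data.frame(
  first = c("ann", "", "bob", NA, "  "),
  last  = c("lee", "ray", "", "kim", "fox"),
  stringsAsFactors = FALSE
)

# Count NA values and blank (empty or whitespace-only) strings per column.
n_na    <- sapply(toy, function(x) sum(is.na(x)))
n_blank <- sapply(toy, function(x) sum(!is.na(x) & trimws(x) == ""))

# Recode blanks as NA so they count as missing rather than as a value.
toy_clean <- toy
toy_clean[] <- lapply(toy_clean, function(x) {
  x[!is.na(x) & trimws(x) == ""] <- NA
  x
})
```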

Appreciate your help. I'm going to keep looking into this in my spare time and see if any other data anomalies catch my attention that might trigger this issue.

@aalexandersson

Disclaimer: I am a regular fastLink user, not a fastLink developer.

Is the scaled-down dataset dfA about 1K rows or about 100 rows? Do the datasets look fine after being read in? Approximately how much missingness is there? How many exact matches are there? Can you show the linkage patterns for the 18 returned rows? Little or no overlap could be the cause...
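
A rough way to answer the missingness and exact-match questions in base R is sketched below. The two toy data frames are invented stand-ins for dfA and dfB, and the key columns are taken from the call in the original report. This does not cover the linkage patterns, which have to come from the fastLink output itself.

```r
# Toy stand-ins for dfA and dfB.
toy_a <- data.frame(
  first   = c("ann", "bob", NA),
  last    = c("lee", "ray", "kim"),
  company = c("acme", "acme", "initech"),
  stringsAsFactors = FALSE
)
toy_b <- data.frame(
  first   = c("ann", "rob", "cat"),
  last    = c("lee", "ray", "kim"),
  company = c("acme", "acme", NA),
  stringsAsFactors = FALSE
)

keys <- c("first", "last", "company")

# Share of missing values per linkage field in each dataset.
miss_a <- colMeans(is.na(toy_a[keys]))
miss_b <- colMeans(is.na(toy_b[keys]))

# Exact matches: inner join on all keys, using only rows with complete keys.
# Duplicated key combinations would be counted once per matching pair.
exact <- merge(
  toy_a[complete.cases(toy_a[keys]), keys],
  toy_b[complete.cases(toy_b[keys]), keys],
  by = keys
)
n_exact <- nrow(exact)
```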

Anders
