Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore empty fields while joining #1194

Open
fgvieira opened this issue Feb 15, 2023 · 10 comments
Open

Ignore empty fields while joining #1194

fgvieira opened this issue Feb 15, 2023 · 10 comments
Assignees

Comments

@fgvieira
Copy link

fgvieira commented Feb 15, 2023

Dear all,

I've only recently came across Miller but it is quite impressive!

However, I am now trying to join two CSV files by a column but would like to ignore empty fields. For example:

id,code
3,0000ff
2,00ff00
4,ff0000
,ffffff
,000000
id,color
4,red
2,green
,white
,black

Joining on color the results are as expected:

mlr --csv join -j id -f color-codes.csv color-names.csv

id,code,color
4,ff0000,red
2,00ff00,green

Is this possible?
thanks,

@aborruso
Copy link
Contributor

You have empty values in id field and you use it to make the join, then you will have it in join standard output.

But you can add the filter verb, and remove all records in which id is null

mlr --csv join -j id -f color-codes.csv then filter -x 'is_null($id)' color-names.csv

And you get

id,code,color
4,ff0000,red
2,00ff00,green

@fgvieira
Copy link
Author

But then how could I use other joinoptions?
For example, --ul:

$ mlr --csv join --ul -j id -f color-codes.csv then unsparsify --fill-with "" color-names.csv

should give:

id,code,color
4,ff0000,red
2,00ff00,green
3,0000ff,
,ffffff,
,000000,

@aborruso
Copy link
Contributor

But then how could I use other joinoptions?

I'm sorry, probably I don't understand what your goal is.
I thought you wanted to ignore id empty cells. But I'm wrong.

Could you add the sample output you want, starting from the example input?

Thank you

@fgvieira
Copy link
Author

Sorry, maybe I was not clear, but my goal would be to join to CSV files treating empty fields differently. As I see it, right now, empty fields are treated as a normal string, just empty.

I was wondering if it would be possible to treat them as "unknown" values (i.e. NA), in a way that the user could decide how to deal with them: either they never match anything (not even other empty fields) or that they always match.

An example of the first case would be the one on my previous comment: empty fields are not matched but kept from the left file.

@aborruso
Copy link
Contributor

Once again, probably, I don't understand what your goal is. And once again I'm sorry.

If you run a standard join, by id, Miller joins also the empty cells. You have 2 nulls rows in the left and 2 nulls rows in the right, then in the join output you have 2x2=4 rows

mlr --csv join -j id -f color-codes.csv  color-names.csv

image

If you want to add N/A values you can run

mlr --csv join -j id -f color-codes.csv then fill-empty color-names.csv

and you get

id,code,color
4,ff0000,red
2,00ff00,green
N/A,ffffff,white
N/A,000000,white
N/A,ffffff,black
N/A,000000,black

But I think I didn't answer you :(

@aborruso
Copy link
Contributor

If you run

mlr --csv join --ul -j id -f color-codes.csv then unsparsify then put 'if(is_null($id)) {$color=""}' then uniq -a then sort -n id color-names.csv

you get

id,code,color
2,00ff00,green
3,0000ff,
4,ff0000,red
,ffffff,
,000000,

@fgvieira
Copy link
Author

I am trying to treat empty strings as missing data, meaning that they are ignored when matching. For example:

==> color-codes.csv <==

id1,id2,code
1,a,0000ff
2,b,00ff00
,c,ffffff
,d,000000
5,e,ff0000
6,,ff33cc

==> color-names.csv <==

id1,id2,color
,a,blue
2,b,green
3,,white
,d,black
5,,red
,,pink

I'd like (order is not relevant):

$mlr --csv join -j id1,id2 -f color-codes.csv then unsparsify --fill-with "" color-names.csv
id1,id2,code,color
1,a,0000ff,blue
2,b,00ff00,green
,d,000000,black
5,e,ff0000,red

or:

$ mlr --csv join --ul -j id1,id2 -f color-codes.csv then unsparsify --fill-with "" color-names.csv
id1,id2,code,color
1,a,0000ff,blue
2,b,00ff00,green
,c,ffffff,
,d,000000,black
5,e,ff0000,red
6,,ff33cc,

@aborruso
Copy link
Contributor

I am trying to treat empty strings as missing data, meaning that they are ignored when matching.

ok, this is not a join. I think you should create a for loop Miller script

@johnkerl
Copy link
Owner

johnkerl commented Mar 1, 2023

@fgvieira sorry for the delay in replying. This can be done by the creation of some new flags for the join verb.

@johnkerl johnkerl self-assigned this Mar 1, 2023
@johnkerl johnkerl changed the title ignore empty fields while joining Ignore empty fields while joining Mar 2, 2023
@johnkerl
Copy link
Owner

Oops wrong issue sorry!

@johnkerl johnkerl reopened this Aug 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants
@aborruso @fgvieira @johnkerl and others