Feature request: add flag to uniq to keep other columns #1075
Description
First of all, thanks for this amazing tool! It's incredibly helpful and makes my old data processing scripts look like a shameful mess.
Problem
I have a table of model fits, similar to the following:
mlr -t --from models.tsv head -n 4
model | score | hash | residual |
---|---|---|---|
59 | 431.674225795181 | 0b9877754c1eb555da41f4ba1535971d | 13.0566078428924 |
46 | 431.686883771386 | b636c65972858d19e704de0eeccf596d | 13.0566078428927 |
53 | 431.733440540124 | c6bac1fcdc9c7c7329b4bcb9eaaa7100 | 13.0566078428927 |
47 | 431.736046806062 | f5c83b037e50307098333d9077a64719 | 13.0566078428926 |
I'd like to do some processing with Miller, like this:
mlr -t --from models.tsv sort -n score then filter '$residual < 4' then uniq -f hash
hash |
---|
0b9877754c1eb555da41f4ba1535971d |
b636c65972858d19e704de0eeccf596d |
c6bac1fcdc9c7c7329b4bcb9eaaa7100 |
f5c83b037e50307098333d9077a64719 |
The idea is to select the best models by likelihood (score), apply a threshold on residual, and then deduplicate models by hash. Models with the same hash have the same topology, so I can just keep the fit with the best likelihood score. The problem is that Miller outputs only the column passed to uniq -f, i.e. hash.
Proposal
Add a new parameter to uniq (and count-distinct), e.g. -k, which would keep the rest of the columns after deduplicating on some field. The parameter could even accept a comma-separated list of columns to keep, keeping all of them by default.
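To make the intended semantics concrete, here is a rough emulation in awk. It is only a sketch of the behaviour I have in mind, not anything Miller provides: the function name uniq_keep and its two arguments are made up for illustration, and it assumes tab-separated input with a header row. It does not validate the names in the keep list.

```shell
# Hypothetical emulation of the proposed behaviour (not a Miller feature):
# deduplicate a TSV on the column named by $1, keeping the first row seen for
# each value. $2 is an optional comma-separated list of columns to output;
# when it is empty, all columns are kept in their input order.
uniq_keep() {
  awk -F'\t' -v OFS='\t' -v key="$1" -v keep="${2:-}" '
    # Print the selected columns of the current record.
    function emit(   i, line) {
      line = $(out[1])
      for (i = 2; i <= n; i++) line = line OFS $(out[i])
      print line
    }
    NR == 1 {
      # Map header names to column positions.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!(key in col)) exit 1
      # Default to every column when no keep list is given.
      n = (keep == "") ? NF : split(keep, names, ",")
      for (i = 1; i <= n; i++) out[i] = (keep == "") ? i : col[names[i]]
      emit()
      next
    }
    # Emit a row only the first time its key value is seen.
    !seen[$(col[key])]++ { emit() }
  '
}
```

Because the first occurrence wins, sorting by score beforehand means the surviving row for each hash is the one with the best score, which is the behaviour the flag would need to guarantee.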
Notes
Note that I can achieve the same by piping to awk:
mlr -t --from models.tsv sort -n score then filter '$residual < 4' | awk '!a[$3]++' | mlr --t2m head -n 4
model | score | hash | residual |
---|---|---|---|
90 | 10.0072275978147 | 0932dd9b3381d57da0157310229afb60 | 1.67068471265348 |
61 | 10.0072326462779 | 3d9babf8f0ecb7cddadbcfa20c358d48 | 1.67171955534361 |
163 | 10.0123984271416 | be7fd5ff40f73afadcd29dd5f09ccfe6 | 1.67058665938258 |
58 | 10.0127850674847 | cb6bdd043a131b41fa58d0adeb619b0b | 1.66828481586713 |
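One caveat with the awk one-liner: awk splits on any whitespace by default, so a TSV field containing spaces would shift the column positions. A slightly more defensive variant, assuming tab-separated input with a header row and the key in the third column (the function name dedupe_col3 is made up for illustration):

```shell
# Hypothetical helper: keep the header line plus the first row for each
# distinct value of column 3, splitting strictly on tabs.
dedupe_col3() {
  awk -F'\t' 'NR == 1 || !seen[$3]++'
}

# Usage (same pipeline as above, with the awk step swapped out):
#   mlr -t --from models.tsv sort -n score then filter '$residual < 4' | dedupe_col3 | mlr --t2m head -n 4
```

This still relies on the column position rather than its name, which is part of why a native flag would be nicer.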