Feature request: add flag to uniq to keep other columns #1075

Open
@janxkoci

Description

First of all, thanks for this amazing tool! It's incredibly helpful and makes my old data processing scripts look like a shameful mess 😆

Problem

I have a table of model fits, similar to the following:

mlr -t --from models.tsv head -n 4
model score hash residual
59 431.674225795181 0b9877754c1eb555da41f4ba1535971d 13.0566078428924
46 431.686883771386 b636c65972858d19e704de0eeccf596d 13.0566078428927
53 431.733440540124 c6bac1fcdc9c7c7329b4bcb9eaaa7100 13.0566078428927
47 431.736046806062 f5c83b037e50307098333d9077a64719 13.0566078428926

I'd like to do some processing with miller, like this:

mlr -t --from models.tsv sort -n score then filter '$residual < 4' then uniq -f hash
hash
0b9877754c1eb555da41f4ba1535971d
b636c65972858d19e704de0eeccf596d
c6bac1fcdc9c7c7329b4bcb9eaaa7100
f5c83b037e50307098333d9077a64719

The idea is to select the best models by likelihood (score), apply a threshold on residual, and then deduplicate models by hash. Models with the same hash have the same topology, so I only need to keep the fit with the best likelihood score. The problem is that miller outputs only the column passed to uniq -f, i.e. hash.

Proposal

Add a new parameter to uniq (and count-distinct), e.g. -k, which keeps the rest of the columns after deduplicating on some field. The parameter could even accept a comma-separated list of columns to keep, keeping all by default.
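
To make the requested behaviour concrete, here is a minimal sketch of the semantics in plain awk, not Miller code. The function name dedup_keep_first and its arguments are hypothetical, and the -k flag itself does not exist in Miller. The sketch only shows "keep the whole first record for each distinct value of a named column" on tab-separated input with a header line:

```shell
# Hypothetical helper (not part of Miller): keep the first full record
# for each distinct value of the column named $1, reading TSV from file $2.
dedup_keep_first() {
  awk -F'\t' -v key="$1" '
    NR == 1 {
      # Find the index of the key column in the header line.
      for (i = 1; i <= NF; i++) if ($i == key) col = i
      if (!col) { print "no such column: " key > "/dev/stderr"; exit 1 }
      print
      next
    }
    # seen[] counts occurrences; a record is printed only the first time
    # its key value appears.
    !seen[$col]++
  ' "$2"
}

# Example call, assuming sorted.tsv holds the sorted and filtered records:
#   dedup_keep_first hash sorted.tsv
```

Because the input is already sorted by score, "first record per hash" is the same as "best-scoring record per hash". Miller's head verb also accepts -g for grouping, so head -n 1 -g hash may already cover this case; I have not checked it against this data.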

Notes

Note that I can achieve the same by piping to awk:

mlr -t --from models.tsv sort -n score then filter '$residual < 4' | awk '!a[$3]++' | mlr --t2m head -n 4
model score hash residual
90 10.0072275978147 0932dd9b3381d57da0157310229afb60 1.67068471265348
61 10.0072326462779 3d9babf8f0ecb7cddadbcfa20c358d48 1.67171955534361
163 10.0123984271416 be7fd5ff40f73afadcd29dd5f09ccfe6 1.67058665938258
58 10.0127850674847 cb6bdd043a131b41fa58d0adeb619b0b 1.66828481586713
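
For readers unfamiliar with the awk idiom: a[$3]++ evaluates to the number of times the third field has been seen so far and then increments it, so !a[$3]++ is true only on the first occurrence of each value. The header line passes through because the literal string hash is itself a first-seen value. A small self-contained illustration, using made-up rows rather than my real data (the function name first_per_hash is only for this example):

```shell
# Made-up three-column rows (whitespace-separated), not real model fits.
first_per_hash() {
  printf '%s\n' \
    'model score hash' \
    'm1 10.0 aaa' \
    'm2 10.5 bbb' \
    'm3 11.0 aaa' |
  awk '!a[$3]++'   # keeps the header, m1 and m2; m3 repeats hash aaa and is dropped
}

first_per_hash
```

This relies on awk's default whitespace splitting, which is fine here because no field contains spaces; for fields with embedded spaces, -F'\t' would be needed.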
