Usability/speed improvements #40

Open
@jveitchmichaelis

Description

Hi,

Thanks for providing this - it's very helpful. I think a few changes would improve things from a UI/usability perspective:

  • Filtering by tag would be helpful. A typical use case for me is exporting only, e.g., the loss curve and a couple of validation outputs. I've added this as a command-line option (--tags), with a minor modification to the code:
# Same check for missing tags, just check for `filter_tags` instead:
if not filter_tags:
    all_tags = {tag for tags in tags_in_each_dir for tag in tags}
else:
    all_tags = set(filter_tags)

...

tag_list = accumulator.scalar_tags if not filter_tags else filter_tags

... # process everything as usual

You have to load all events anyway, but it makes the output files a lot cleaner if you only want one or two things. If you also log e.g. device stats, that can add tens of unnecessary columns to the output.

  • Loading an event file seems to take ages. For operations like checking for missing tags, this can mean waiting 10 minutes before the script crashes (if the lax options aren't enabled). From a look at the code, the entire event file is processed in order to build up e.g. the scalar tag list. I've not been able to beat that, but I wonder how TensorBoard manages to give an instant overview (even if it doesn't show all the points, it gets a tag list basically instantly - is it cached somewhere?)
  • Support parallel event loading. This is pretty easy, but it depends on multiprocess rather than multiprocessing for pickle support. It's basically a linear speed-up, which can be pretty significant (e.g. my files take 90 seconds to load and I have 10). I don't think the RAM requirement changes, because the code currently loads all the events into memory at once anyway.
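import multiprocess as mp  # third-party package, used here instead of multiprocessing
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def load_event_mp(dirname):
    # Reload() parses the event files and returns the accumulator itself
    return EventAccumulator(dirname).Reload()

# num_processes comes from the new --n_proc option; input_dirs is the list of run directories
with mp.Pool(processes=num_processes) as pool:
    accumulators = pool.map(load_event_mp, input_dirs)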
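
For comparison, here is a rough sketch of the same fan-out using only the standard library. It assumes the worker is a module-level function and returns plain data (dicts, lists, tuples) rather than the accumulator object. `load_scalars` and `load_all` are hypothetical names for illustration, not part of tb-reducer, and the stand-in worker returns dummy values. Whether this removes the need for multiprocess with real accumulators is not something I have verified.

```python
from concurrent.futures import ProcessPoolExecutor


def load_scalars(dirname):
    """Hypothetical stand-in for the slow per-directory load.

    A real worker would build the accumulator here and copy out only the
    scalars it needs, so that plain picklable data crosses the process
    boundary. The values below are dummy data.
    """
    return {"dir": dirname, "train/loss": [(0, 1.0), (1, 0.5)]}


def load_all(input_dirs, num_processes, loader=load_scalars,
             executor_cls=ProcessPoolExecutor):
    """Run `loader` on every directory in parallel, keeping input order."""
    with executor_cls(max_workers=num_processes) as pool:
        # map() yields results in the order of input_dirs, not completion order
        return list(pool.map(loader, input_dirs))
```

When run as a script, the call to `load_all(...)` should sit under an `if __name__ == "__main__":` guard so that worker processes started with the spawn method do not re-run it on import.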

A typical CLI invocation would be:

tb-reducer /mnt/data/logs/* --lax-steps -o /mnt/data/aggregate.csv -r mean,std,min,max --tags train/loss --verbose --n_proc 16
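
As a rough illustration of how the two new options in that invocation could be wired up, here is a minimal sketch. It assumes `--tags` takes a comma-separated list (by analogy with `-r mean,std,min,max`) and covers only a subset of the flags. `parse_args` and `select_tags` are hypothetical helpers, not tb-reducer's actual code. `select_tags` follows the snippet above: with no filter it takes the union of all tags, otherwise it keeps only the requested ones and, as an added assumption, fails fast when a requested tag is absent from a run.

```python
import argparse


def parse_args(argv):
    """Parse a subset of the options shown above (hypothetical parser)."""
    parser = argparse.ArgumentParser(prog="tb-reducer")
    parser.add_argument("input_dirs", nargs="+")
    # Comma-separated list; leaving the flag out means "keep every tag"
    parser.add_argument(
        "--tags", default=None,
        type=lambda s: [t.strip() for t in s.split(",") if t.strip()])
    parser.add_argument("--n_proc", type=int, default=1)
    return parser.parse_args(argv)


def select_tags(tags_in_each_dir, filter_tags=None):
    """Return the tags to process, failing fast if a requested tag is missing."""
    if not filter_tags:
        # No filter: union of every tag seen in any directory
        return sorted({tag for tags in tags_in_each_dir for tag in tags})
    for i, tags in enumerate(tags_in_each_dir):
        missing = set(filter_tags) - set(tags)
        if missing:
            raise ValueError(f"run {i} is missing tags: {sorted(missing)}")
    return list(filter_tags)
```

For example, `parse_args(["run_a", "run_b", "--tags", "train/loss", "--n_proc", "16"])` yields `tags == ["train/loss"]`, which can be passed straight to `select_tags` as `filter_tags`.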

I'd be happy to PR these.

Cheers


Labels: enhancement, perf
