Design of Optimizer Checkpointing for Gemini Plugin #4140
Fridge003 started this conversation in Development | Core
Background
- ZeRO: paper
- Chunk+Gemini: paper
- Checkpoint System Design: github discussion
Problems
Gemini is developed as a parallel training strategy based on the ZeRO algorithm. To improve communication efficiency, Gemini assigns each model parameter to a chunk, where each chunk is filled with a fixed number of tensor elements. Data are transmitted chunk by chunk, so that bandwidth is utilized more efficiently and data locality is enhanced.
However, while developing checkpointing features (save & load) for Gemini's optimizer, we found that the mechanism of chunks might induce the following issues:
1. The state_dict of an optimizer needs to map each model parameter to an integer ID. When initializing an optimizer, PyTorch configures its param_groups member variable according to the list of parameters passed in by the user; param_groups decides the mapping between parameters and integer IDs (determined by the order of appearance in the passed-in parameter list), as well as the hyperparameters adopted by each parameter (see the example after this list). However, the Gemini algorithm modifies the param_groups configured by PyTorch before training, which complicates parameter management.
2. Every rank calls step() but only computes the shard of optimizer states kept on its local device, so the states of the same parameter can be distributed among different devices.
3. The saved checkpoint should be agnostic to the device configuration, so its file layout cannot simply follow the per-device shards.
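For reference, here is a minimal standalone example of the integer-ID mapping in a plain PyTorch optimizer state_dict (standard PyTorch behaviour, independent of Gemini):

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Run one forward/backward/step so that per-parameter states are created.
model(torch.randn(1, 4)).sum().backward()
optimizer.step()

state_dict = optimizer.state_dict()
# 'params' holds integer IDs, not tensors; the IDs follow the order in which
# the parameters were passed to the optimizer.
print(state_dict['param_groups'][0]['params'])   # [0, 1]
# 'state' maps those integer IDs to the per-parameter optimizer states.
print(sorted(state_dict['state'][0].keys()))     # ['exp_avg', 'exp_avg_sq', 'step']
```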
Solutions
Management of param_groups
We assume that the configuration of the passed-in model parameters and hyperparameters doesn't change between saving and loading. This loose assumption is the foundation of our design; in fact, the checkpointing system of PyTorch is also based on it.
Gemini calls the self.__init__optimizer() method during initialization of the ZeroOptimizer class (the optimizer wrapper class it uses). This method modifies the self.param_groups set by PyTorch: the value corresponding to the key 'params' in each param_group is replaced with a fake_param_list, where each fake_param is a dummy tensor, and parameters not stored on the local device are wiped out from self.param_groups. So the original implementation of Gemini discards the information in param_groups, leading to the first issue mentioned above.
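A simplified illustration of the modification described above (a sketch of the idea only, not ColossalAI's actual code; is_local_param is a placeholder for Gemini's check of whether a parameter is kept on the local device):

```python
import torch

def init_optimizer_original_sketch(param_groups, is_local_param):
    """Sketch of the original behaviour: real parameters are replaced by dummy
    'fake' parameters, and non-local parameters are dropped entirely."""
    for group in param_groups:
        fake_param_list = []
        for real_param in group['params']:
            if not is_local_param(real_param):
                continue  # parameters stored on other devices are wiped out
            # a dummy tensor stands in for the chunk-managed parameter
            fake_param_list.append(torch.nn.Parameter(torch.empty(0)))
        # the original ordering/ID information of param_groups is lost here
        group['params'] = fake_param_list
```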
To address this issue, we maintain several member variables in the ZeroOptimizer class; a sketch of these variables, and of the for loop that fills them, is given below.
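The following is an illustrative skeleton under the naming used in this post, not the real ColossalAI ZeroOptimizer; is_local_param is again a placeholder:

```python
import torch


class ZeroOptimizerSketch:
    """Illustrative skeleton mirroring the bookkeeping described in this post."""

    def __init__(self, optim: torch.optim.Optimizer, is_local_param):
        self.optim = optim
        # integer parameter ID -> original parameter object (known on every rank)
        self.id_to_real_params = {}
        # integer parameter ID -> fake (dummy) parameter, only for local parameters
        self.id_to_fake_params = {}
        # backup of the original param_groups: hyperparameters plus ordered ID lists
        self.param_groups_backup = []
        self.__init__optimizer(is_local_param)

    def __init__optimizer(self, is_local_param):
        param_id = 0
        for group in self.optim.param_groups:
            # keep the hyperparameters, but record parameters by integer ID
            group_backup = {k: v for k, v in group.items() if k != 'params'}
            group_backup['params'] = []
            fake_param_list = []
            for real_param in group['params']:
                # IDs follow the order of appearance in the original param_groups
                self.id_to_real_params[param_id] = real_param
                group_backup['params'].append(param_id)
                if is_local_param(real_param):
                    # dummy tensor standing in for the chunk-managed parameter
                    fake_param = torch.nn.Parameter(torch.empty(0))
                    fake_param_list.append(fake_param)
                    self.id_to_fake_params[param_id] = fake_param
                param_id += 1
            group['params'] = fake_param_list
            self.param_groups_backup.append(group_backup)
```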
In the for loop of the modified self.__init__optimizer() method (see the sketch above), each parameter is traversed in the order of the original param_groups. We can naturally obtain their IDs and the mapping from ID to parameter object (recorded in self.id_to_real_params). Meanwhile, the information of param_groups is backed up in self.param_groups_backup. If the current parameter is added to the fake_param_list of the current process, a mapping from its ID to its fake_param object is added to self.id_to_fake_params accordingly.
In this way, we can conveniently check whether a parameter is managed by the current process with `param_id in self.id_to_fake_params`, and fetch the fake parameter or real parameter object corresponding to an integer ID with `self.id_to_fake_params[param_id]` or `self.id_to_real_params[param_id]`. Thus any necessary information about a parameter can be obtained from its parameter ID.
Method of Collecting Optimizer States
As mentioned in the second issue above, the optimizer states of the same parameter can be distributed among different devices. To obtain the complete optimizer states before saving them to a checkpoint, we designate the device with rank 0 as the manager that gathers the state shards and writes them to disk. To implement this idea, we design a method for collecting shards of optimizer states, described below.
In this method, we first use the variable is_collector to check whether the current rank needs to collect the complete states (by default only the master rank does). Then the state shards on the local device are packed into a compacted tensor and exchanged among ranks with the torch.distributed.all_gather_object API. After all ranks have received the complete information of the optimizer states, the ranks whose is_collector is True update the collected states and return.
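A minimal sketch of this gathering step, assuming the shards are exchanged as Python dictionaries of flattened CPU tensors via torch.distributed.all_gather_object rather than packed into one compacted tensor (the helper name and shard layout are assumptions, not ColossalAI's actual method):

```python
import torch
import torch.distributed as dist

def collect_states_sketch(local_shard: dict, master_rank: int = 0) -> dict:
    """Sketch: gather the per-rank shards of one parameter's optimizer states
    and assemble the complete states on the collector rank.

    local_shard is assumed to map state names (e.g. 'exp_avg') to the flattened
    CPU slice of that state held on the local device.
    """
    is_collector = dist.get_rank() == master_rank

    # Every rank contributes its shard; all_gather_object fills `gathered`
    # with one entry per rank, ordered by rank.
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_shard)

    collected = {}
    if is_collector:
        for name in local_shard:
            # Concatenate the slices in rank order to rebuild the full state tensor.
            collected[name] = torch.cat([shard[name] for shard in gathered])
    return collected
```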
Sharding
The sharding feature demands that the checkpoint of optimizer states be distributed across different files (usually of limited size) under the same folder. As mentioned in the third issue above, the checkpoint should be agnostic to the device configuration, so we shouldn't assign a separate checkpointing folder to each device.
Since Gemini has already implemented the sharding feature for model checkpointing, we can imitate its design, as sketched below.
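One way such a design could look for optimizer states: buffer the collected per-parameter states and flush them to a new file whenever a size limit is reached. The function, file naming, and size threshold below are illustrative assumptions, not ColossalAI's actual API:

```python
import os
import torch

def save_sharded_optimizer_states_sketch(id_to_states: dict, folder: str,
                                          max_shard_size: int = 1 << 30) -> None:
    """Sketch: write optimizer states into several size-limited files under one
    folder, independent of how many devices produced the states."""
    os.makedirs(folder, exist_ok=True)
    shard, shard_size, shard_index = {}, 0, 0
    for param_id, states in id_to_states.items():
        shard[param_id] = states
        shard_size += sum(t.numel() * t.element_size()
                          for t in states.values() if torch.is_tensor(t))
        if shard_size >= max_shard_size:
            torch.save(shard, os.path.join(folder, f"optim_shard_{shard_index}.bin"))
            shard, shard_size, shard_index = {}, 0, shard_index + 1
    if shard:  # flush whatever is left over
        torch.save(shard, os.path.join(folder, f"optim_shard_{shard_index}.bin"))
```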