Add auto variable sharding for all backbones/tasks #1689

Open
@mattdangerw

Description

We want model parallelism to be easy to use across the library. At a high level, a user should only need to express their hardware and (optionally) the desired model parallel vs. data parallel split for the device grid.

Currently, we have an auto layout helper for Gemma here, but it is not a scalable design. The correct layout map will depend on the config of the model, e.g. you need to shard a Gemma model with multi-head attention differently than one with multi-query attention.
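
For reference, a minimal sketch of how the existing helper is used today, assuming "here" refers to `GemmaBackbone.get_layout_map` and the Keras 3 distribution API:

```python
import keras
import keras_nlp

# 1 x 8 mesh: pure model parallelism across 8 accelerators.
device_mesh = keras.distribution.DeviceMesh(
    shape=(1, 8),
    axis_names=("batch", "model"),
    devices=keras.distribution.list_devices(),
)
# Gemma-only helper; the layout it returns is fixed for Gemma's architecture
# rather than derived from the loaded config.
layout_map = keras_nlp.models.GemmaBackbone.get_layout_map(device_mesh)
distribution = keras.distribution.ModelParallel(
    device_mesh=device_mesh, layout_map=layout_map, batch_dim_name="batch"
)
keras.distribution.set_distribution(distribution)
gemma_model = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
```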

I think there are two main directions we could go with the API.

  1. Write our own manual sharding recipe for a model given its config, and do this for all models (most will share the same recipe, especially our transformer models); see the sketch after this list.
  2. Use some form of autosharding functionality in JAX, or add an autosharding API to Keras. In this case, we would not need to write per-model sharding recipes ourselves.
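
A rough sketch of what a per-model recipe for 1) could look like. The helper name `build_layout_map`, the config keys, and the exact regexes/layouts are hypothetical; the point is that the recipe has to branch on the config (e.g. MHA vs. MQA):

```python
import keras

def build_layout_map(device_mesh, config):
    """Hypothetical per-backbone sharding recipe keyed off the model config."""
    layout_map = keras.distribution.LayoutMap(device_mesh)
    # Shard the embedding table over the vocabulary dimension.
    layout_map["token_embedding/embeddings"] = ("model", None)
    if config["num_key_value_heads"] == config["num_query_heads"]:
        # Multi-head attention: every projection has a full set of heads,
        # so query, key, and value kernels can all be sharded over heads.
        layout_map[".*attention.*(query|key|value).*kernel"] = ("model", None, None)
    else:
        # Multi-query / grouped-query attention: only a few key/value heads
        # exist, so replicate them and shard only the query projection.
        layout_map[".*attention.*query.*kernel"] = ("model", None, None)
    layout_map[".*attention_output.*kernel"] = ("model", None, None)
    layout_map[".*ffw_gating.*kernel"] = (None, "model")
    layout_map[".*ffw_linear.*kernel"] = ("model", None)
    return layout_map
```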

One potential high-level API would be to take in a device mesh directly when constructing the model. For both 1) and 2), we could support an API something like this...

```python
import keras_nlp
from keras.distribution import DeviceMesh, list_devices

devices = list_devices()  # e.g. 8 accelerators
device_mesh = DeviceMesh(shape=(2, 4), axis_names=('batch', 'model'), devices=devices)
gemma_model = keras_nlp.models.GemmaCausalLM.from_preset(
    "gemma_2b_en",
    device_mesh=device_mesh,  # proposed new argument
)
```

For 1), we would need to enter a LayoutMap scope after loading the config for a model but before loading the weights. For 2), it would depend on the details of the autosharding API we use.
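
A rough sketch of how that flow could look inside `from_preset`; `load_preset_config`, `build_layout_map`, and `load_preset_weights` are hypothetical helpers standing in for the real preset-loading internals:

```python
import keras

def from_preset_with_mesh(cls, preset, device_mesh=None, **kwargs):
    # Read the config first, so the sharding recipe can depend on it
    # (e.g. MHA vs. MQA), before any variables are created.
    config = load_preset_config(preset)  # hypothetical helper
    if device_mesh is None:
        model = cls.from_config(config, **kwargs)
    else:
        layout_map = build_layout_map(device_mesh, config)  # recipe sketched above
        distribution = keras.distribution.ModelParallel(
            device_mesh=device_mesh, layout_map=layout_map, batch_dim_name="batch"
        )
        # Variables created inside the scope pick up the sharded layouts.
        with distribution.scope():
            model = cls.from_config(config, **kwargs)
    # Weights are loaded into already-sharded variables, shard by shard.
    load_preset_weights(model, preset)  # hypothetical helper
    return model
```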

Labels

Gemma (Gemma model specific issues), team-created (Issues created by Keras Hub team as part of development roadmap), type:feature (New feature or request)
