Is the transformers support for the Mistral model inconsistent or incomplete compared to the official implementation? #29533
Description
Feature request
Hello, thank you very much for creating the transformers library, which provides a very convenient way to use many different models.
Recently, while running inference with the Mistral model through the transformers library (4.38.2), I noticed that the execution does not seem to be fully consistent with the official Mistral paper and the reference code they released.
Specifically, first, Mistral uses a pre-fill and chunking mechanism when encoding the prompt (as illustrated in the corresponding figure of the Mistral paper). Is this method included in the transformers implementation of Mistral's generate function? A sketch of what I mean is below.
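For concreteness, here is a minimal sketch of how I understand pre-fill and chunking from the paper: the prompt is split into window-sized chunks that are fed through the model one at a time, reusing the cache between chunks. The checkpoint name, window size, and prompt are placeholders of my own, not values from either codebase:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical sketch of pre-fill with chunking: the prompt is processed
# in window-sized chunks, each chunk attending to the cached keys/values
# of the previous ones, instead of encoding the whole prompt in one pass.
model_name = "mistralai/Mistral-7B-v0.1"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

window = 4096  # assumed sliding-window size; check the model config
input_ids = tokenizer("some very long prompt ...", return_tensors="pt").input_ids

past_key_values = None
for start in range(0, input_ids.shape[1], window):
    chunk = input_ids[:, start : start + window]
    with torch.no_grad():
        out = model(chunk, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values  # reuse the cache for the next chunk
```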
Secondly, Mistral uses a rolling buffer cache so that keys and values are overwritten in place in the cache as the sliding window moves (also illustrated in a figure of the paper); a sketch of my understanding follows.
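My understanding of the rolling buffer, sketched below with plain tensors (the shapes and names are my own, not from either codebase): the key/value at sequence position i is written to slot i % window, so the cache never grows beyond the window size:

```python
import torch

# Hypothetical rolling buffer for one attention layer: a fixed-size
# tensor of `window` slots, written in place at position i % window.
window, n_heads, head_dim = 8, 4, 16
key_buffer = torch.zeros(1, n_heads, window, head_dim)

for i in range(20):  # 20 timesteps, more than the window size
    new_key = torch.randn(1, n_heads, head_dim)
    key_buffer[:, :, i % window] = new_key  # overwrite the oldest slot

# After the loop the buffer holds only the last `window` keys, so memory
# stays constant instead of growing with the sequence length.
```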
When I read the corresponding piece of the transformers source code, this mechanism is implemented using the DynamicCache update method, and that operation actually concatenates all the past keys and values with the current keys and values: https://github.com/huggingface/transformers/blob/b338a6c3b8eda29610d4d472cad8cd87cbfdaaed/src/transformers/cache_utils.py#L126C1-L132C105
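For readers who do not follow the link, the update logic at those lines is essentially the following (paraphrased from memory of transformers 4.38 as a standalone function; the exact code may differ slightly):

```python
import torch

# Paraphrase of the DynamicCache.update logic linked above, trimmed to
# the two cases relevant here.
def dynamic_cache_update(key_cache, value_cache, key_states, value_states, layer_idx):
    if len(key_cache) <= layer_idx:
        # First forward pass for this layer: start the cache.
        key_cache.append(key_states)
        value_cache.append(value_states)
    else:
        # Every later step: concatenate along the sequence axis, so the
        # cache grows without bound instead of rolling over in place.
        key_cache[layer_idx] = torch.cat([key_cache[layer_idx], key_states], dim=-2)
        value_cache[layer_idx] = torch.cat([value_cache[layer_idx], value_states], dim=-2)
    return key_cache[layer_idx], value_cache[layer_idx]
```

If this reading is right, the cache keeps every past key/value rather than keeping only the last `window` of them as the rolling buffer in the paper does.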
Am I misreading the logic of this code, or is the implementation not fully consistent with the Mistral paper? In addition, I read Mistral's official source code and found that they release two kinds of inference code, one of which is a simplified version. Is that the one your current implementation corresponds to?
Motivation
I want to know how to use the transformers library with the Mistral model so that inference behaves the same way as described in the Mistral paper.
Did I misread the code, or do I need to make some additional changes on top of transformers?
Your contribution
I can try to fix this piece of code in transformers to match the implementation described in the Mistral paper.