Visual Language Model Training Guide
- Status: Closed
- Price: $10
- Entries received: 1
- Winner: dvprasannavp
Contest Brief
Layout Overview
The image presents three main steps for training a language model, arranged side by side from left to right. Each step covers a distinct part of the training process and is visually segmented and labeled, containing text blocks, arrows, and icons.
Step 1: Collect Demonstration Data and Train a Supervised Policy
Header:
"Step 1" is written at the top.
The description is: "Collect demonstration data and train a supervised policy."
Content Block:
A prompt is sampled from the prompt dataset.
Icon Block: There is a green rectangular box with the text "Explain reinforcement learning to a 6-year-old." This represents a sample prompt.
The prompt leads down to another block with the following text: "A labeler demonstrates the desired output behavior."
Arrow and Explanation:
There is an arrow pointing down from the labeler box.
Final Output: The text states: "This data is used to fine-tune GPT-3.5 with supervised learning." (A minimal code sketch of this step appears below.)
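The image only names this step; as an illustration, the following is a minimal PyTorch sketch of one supervised fine-tuning update on a (prompt, labeler demonstration) pair. The names `model`, `optimizer`, `prompt_ids`, and `demo_ids` are assumptions made for the sketch, not elements of the image; `model` is assumed to be a causal language model mapping token IDs to logits.

```python
import torch
import torch.nn.functional as F

def sft_step(model, optimizer, prompt_ids, demo_ids):
    """One supervised update on a (prompt, labeler demonstration) pair."""
    # Concatenate prompt and demonstration into a single token sequence.
    input_ids = torch.cat([prompt_ids, demo_ids], dim=-1)  # (1, seq_len)
    logits = model(input_ids)                              # (1, seq_len, vocab)

    # Next-token prediction: shift logits and labels by one position.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Mask out the prompt tokens so only the demonstration is learned.
    shift_labels[:, : prompt_ids.size(-1) - 1] = -100

    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice this update would run over many (prompt, demonstration) pairs sampled from the demonstration dataset.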
Step 2: Collect Comparison Data and Train a Reward Model
Header:
"Step 2" is written at the top.
The description is: "Collect comparison data and train a reward model."
Content Block:
A prompt is sampled, along with several model outputs.
Icon Block: There is a green box, again containing the sample prompt "Explain reinforcement learning to a 6-year-old."
Below this, several model outputs are shown in a block.
The labeler then ranks these outputs from best to worst.
Arrow and Explanation:
An arrow points downward from the rank block.
Final Output: The text states: "This data is used to train our reward model." (A sketch of the ranking loss typically used here appears below.)
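The image does not show how the rankings become a training signal; a common choice, of the kind described in OpenAI's InstructGPT work, is a pairwise ranking loss over every (better, worse) pair in the ranking. The sketch below assumes a hypothetical `reward_model` that maps a token sequence to a scalar score; all names are illustrative.

```python
import itertools
import torch
import torch.nn.functional as F

def ranking_loss(reward_model, ranked_outputs):
    """Pairwise ranking loss over outputs ordered best to worst.

    Each (better, worse) pair contributes -log sigmoid(r_b - r_w),
    pushing the reward model to score preferred outputs higher.
    """
    losses = []
    for better, worse in itertools.combinations(ranked_outputs, 2):
        r_better = reward_model(better)  # scalar score for the better output
        r_worse = reward_model(worse)    # scalar score for the worse output
        losses.append(-F.logsigmoid(r_better - r_worse))
    return torch.stack(losses).mean()
```

Because `ranked_outputs` is ordered best to worst, `itertools.combinations` yields each pair with the preferred output first, so no extra bookkeeping is needed.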
Step 3: Optimize a Policy Against the Reward Model Using the PPO Reinforcement Learning Algorithm
Header:
"Step 3" is written at the top.
The description is: "Optimize a policy against the reward model using the PPO reinforcement learning algorithm."
Content Block:
A new prompt is sampled from the dataset.
Icon Block: A green rectangular box has the prompt: "Write a story about otters."
Below, there is an arrow pointing to a series of steps:
The PPO model is initialized from the supervised policy.
Policy generates an output.
Reward Model calculates a reward for the output.
Loop Structure:
An arrow loop visually indicates an iterative update process:
"The reward is used to update the policy using PPO."
Summary
Steps 1, 2, and 3 together describe the sequential process of training the language model: Step 1 focuses on supervised learning from demonstration data, Step 2 on training a reward model from rankings, and Step 3 on optimizing the policy with reinforcement learning.
The steps are divided visually into three vertical segments with arrows guiding the sequence of actions. The green boxes provide specific prompts to illustrate examples at different phases of training.