ok i tried bringing back original init again and this time it makes a ton of difference and works much better than default. i'm not sure what was different with my earlier experiment where i saw a slight regression. may try to dissect commits later, for now merged the original mingpt init (following gpt-2 paper) as default.
isbee · Jan 27, 2023 · f29a9ff
1 parent 23a0bfa
Showing 1 changed file with 0 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -157,7 +157,6 @@ Features / APIs
 
 Suspiciousness
 
-- Current initialization (PyTorch default) departs from GPT-2. In a very quick experiment I found it to be superior to the one suggested in the papers, but that can't be right?
 - I am still not 100% confident that my GPT-2 small reproduction hyperparameters are good, if someone has reproduced GPT-2 I'd be eager to exchange notes ty
 - I keep seeing different values cited for weight decay and AdamW betas, look into
 - I can't exactly reproduce Chinchilla paper results, see [scaling_laws.ipynb](scaling_laws.ipynb) notebook
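
For context on the init the commit message refers to: the diff shown here only touches the README, so the following is a minimal sketch of the GPT-2-style scheme as minGPT is generally understood to implement it, not the code from this commit. It assumes weights are drawn from a zero-mean normal with std 0.02, and that the output projections feeding the residual stream get a std scaled down by `1/sqrt(2 * n_layer)` (two residual additions per block). The helper `init_std` and the parameter names in the demo are hypothetical, chosen for illustration only.

```python
import math


def init_std(param_name: str, n_layer: int, base_std: float = 0.02) -> float:
    """Return the std of the zero-mean normal used to initialize a weight.

    Assumption: residual output projections are identified by the suffix
    "c_proj.weight" (a naming convention assumed here, not taken from
    the diff above).
    """
    if param_name.endswith("c_proj.weight"):
        # Each block adds to the residual stream twice (attention and MLP),
        # so the scale factor uses 2 * n_layer residual additions in total.
        return base_std / math.sqrt(2 * n_layer)
    return base_std


if __name__ == "__main__":
    # Illustrative parameter names for a 12-layer model.
    for name in ("transformer.wte.weight", "transformer.h.0.attn.c_proj.weight"):
        print(name, init_std(name, n_layer=12))
```

In a PyTorch model, the returned value could be passed as `std` to `torch.nn.init.normal_(param, mean=0.0, std=...)` while iterating over `model.named_parameters()`. PyTorch's default `nn.Linear` init instead draws from a uniform distribution whose range depends on fan-in, which is the "default" the removed README line was comparing against.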