Dataset: fineweb_sample_10B_np_bin
Expt. 1: Baseline
The goal of this experiment was to establish a baseline against which to compare the performance of subsequent experiments. It was run on an L4 GPU on lightning.ai.
Highlights
- ~35.6M-parameter model.
- Trained on ~1B tokens from the FineWeb dataset.
- Nothing fancy: little to no modification was made to the original nanoGPT code.
NB 1: Code repo
NB 2: Wandb run
Config
- Model Architecture:
  - n_embd: 512
  - n_head: 8
  - n_layer: 6
  - block_size: 1024
  - bias: false
  - dropout: 0 (no dropout)
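
For a rough sanity check, the parameter count can be estimated directly from the architecture above. This is only a sketch: vocab_size is not part of the listed config, so the padded GPT-2 value nanoGPT defaults to (50,304) is an assumption here, and the headline ~35.6M figure suggests a smaller tokenizer vocabulary and/or a count that excludes some embedding parameters.

```python
# Back-of-the-envelope parameter count for the architecture above.
# vocab_size is an assumption (nanoGPT's padded GPT-2 default); the exact
# reported figure depends on the tokenizer and on which embeddings are counted.
def estimate_params(n_layer=6, n_embd=512, block_size=1024, vocab_size=50304):
    per_block = 4 * n_embd**2        # attention: fused qkv projection + output projection
    per_block += 8 * n_embd**2       # MLP: two linear layers with a 4x hidden expansion
    per_block += 2 * n_embd          # two LayerNorm weight vectors (bias=false)
    transformer = n_layer * per_block + n_embd   # blocks + final LayerNorm
    wte = vocab_size * n_embd        # token embeddings (weight-tied with the LM head)
    wpe = block_size * n_embd        # learned position embeddings
    return transformer + wte + wpe

print(f"{estimate_params() / 1e6:.1f}M parameters")
```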
- Optimizer:
  - learning_rate: 0.0006
  - min_lr: 0.00006
  - beta1: 0.9
  - beta2: 0.95
  - weight_decay: 0.1
  - grad_clip: 1
  - gradient_accumulation_steps: 16
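
A minimal sketch of how these settings map onto PyTorch's AdamW and gradient clipping (nanoGPT's configure_optimizers additionally restricts weight decay to 2-D tensors, and each micro-batch loss is scaled by 1/gradient_accumulation_steps before backward; a tiny stand-in module is used here instead of the GPT model):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)   # stand-in module; in the real run this is the GPT model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,                  # learning_rate
    betas=(0.9, 0.95),        # beta1, beta2
    weight_decay=0.1,
)

# Gradients are accumulated over 16 micro-batches, then the global grad norm
# is clipped at grad_clip=1.0 before the optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```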
- Learning Rate Schedule:
  - decay_lr: true
  - warmup_iters: 100
  - lr_decay_iters: 2000
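
Concretely, this is the warmup-then-cosine rule nanoGPT uses: a linear warmup to learning_rate over the first 100 iterations, then a cosine decay down to min_lr by iteration 2000. A sketch with the values above:

```python
import math

# Cosine decay with linear warmup, mirroring nanoGPT's get_lr().
learning_rate, min_lr = 6e-4, 6e-5
warmup_iters, lr_decay_iters = 100, 2000

def get_lr(it: int) -> float:
    if it < warmup_iters:                      # 1) linear warmup
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:                    # 2) past the decay horizon, hold at min_lr
        return min_lr
    # 3) cosine decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)
```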
- Training Duration:
  - max_iters: 2000
  - batch_size: 32
  - eval_interval: 40
  - eval_iters: 50
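
These numbers are where the ~1B-token figure in the highlights comes from: with gradient accumulation, each iteration processes batch_size × gradient_accumulation_steps × block_size tokens.

```python
# Total tokens seen during training, from the config above.
batch_size = 32
gradient_accumulation_steps = 16
block_size = 1024
max_iters = 2000

tokens_per_iter = batch_size * gradient_accumulation_steps * block_size
total_tokens = tokens_per_iter * max_iters
print(f"{tokens_per_iter:,} tokens/iter, {total_tokens / 1e9:.2f}B tokens total")
# 524,288 tokens/iter, 1.05B tokens total
```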
- Misc:
  - init_from: "scratch"
  - always_save_checkpoint: true
  - out_dir: "out"
  - device: "cuda"
  - dtype: "bfloat16"
  - backend: "nccl"
  - compile: true
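
Of these, dtype and compile are the settings that materially change the training loop: the forward pass runs under a bfloat16 autocast context and the model is wrapped with torch.compile. The sketch below shows the standard PyTorch pattern with a small stand-in module rather than the GPT model; backend: "nccl" only matters for multi-GPU DDP, so it is effectively unused on a single L4.

```python
import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(512, 512).to(device)   # stand-in for the GPT model
model = torch.compile(model)             # compile: true (PyTorch 2.x)

x = torch.randn(8, 512, device=device)
with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)                         # dtype: "bfloat16" via autocast
print(y.dtype)                           # torch.bfloat16
```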
- Logging:
  - wandb_log: true
  - wandb_project: "gippity"
  - wandb_run_name: "gippity-chinchilla-baseline"
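
With wandb_log enabled, nanoGPT initialises a Weights & Biases run roughly as below and then logs train/val loss and the learning rate every eval_interval iterations. The config dict shown is a hypothetical subset of the hyperparameters listed above.

```python
import wandb

config = {
    "n_layer": 6, "n_head": 8, "n_embd": 512,
    "block_size": 1024, "batch_size": 32,
    "learning_rate": 6e-4, "max_iters": 2000,
}   # subset of the full config listed above

wandb.init(
    project="gippity",                      # wandb_project
    name="gippity-chinchilla-baseline",     # wandb_run_name
    config=config,
)
```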