per_device_train_batch_size
What it does:
Number of samples processed at once on each GPU/device.
Why it matters:
Larger batches use more GPU memory but give higher throughput. Smaller batches save memory, but their noisier gradient estimates can make training less stable.
Example:
per_device_train_batch_size=2 → Each GPU processes 2 samples per forward/backward pass.
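A minimal sketch of how this might be set, assuming the Hugging Face TrainingArguments API; the output directory and the 4-GPU count are hypothetical. With data parallelism, the per-step batch scales with the number of devices:

```python
from transformers import TrainingArguments

# Illustrative values only; tune per_device_train_batch_size to your GPU memory budget.
args = TrainingArguments(
    output_dir="out",                # hypothetical output directory
    per_device_train_batch_size=2,   # 2 samples per GPU per forward/backward pass
)

num_gpus = 4  # assumed for illustration
print("Samples per step across devices:", args.per_device_train_batch_size * num_gpus)  # 8
```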
gradient_accumulation_steps
What it does:
The number of batches to process before updating the model weights; gradients are accumulated (summed) across those steps.
Why it matters:
Lets you mimic a larger batch size without running out of GPU memory.
Example:
gradient_accumulation_steps=4 with per_device_train_batch_size=2 → Effective batch size = 2 × 4 = 8 per device.
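To make the mechanics concrete, here is a toy plain-PyTorch sketch of gradient accumulation (the model, data, and loss are placeholders, not the Trainer's internals): gradients pile up in each parameter's .grad across micro-batches, and the optimizer only steps every accum_steps batches.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 1)              # toy stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
accum_steps = 4                             # mirrors gradient_accumulation_steps=4

batches = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(8)]  # micro-batches of 2
for step, (x, y) in enumerate(batches):
    loss = F.mse_loss(model(x), y)
    (loss / accum_steps).backward()         # scale so the summed gradients average correctly
    if (step + 1) % accum_steps == 0:       # weight update only every 4 micro-batches
        optimizer.step()
        optimizer.zero_grad()
```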
num_train_epochs
What it does:
Total number of times the model sees the entire training dataset.
Why it matters:
Too few → model underfits. Too many → overfits.
Example:
num_train_epochs=3 → Train for 3 full passes through the data.
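Epochs translate into optimizer steps once the effective batch size is known; a back-of-the-envelope sketch, assuming a hypothetical dataset of 10,000 samples and the values from the examples above:

```python
import math

dataset_size = 10_000   # assumed for illustration
per_device_batch = 2
accum_steps = 4
num_gpus = 1
num_epochs = 3

effective_batch = per_device_batch * accum_steps * num_gpus  # 8
steps_per_epoch = math.ceil(dataset_size / effective_batch)  # 1250
total_steps = steps_per_epoch * num_epochs                   # 3750
print(f"{total_steps} optimizer steps over {num_epochs} epochs")
```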
learning_rate
What it does:
Step size for weight updates: how far the optimizer moves the weights along the gradient at each step.
Why it matters:
Too high → unstable training. Too low → slow progress.
Example:
learning_rate=2e-5 → Common starting point for fine-tuning LLMs.
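The "step size" framing shows up directly in the plain gradient-descent update rule; a toy single-weight sketch (the weight and loss are contrived for illustration):

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
lr = 2e-5                                # the learning rate

loss = (3.0 * w - 1.0).pow(2).sum()      # toy loss; gradient at w=1.0 is 12.0
loss.backward()
with torch.no_grad():
    w -= lr * w.grad                     # update = learning_rate × gradient
print(w)                                 # 0.99976: the weight barely moves at lr=2e-5
```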
max_grad_norm
What it does:
Clips the gradient norm to a maximum value, preventing exploding gradients (sudden, very large updates).
Why it matters:
Stabilizes training, especially for deep networks.
Example:
max_grad_norm=1.0 → Gradients whose global L2 norm exceeds 1.0 are scaled down to norm 1.0 (the Trainer's default).
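A minimal sketch of the effect, using PyTorch's standard torch.nn.utils.clip_grad_norm_ utility (max_grad_norm corresponds to its max_norm argument); the inflated loss is contrived to produce huge gradients:

```python
import torch

model = torch.nn.Linear(10, 1)
loss = model(torch.randn(4, 10)).pow(2).sum() * 1e6  # contrived to blow up gradients
loss.backward()

# Rescale all gradients so their combined L2 norm is at most 1.0.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"norm before clipping: {total_norm.item():.1f}")  # returns the pre-clip norm
```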