per_device_train_batch_size
What it does:
Number of samples processed at once on each GPU/device.
Why it matters:
Larger batches use more GPU memory but give higher throughput. Smaller batches save memory, but their noisier gradient estimates can make training less stable.
Example:
per_device_train_batch_size=2 → Each GPU processes 2 samples per forward/backward pass.
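A minimal sketch of how this might be set, assuming the Hugging Face TrainingArguments API; the output directory and the 4-GPU count are hypothetical. With data parallelism, the per-step batch scales with the number of devices:

```python
from transformers import TrainingArguments

# Illustrative values only; tune per_device_train_batch_size to your GPU memory budget.
args = TrainingArguments(
    output_dir="out",                # hypothetical output directory
    per_device_train_batch_size=2,   # 2 samples per GPU per forward/backward pass
)

num_gpus = 4  # assumed for illustration
print("Samples per step across devices:", args.per_device_train_batch_size * num_gpus)  # 8
```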
gradient_accumulation_steps
What it does:
The number of batches to process before updating the model weights; gradients are accumulated (summed) across those steps.
Why it matters:
Lets you mimic a larger batch size without running out of GPU memory.
Example:
gradient_accumulation_steps=4 with per_device_train_batch_size=2 → Effective batch size = 2 × 4 = 8 per device.
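To make the mechanics concrete, here is a toy plain-PyTorch sketch of gradient accumulation (the model, data, and loss are placeholders, not the Trainer's internals): gradients pile up in each parameter's .grad across micro-batches, and the optimizer only steps every accum_steps batches.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 1)              # toy stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
accum_steps = 4                             # mirrors gradient_accumulation_steps=4

batches = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(8)]  # micro-batches of 2
for step, (x, y) in enumerate(batches):
    loss = F.mse_loss(model(x), y)
    (loss / accum_steps).backward()         # scale so the summed gradients average correctly
    if (step + 1) % accum_steps == 0:       # weight update only every 4 micro-batches
        optimizer.step()
        optimizer.zero_grad()
```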
num_train_epochs
What it does:
Total number of times the model sees the entire training dataset.
Why it matters:
Too few → model underfits. Too many → overfits.
Example:
num_train_epochs=3 → Train for 3 full passes through the data.
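Epochs translate into optimizer steps once the effective batch size is known; a back-of-the-envelope sketch, assuming a hypothetical dataset of 10,000 samples and the values from the examples above:

```python
import math

dataset_size = 10_000   # assumed for illustration
per_device_batch = 2
accum_steps = 4
num_gpus = 1
num_epochs = 3

effective_batch = per_device_batch * accum_steps * num_gpus  # 8
steps_per_epoch = math.ceil(dataset_size / effective_batch)  # 1250
total_steps = steps_per_epoch * num_epochs                   # 3750
print(f"{total_steps} optimizer steps over {num_epochs} epochs")
```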
learning_rate
What it does:
Step size for weight updates: how far the optimizer moves the weights along the gradient at each step.
Why it matters:
Too high → unstable training. Too low → slow progress.
Example:
learning_rate=2e-5 → Common starting point for fine-tuning LLMs.
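The "step size" framing shows up directly in the plain gradient-descent update rule; a toy single-weight sketch (the weight and loss are contrived for illustration):

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
lr = 2e-5                                # the learning rate

loss = (3.0 * w - 1.0).pow(2).sum()      # toy loss; gradient at w=1.0 is 12.0
loss.backward()
with torch.no_grad():
    w -= lr * w.grad                     # update = learning_rate × gradient
print(w)                                 # 0.99976: the weight barely moves at lr=2e-5
```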
max_grad_norm
What it does:
Clips the gradient norm to a maximum value, preventing exploding gradients (sudden, very large updates).
Why it matters:
Stabilizes training, especially for deep networks.
Example:
max_grad_norm=1.0 → Gradients whose global L2 norm exceeds 1.0 are scaled down to norm 1.0 (the Trainer's default).
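A minimal sketch of the effect, using PyTorch's standard torch.nn.utils.clip_grad_norm_ utility (max_grad_norm corresponds to its max_norm argument); the inflated loss is contrived to produce huge gradients:

```python
import torch

model = torch.nn.Linear(10, 1)
loss = model(torch.randn(4, 10)).pow(2).sum() * 1e6  # contrived to blow up gradients
loss.backward()

# Rescale all gradients so their combined L2 norm is at most 1.0.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"norm before clipping: {total_norm.item():.1f}")  # returns the pre-clip norm
```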