Results for "training loss"
Separating data into training (fit), validation (tune), and test (final estimate) to avoid leakage and optimism bias.
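A minimal pure-Python sketch of such a split (the function name, fractions, and seed are illustrative, not from the source):

```python
import random

def split_dataset(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve out validation and test partitions.

    Splitting before any fitting or tuning prevents leakage: the test
    set is touched only once, for the final performance estimate.
    """
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
```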
Activation max(0, x); improves gradient flow and training speed in deep nets.
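A one-line implementation of this activation and its subgradient (pure Python for illustration):

```python
def relu(x):
    """ReLU activation: max(0, x) elementwise."""
    return [max(0.0, v) for v in x]

def relu_grad(x):
    """Subgradient: 1 where x > 0, else 0. The gradient passes through
    active units unattenuated, which helps deep nets train faster."""
    return [1.0 if v > 0 else 0.0 for v in x]
```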
Gradients grow too large, causing divergence; mitigated by clipping, normalization, careful init.
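Clipping by global norm, the most common mitigation, can be sketched as follows (function name and threshold are illustrative):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm is at most max_norm,
    preserving their direction while bounding the update size."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]
```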
Methods to set starting weights to preserve signal/gradient scales across layers.
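One well-known scheme of this kind is He (Kaiming) initialization for ReLU layers; a pure-Python sketch (the helper name and shapes are illustrative):

```python
import math
import random

def he_init(fan_in, fan_out, seed=0):
    """He initialization: Gaussian weights with std sqrt(2 / fan_in),
    chosen so activation variance stays roughly constant across
    ReLU layers."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

W = he_init(256, 128)
flat = [w for row in W for w in row]
mean = sum(flat) / len(flat)
var = sum((w - mean) ** 2 for w in flat) / len(flat)
```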
Randomly zeroing activations during training to reduce co-adaptation and overfitting.
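The standard "inverted" variant scales the surviving activations at training time so no rescaling is needed at inference; a minimal sketch:

```python
import random

def dropout(x, p, training=True, seed=0):
    """Inverted dropout: zero each activation with probability p and
    scale survivors by 1/(1-p), keeping the expected value unchanged."""
    if not training or p == 0.0:
        return x[:]
    rng = random.Random(seed)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in x]

out = dropout([1.0] * 1000, 0.5, seed=1)
zeros = sum(1 for v in out if v == 0.0)
```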
Expanding training data via transformations (flips, noise, paraphrases) to improve robustness.
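The simplest image-style transformation, a horizontal flip, illustrates the idea (the image is represented as rows of pixel values for the sketch):

```python
def hflip(image):
    """Horizontal flip of an image stored as a list of pixel rows.
    The label is unchanged, so one example yields two training points."""
    return [row[::-1] for row in image]
```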
Hardware resources used for training/inference; constrained by memory bandwidth, FLOPs, and parallelism.
Maliciously inserting or altering training data to implant backdoors or degrade performance.
Attacks that infer whether specific records were in training data, or reconstruct sensitive training examples.
Error due to sensitivity to fluctuations in the training dataset.
Adjusting learning rate over training to improve convergence.
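A common concrete schedule is linear warmup followed by cosine decay; a sketch (function name and defaults are illustrative):

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if warmup_steps and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```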
Models that learn to generate samples resembling training data.
Improvements in model performance obtained by training on more data, often following empirical power-law scaling curves.
Scaling laws that, for a fixed compute budget, balance model size against training-data size to minimize loss (compute-optimal scaling).
Adaptive optimizers such as Adam that adjust per-parameter learning rates using running estimates of gradient moments.
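A single Adam update step, sketched in pure Python (state layout and names are illustrative, not a reference implementation):

```python
import math

def adam_step(params, grads, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: per-parameter step sizes from bias-corrected
    running estimates of the gradient mean (m) and uncentred
    second moment (v)."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = b1 * state["m"][i] + (1 - b1) * g
        state["v"][i] = b2 * state["v"][i] + (1 - b2) * g * g
        m_hat = state["m"][i] / (1 - b1 ** t)
        v_hat = state["v"][i] / (1 - b2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params

state = {"t": 0, "m": [0.0], "v": [0.0]}
params = adam_step([1.0], [0.5], state, lr=0.1)
```

On the first step the bias correction makes the update size approximately `lr` regardless of the raw gradient magnitude, one reason Adam is robust to gradient scale.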
Mismatch between the data distribution seen during training and the one encountered at test or deployment time, degrading performance.
Randomizing simulation parameters to improve real-world transfer.
Learning a function from input-output pairs (labeled data), optimizing performance on predicting outputs for unseen inputs.
Learning where data arrives sequentially and the model updates continuously, often under changing distributions.
Reusing knowledge from a source task/domain to improve learning on a target task/domain, typically via pretrained models.
Automatically learning useful internal features (latent variables) that capture salient structure for downstream tasks.
The degree to which predicted probabilities match observed frequencies (e.g., predictions made with confidence 0.8 are correct ~80% of the time).
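A standard way to measure this is expected calibration error (ECE); a binned pure-Python sketch:

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then average the gap between
    mean confidence and empirical accuracy, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece = 0.0
    n = len(probs)
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(conf - acc)
    return ece
```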
A parameterized function composed of interconnected units organized in layers with nonlinear activations.
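A forward pass through such a function can be sketched in a few lines; here each layer is a (weights, biases) pair with a tanh nonlinearity between layers (layout and names are illustrative):

```python
import math

def mlp_forward(x, layers):
    """Forward pass of a small MLP. Each layer is (W, b) with W stored
    as rows; tanh is applied between layers, the output is linear."""
    h = x
    for i, (W, b) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, h)) + bias
             for row, bias in zip(W, b)]
        h = z if i == len(layers) - 1 else [math.tanh(v) for v in z]
    return h

# 2 -> 2 -> 1 network with identity first layer and a bias on the output.
layers = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
          ([[1.0, 1.0]], [0.5])]
out = mlp_forward([0.0, 0.0], layers)
```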
Maximum number of tokens the model can attend to in one forward pass; constrains long-document reasoning.
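A common practical consequence is that long inputs must be truncated to fit; a sketch that keeps the most recent tokens while reserving room for generation (names are illustrative):

```python
def fit_to_context(tokens, max_context, reserve_for_output=0):
    """Keep the most recent tokens that fit within the context window,
    reserving `reserve_for_output` slots for generated tokens."""
    budget = max_context - reserve_for_output
    return tokens[-budget:] if len(tokens) > budget else tokens
```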
Updating a pretrained model’s weights on task-specific data to improve performance or adapt style/behavior.
Local surrogate explanation method approximating model behavior near a specific input.
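A LIME-flavoured sketch of the idea, simplified to independent per-feature perturbations with a least-squares slope estimate (the real method samples jointly and fits a weighted linear model; all names here are illustrative):

```python
import random

def local_surrogate_weights(f, x, n_samples=200, scale=0.1, seed=0):
    """Perturb the input around x, query the black-box model f, and fit
    a linear surrogate: each weight is the least-squares slope of
    f's response to perturbations of that feature alone."""
    rng = random.Random(seed)
    f0 = f(x)
    weights = []
    for j in range(len(x)):
        num = den = 0.0
        for _ in range(n_samples):
            delta = rng.gauss(0.0, scale)
            xp = x[:]
            xp[j] += delta
            num += delta * (f(xp) - f0)
            den += delta * delta
        weights.append(num / den)
    return weights

# Hypothetical black box; for a linear model the surrogate is exact.
f = lambda v: 3.0 * v[0] - 2.0 * v[1] + 1.0
w = local_surrogate_weights(f, [1.0, 2.0])
```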
A formal privacy framework ensuring outputs do not reveal much about any single individual’s data contribution.
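The canonical mechanism in this framework adds Laplace noise scaled to the query's sensitivity; a sketch (names and the averaging demo are illustrative):

```python
import random

def laplace_mechanism(true_value, sensitivity, epsilon, seed=None):
    """Laplace mechanism: adding Laplace(0, sensitivity/epsilon) noise
    to a numeric query (e.g., a count) gives epsilon-differential
    privacy for that query."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    # The difference of two Exp(1) draws is Laplace(0, 1); scale it.
    noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
    return true_value + noise

# Noise is zero-mean, so repeated noisy releases average near the truth.
samples = [laplace_mechanism(100.0, 1.0, 1.0, seed=i) for i in range(2000)]
mean = sum(samples) / len(samples)
```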
Optimization problems where any local minimum is global.
Removing weights or neurons to shrink models and improve efficiency; can be structured or unstructured.
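The simplest unstructured variant removes the smallest-magnitude weights; a sketch (ties at the threshold may prune slightly more than the requested fraction):

```python
def magnitude_prune(weights, sparsity):
    """Unstructured magnitude pruning: zero the fraction `sparsity` of
    weights with the smallest absolute values."""
    flat = sorted(abs(w) for w in weights)
    k = int(len(flat) * sparsity)
    threshold = flat[k - 1] if k > 0 else -1.0
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

Structured pruning instead removes whole neurons, channels, or attention heads, which maps better onto dense hardware kernels.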
Optimization with multiple local minima/saddle points; typical in neural networks.