Results for "per-parameter"
How many requests or tokens can be processed per unit time; affects scalability and cost.
Popular optimizer combining momentum with per-parameter adaptive step sizes via first- and second-moment estimates of the gradient.
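The update can be sketched in a few lines of NumPy; this is a minimal illustration rather than a production optimizer, the function name is hypothetical, and the default hyperparameters are the commonly cited ones.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single tensor; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment estimate (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter step: large second moment -> smaller effective step
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

Because the step is divided by the per-parameter root second moment, the very first update has magnitude close to `lr` regardless of the raw gradient scale.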
Using the same parameters across different parts of a model.
Methods, such as Adam, that adjust learning rates dynamically for each parameter.
The tradeoff between depth (many layers) and width (many neurons per layer).
The cost of running models in production.
Assigning labels per pixel (semantic) or per instance (instance segmentation) to map object boundaries.
Maximum system processing rate.
Measures how much information an observable random variable carries about unknown parameters.
Bayesian parameter estimation using the mode of the posterior distribution.
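For a Bernoulli rate with a Beta prior, the posterior mode has a closed form, so MAP estimation reduces to one line. This is an illustrative sketch with a hypothetical function name; the formula is the standard Beta posterior mode, valid for a, b > 1.

```python
def beta_bernoulli_map(successes, n, a=2.0, b=2.0):
    """MAP estimate of a Bernoulli rate under a Beta(a, b) prior.

    The posterior is Beta(a + successes, b + n - successes); for
    a, b > 1 its mode is the expression below.
    """
    return (successes + a - 1) / (n + a + b - 2)
```

With a = b = 2 the estimate is pulled toward 0.5 relative to the raw frequency; with a uniform prior (a = b = 1) it reduces to the maximum-likelihood estimate successes / n.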
The probability of the observed data given the parameters, viewed as a function of the parameters.
Belief before observing data.
Uses an exponential moving average of gradients to speed convergence and reduce oscillation.
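A minimal sketch of such an update, with the velocity kept as an exponential moving average of past gradients; the function name and defaults are illustrative.

```python
import numpy as np

def momentum_step(param, grad, velocity, lr=0.01, beta=0.9):
    """One momentum update; velocity is an EMA of past gradients."""
    velocity = beta * velocity + (1 - beta) * grad
    return param - lr * velocity, velocity
```

Gradient components that point the same way step after step accumulate in `velocity`, while components that flip sign largely cancel, which is what damps oscillation.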
Controls the size of parameter updates; too high diverges, too low trains slowly or gets stuck.
Techniques that fine-tune small additional components rather than all weights to reduce compute and storage.
The shape of the loss function over parameter space.
Estimating parameters by maximizing likelihood of observed data.
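As a tiny worked case: for a Gaussian with known variance, setting the derivative of the log-likelihood with respect to the mean to zero yields the sample mean. The helper name here is illustrative.

```python
import numpy as np

def gaussian_mle_mean(data):
    # d/dmu of the Gaussian log-likelihood vanishes at the sample mean,
    # so the MLE of the mean is simply the average of the data
    return float(np.mean(data))

mu_hat = gaussian_mle_mean([2.0, 3.0, 5.0, 10.0])  # 5.0
```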
A narrow minimum often associated with poorer generalization.
A wide basin often correlated with better generalization.
Updated belief after observing data.
A visualization of the optimization landscape.
Number of samples per gradient update; impacts compute efficiency, generalization, and stability.
Hardware resources used for training/inference; constrained by memory bandwidth, FLOPs, and parallelism.
Low-latency prediction per request.
Increasing model capacity, typically by scaling parameters and training compute together.
Halting training when validation performance stops improving to reduce overfitting.
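The usual implementation tracks the best validation loss seen so far and stops after a fixed "patience" window without improvement; a minimal sketch with a hypothetical function name:

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch index at which training stops, or None.

    Stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return None
```

In practice one also checkpoints the weights at the best epoch and restores them after stopping.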
A gradient method using random minibatches for efficient training on large datasets.
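The loop structure can be sketched on a toy problem: shuffle each epoch, slice random minibatches, and step against the batch gradient. Function name, learning rate, and epoch count are illustrative choices for this tiny example, not recommendations.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.2, batch_size=2, epochs=2000, seed=0):
    """Fit y ~ X @ w by minibatch SGD on mean squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)              # fresh random minibatches
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w - y[idx]
            w -= lr * X[idx].T @ err / len(idx)  # gradient on the batch only
    return w
```

Each step touches only `batch_size` rows, which is what makes the method cheap per update on large datasets.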
PEFT method injecting trainable low-rank matrices into layers, enabling efficient fine-tuning.
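The core idea can be sketched in a few lines: freeze the pretrained weight W and learn only a low-rank update B @ A. Names and shapes below are illustrative, and the (optional) scaling factor used in practice is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 2                  # layer shape and low rank (r << d)
W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero init

x = rng.normal(size=d_in)
y = (W + B @ A) @ x   # adapted forward pass; only A and B are trained
```

Zero-initializing B makes the adapter a no-op at the start of fine-tuning, and it adds only r * (d_in + d_out) trainable parameters instead of d_in * d_out.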
Systematic error introduced by simplifying assumptions in a learning algorithm.
A point where gradient is zero but is neither a max nor min; common in deep nets.
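The textbook example is f(x, y) = x² − y²: the gradient vanishes at the origin, yet the origin is a minimum along x and a maximum along y.

```python
def f(x, y):
    # Classic saddle: curves up along x, down along y
    return x ** 2 - y ** 2

def grad_f(x, y):
    # Gradient of f; vanishes at (0, 0) even though it is not an extremum
    return (2.0 * x, -2.0 * y)
```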