Sharp minimum: A narrow minimum often associated with poorer generalization.
Flat minimum: A wide basin often correlated with better generalization.
Hessian: Matrix of second derivatives describing the local curvature of the loss.
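The curvature picture behind these three entries can be made concrete: the eigenvalues of the Hessian at a minimum measure how sharp or flat each direction is. A minimal sketch, assuming PyTorch and a toy two-parameter quadratic loss chosen purely for illustration:

```python
import torch
from torch.autograd.functional import hessian

# Toy loss with one steep direction and one shallow direction (illustrative only).
def loss(w):
    return 50.0 * w[0] ** 2 + 0.1 * w[1] ** 2

w_star = torch.zeros(2)                 # the minimum of this toy loss
H = hessian(loss, w_star)               # 2x2 matrix of second derivatives
curvatures = torch.linalg.eigvalsh(H)   # eigenvalues: curvature along principal directions
print(curvatures)                       # small value = flat direction, large value = sharp direction
```

Large eigenvalues correspond to sharp directions; uniformly small ones correspond to the flat basin described above.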
Universal approximation theorem: Neural networks can approximate any continuous function under certain conditions.
Highway network: Early architecture using learned gates for skip connections.
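A minimal sketch of the gating idea behind highway-style layers, assuming PyTorch; the layer size and activations are placeholders rather than the original architecture's exact configuration:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Gated skip connection: output = T(x) * H(x) + (1 - T(x)) * x."""

    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # H: learned transformation of the input
        self.gate = nn.Linear(dim, dim)        # T: learned gate deciding how much to transform

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))        # gate values in (0, 1)
        return t * h + (1.0 - t) * x           # blend transformed output with the carried input
```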
Rotary positional embedding (RoPE): Encodes positional information via rotation in embedding space.
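One way the rotation can be applied, sketched in NumPy; the pairing of dimensions and the base frequency follow a common convention but vary across implementations:

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    seq_len, dim = x.shape
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    freqs = base ** (-np.arange(0, dim, 2) / dim)    # (dim // 2,), one frequency per pair
    angles = positions * freqs                       # (seq_len, dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x_even * cos - x_odd * sin    # 2-D rotation applied to each pair
    rotated[:, 1::2] = x_even * sin + x_odd * cos
    return rotated

queries = np.random.randn(6, 8)          # 6 positions, 8-dim vectors
rotated_queries = rotary_embed(queries)
```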
Router (mixture of experts): Chooses which experts process each token.
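A minimal sketch of top-k routing, assuming PyTorch; the routing matrix, number of experts, and value of k are illustrative placeholders:

```python
import torch
import torch.nn.functional as F

def route_tokens(token_embeddings, router_weights, k=2):
    """Score every expert for every token and keep the k best per token."""
    logits = token_embeddings @ router_weights        # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    gate_values, expert_ids = probs.topk(k, dim=-1)   # which experts process each token, and their weights
    return expert_ids, gate_values

tokens = torch.randn(4, 16)     # 4 tokens with 16-dim embeddings
router = torch.randn(16, 8)     # scoring matrix for 8 experts
expert_ids, gate_values = route_tokens(tokens, router)
```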
Scaling laws: Empirical laws linking model size, data, and compute to performance.
State space: All possible configurations an agent may encounter.
Policy: Strategy mapping states to actions.
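The state space and policy entries can be illustrated with a toy gridworld; the grid, action set, and hand-written rule below are hypothetical and only meant to show a policy as a mapping from states to actions:

```python
# State space: every cell of a 3x3 grid the agent may occupy.
states = [(row, col) for row in range(3) for col in range(3)]
actions = ["up", "down", "left", "right"]

def policy(state):
    """Deterministic policy: head right, then down, toward the corner (2, 2)."""
    row, col = state
    if col < 2:
        return "right"
    if row < 2:
        return "down"
    return "up"   # arbitrary choice once the corner is reached

# The policy assigns an action to every state in the state space.
print({state: policy(state) for state in states})
```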
Tool calling: Models trained to decide when to call tools.
Model watermarking: Embedding signals to prove model ownership.
Supply chain attacks: Compromising AI systems via libraries, models, or datasets.
Graph neural networks (GNNs): Neural networks that operate on graph-structured data by propagating information along edges.
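A minimal sketch of one propagation (message-passing) step on a toy graph, using NumPy; the adjacency matrix, feature size, and random weights are placeholders:

```python
import numpy as np

# Toy undirected graph with 4 nodes; adjacency[i, j] = 1 means nodes i and j share an edge.
adjacency = np.array([[0, 1, 0, 1],
                      [1, 0, 1, 0],
                      [0, 1, 0, 0],
                      [1, 0, 0, 0]], dtype=float)
features = np.random.randn(4, 8)      # one 8-dim feature vector per node
weights = np.random.randn(8, 8)       # learned transformation (random here)

# One round of propagation: each node sums its neighbours' transformed
# features, then applies a nonlinearity.
messages = adjacency @ (features @ weights)
updated_features = np.maximum(messages, 0.0)   # ReLU
```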
Probabilistic graphical model: Graph structure expressing the factorization of a probability distribution.
Semantic segmentation: Pixel-wise classification of image regions.
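A minimal sketch of the pixel-wise step, assuming NumPy and made-up per-pixel class scores in place of a real model's output:

```python
import numpy as np

# Hypothetical per-pixel class scores from a segmentation model:
# shape (num_classes, height, width).
class_scores = np.random.randn(3, 4, 5)

# Pixel-wise classification: every pixel is assigned the class with the highest score.
label_map = class_scores.argmax(axis=0)   # shape (4, 5), entries in {0, 1, 2}
print(label_map)
```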
Training pipeline: End-to-end process for model training.
ReAct: Interleaving reasoning and tool use.
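A hypothetical sketch of an interleaved reason-then-act loop; `model` and `tools` are stand-ins, not any specific library's API:

```python
def reason_and_act(question, model, tools, max_steps=5):
    """Alternate between a reasoning step and a tool call until the model answers.

    `model` and `tools` are hypothetical stand-ins: `model` returns a dict such as
    {"thought": ..., "tool": ..., "tool_input": ...} or {"thought": ..., "answer": ...}.
    """
    context = question
    for _ in range(max_steps):
        step = model(context)
        if "answer" in step:                                      # the model chose to stop and answer
            return step["answer"]
        observation = tools[step["tool"]](step["tool_input"])     # act: call the chosen tool
        context += f"\nThought: {step['thought']}\nObservation: {observation}"
    return None
```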
Chinchilla scaling law: Scaling law for choosing the compute-optimal balance between model size and training data.
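A rough sketch of the compute-optimal split in the spirit of that analysis; the 6·N·D FLOP estimate and the roughly 20-tokens-per-parameter ratio are approximate constants assumed here, not values stated in the entry:

```python
def compute_optimal_split(compute_budget_flops):
    """Split a training budget into parameters N and tokens D, assuming
    FLOPs ~ 6 * N * D and D ~ 20 * N (approximate Chinchilla-style constants)."""
    tokens_per_param = 20.0
    n_params = (compute_budget_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal_split(1e23)   # e.g. a 1e23 FLOP training budget
print(f"~{n:.1e} parameters, ~{d:.1e} tokens")
```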
Inference cost: Cost to run models in production.
Commoditization: Declining differentiation among models.
Eigenvector: Vector whose direction remains unchanged under a linear transformation.
Rank: Number of linearly independent rows or columns.
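Both ideas in a few lines of NumPy, using a small diagonal matrix purely for illustration:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
v = eigenvectors[:, 0]                          # an eigenvector of A
print(np.allclose(A @ v, eigenvalues[0] * v))   # True: A only rescales v, its direction is unchanged
print(np.linalg.matrix_rank(A))                 # 2: both rows/columns are linearly independent
```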
Sensitivity of a function to input perturbations.
Local minimum: Minimum relative to nearby points.
Likelihood: Probability of the data given the parameters.
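A small worked example of a likelihood, assuming a Gaussian model and NumPy; the data and parameter values are made up:

```python
import numpy as np

def gaussian_log_likelihood(data, mu, sigma):
    """Log-probability of the observed data given parameters (mu, sigma)."""
    n = len(data)
    return (-0.5 * n * np.log(2.0 * np.pi * sigma ** 2)
            - np.sum((data - mu) ** 2) / (2.0 * sigma ** 2))

data = np.array([1.9, 2.1, 2.0, 1.8, 2.2])
print(gaussian_log_likelihood(data, mu=2.0, sigma=0.2))   # good fit: higher log-likelihood
print(gaussian_log_likelihood(data, mu=0.0, sigma=0.2))   # poor fit: much lower
```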
Outer alignment: Correctly specifying goals.
Global minimum: The lowest loss attainable anywhere in the parameter space.
Alignment: Ensuring learned behavior matches the intended objective.
Adaptive optimizers: Methods like Adam that adjust learning rates dynamically.
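A minimal sketch of one adaptive update in the style of Adam, using NumPy; the hyperparameter values are common defaults shown only for illustration:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update: running averages of the gradient and its square
    give each parameter its own effective learning rate."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

param, m, v = np.ones(3), np.zeros(3), np.zeros(3)
grad = np.array([0.5, -0.2, 0.1])
param, m, v = adam_step(param, grad, m, v, t=1)
```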