Results for "tokens set"
Predicts masked tokens in a sequence, enabling bidirectional context; often used for embeddings rather than generation.
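A minimal sketch of the masking step behind this objective, assuming integer token ids; the `mask_id` value and the `-100` ignore-index convention are illustrative assumptions, not part of the entry above.

```python
import numpy as np

rng = np.random.default_rng(0)
token_ids = np.array([5, 17, 3, 42, 9, 11])       # toy token sequence
mask_id = 103                                      # stand-in id for a [MASK] token (assumption)
is_masked = rng.random(len(token_ids)) < 0.15      # mask roughly 15% of positions
inputs = np.where(is_masked, mask_id, token_ids)   # model sees [MASK] at the chosen positions
targets = np.where(is_masked, token_ids, -100)     # loss is computed only at masked positions
```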
Samples from the smallest set of tokens whose probabilities sum to p, adapting set size by context.
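A minimal NumPy sketch of the idea, assuming the full next-token probability vector is available; the function name `nucleus_sample` and the example distribution are illustrative.

```python
import numpy as np

def nucleus_sample(probs: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Sample from the smallest set of tokens whose cumulative probability reaches p."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                  # tokens sorted by probability, descending
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1      # smallest prefix covering mass p
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()     # renormalize over the nucleus
    return int(rng.choice(kept, p=kept_probs))

# A peaked distribution keeps a small nucleus; a flat one keeps more tokens.
print(nucleus_sample(np.array([0.6, 0.25, 0.1, 0.05]), p=0.9))
```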
Detecting unauthorized model outputs or data leaks.
Samples from the k highest-probability tokens to limit unlikely outputs.
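A companion sketch for the top-k variant, under the same assumptions (full probability vector, illustrative names).

```python
import numpy as np

def top_k_sample(probs: np.ndarray, k: int = 5, rng=None) -> int:
    """Sample only from the k highest-probability tokens."""
    rng = rng or np.random.default_rng()
    top = np.argsort(probs)[::-1][:k]                # indices of the k most likely tokens
    top_probs = probs[top] / probs[top].sum()        # renormalize over the kept tokens
    return int(rng.choice(top, p=top_probs))
```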
A model that assigns probabilities to sequences of tokens; often trained by next-token prediction.
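Written out as a formula (standard chain-rule notation, not taken from the entry itself), the sequence probability factorizes into next-token terms:

$$p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$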
Prevents attention to future tokens during training/inference.
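A small sketch of how such a mask is typically built, assuming a matrix of raw attention scores; the sequence length and random scores are illustrative.

```python
import numpy as np

seq_len = 4
# Lower-triangular mask: position i may attend only to positions <= i.
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.random.randn(seq_len, seq_len)
# Future positions get -inf before the softmax, so their attention weight becomes 0.
masked_scores = np.where(mask, scores, -np.inf)
```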
Injects sequence order into Transformers, since attention alone is permutation-invariant.
Training objective where the model predicts the next token given previous tokens (causal modeling).
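A toy sketch of the objective: targets are the inputs shifted by one position, and stand-in random logits take the place of a real model's outputs.

```python
import numpy as np

token_ids = np.array([5, 17, 3, 42, 9])              # toy token sequence
inputs, targets = token_ids[:-1], token_ids[1:]      # predict token t+1 from tokens <= t

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    shifted = logits - logits.max(axis=-1, keepdims=True)        # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

vocab_size = 50
logits = np.random.randn(len(inputs), vocab_size)    # stand-in for model outputs
loss = cross_entropy(logits, targets)
```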
Generates sequences one token at a time, conditioning on past tokens.
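A minimal greedy decoding loop, assuming a stand-in `model` callable that maps the current token ids to next-token logits; greedy argmax is used here, but any sampling rule could replace it.

```python
import numpy as np

def generate(model, prompt_ids, max_new_tokens=20, eos_id=0):
    """Autoregressive decoding: each step feeds back the token just produced."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(np.array(ids))     # next-token logits for the current prefix
        next_id = int(np.argmax(logits))  # greedy choice
        ids.append(next_id)
        if next_id == eos_id:             # stop at the end-of-sequence token
            break
    return ids
```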
Maximum number of tokens the model can attend to in one forward pass; constrains long-document reasoning.
How many requests or tokens can be processed per unit time; affects scalability and cost.
Encodes positional information via rotation in embedding space.
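One common formulation, sketched in NumPy for a (seq_len, dim) array with even dim; the base of 10000 follows the usual convention but is an assumption here.

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate pairs of embedding dimensions by a position-dependent angle."""
    seq_len, dim = x.shape                                    # dim assumed even
    positions = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    freqs = base ** (-np.arange(0, dim, 2) / dim)             # one frequency per pair
    angles = positions * freqs                                # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                           # even / odd dims as pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```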
Encodes token position explicitly, often via sinusoids.
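A sketch of the classic sinusoidal scheme, assuming an even embedding dimension; the base constant is the conventional 10000.

```python
import numpy as np

def sinusoidal_positions(seq_len: int, dim: int, base: float = 10000.0) -> np.ndarray:
    """Classic sinusoidal encoding: even dims use sin, odd dims use cos (dim assumed even)."""
    positions = np.arange(seq_len)[:, None]
    div = base ** (np.arange(0, dim, 2) / dim)   # wavelengths grow geometrically with dim index
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe
```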
Attention mechanisms that reduce the quadratic time and memory cost of full self-attention.
The set of tokens a model can represent; impacts efficiency, multilinguality, and handling of rare strings.
Stochastic generation strategies that trade determinism for diversity; key knobs include temperature and nucleus sampling.
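A sketch of the temperature knob alone (nucleus sampling is sketched earlier); scaling the logits before the softmax is the standard formulation, and the function name is illustrative.

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float = 1.0, rng=None) -> int:
    """Lower temperature sharpens the distribution (more deterministic);
    higher temperature flattens it (more diverse). Assumes temperature > 0."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())        # softmax with a stability shift
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```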
Mechanism that computes context-aware mixtures of representations; scales well and captures long-range dependencies.
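A single-head, unmasked sketch of scaled dot-product attention; the projection matrices `wq`, `wk`, `wv` are assumed inputs, and multi-head structure is omitted.

```python
import numpy as np

def self_attention(x: np.ndarray, wq, wk, wv) -> np.ndarray:
    """Single-head attention over a sequence x of shape (seq_len, d_model)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])            # pairwise similarities, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax rows: mixing weights
    return weights @ v                                 # each position mixes all values
```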
Converting text into discrete units (tokens) for modeling; subword tokenizers balance vocabulary size and coverage.
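A toy greedy longest-match tokenizer to illustrate the subword idea; real subword tokenizers (e.g., BPE or WordPiece) learn their vocabulary from data, which this sketch does not.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match subword tokenization over a toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try the longest candidate first
            piece = text[i:j]
            if piece in vocab or j == i + 1:     # fall back to single characters
                tokens.append(piece)
                i = j
                break
    return tokens

print(tokenize("tokens", {"token", "to", "ken", "s"}))  # ['token', 's']
```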
Techniques to handle longer documents without quadratic cost.
Restricting how many requests or tokens a client may issue in a given time window, to protect inference capacity and control cost.

Set of vectors closed under addition and scalar multiplication.
Search algorithm for generation that keeps top-k partial sequences; can improve likelihood but reduce diversity.
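A compact sketch, assuming a stand-in `step_log_probs(seq)` function that returns next-token log-probabilities for a prefix; length normalization and early stopping are omitted.

```python
def beam_search(step_log_probs, start_id: int, beam_width: int = 3, max_len: int = 5):
    """Keep the beam_width highest-scoring partial sequences at each step."""
    beams = [([start_id], 0.0)]                        # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in enumerate(step_log_probs(seq)):
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]                # prune to the top partial sequences
    return beams
```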
Separating data into training (fit), validation (tune), and test (final estimate) to avoid leakage and optimism bias.
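A minimal shuffled split, assuming an array-like dataset; the fraction values and seed are illustrative.

```python
import numpy as np

def split(data: np.ndarray, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then partition into train / validation / test with no overlap."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return data[train], data[val], data[test]
```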
Probabilistic graphical model for structured prediction.
Architecture based on self-attention and feedforward layers; foundation of modern LLMs and many multimodal models.
Stores past attention states to speed up autoregressive decoding.
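A toy cache sketch, assuming single-head vectors; a real implementation would store per-layer, per-head tensors and handle batching.

```python
import numpy as np

class KVCache:
    """Append-only store of past keys and values, so each decoding step
    only computes attention inputs for the newest token."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k: np.ndarray, v: np.ndarray):
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q: np.ndarray) -> np.ndarray:
        k = np.stack(self.keys)                    # (steps, d): all cached keys
        v = np.stack(self.values)
        scores = k @ q / np.sqrt(q.shape[-1])      # new query against every cached key
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ v
```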
Transformer applied to image patches.
The compute and serving cost of running models in production, driven by model size, sequence length, and request volume.
Techniques that discourage overly complex solutions to improve generalization (reduce overfitting).
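One concrete instance is a ridge (L2) penalty added to a squared-error loss; the names and the value of `lam` are illustrative.

```python
import numpy as np

def ridge_loss(w: np.ndarray, X: np.ndarray, y: np.ndarray, lam: float = 0.1) -> float:
    """Squared error plus an L2 penalty that discourages large weights."""
    residual = X @ w - y
    return float(residual @ residual + lam * w @ w)
```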
A measure of a model class’s expressive capacity, defined by the largest set of points the class can shatter (realize every possible labeling of).