Results for "text+image+audio"
Generating speech audio from text, with control over prosody, speaker identity, and style.
Detecting trigger phrases in audio streams.
Generating audio waveforms from spectrograms.
Converting spoken audio into text, often using encoder-decoder or transducer architectures.
Identifying speakers in audio.
Joint vision-language model aligning images and text.
Mapping audio signals to linguistic units.
Aligning transcripts with audio timestamps.
Generating human-like speech from text.
Transformer applied to image patches.
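A minimal sketch of the patch step such a model starts from: splitting an image into non-overlapping patches and flattening each into a vector, which are then treated as a token sequence. The image size and patch size below are illustrative, not from any particular model.

```python
import numpy as np

def patchify(img, P):
    """Split an (H, W, C) image into flattened, non-overlapping P x P patches."""
    H, W, C = img.shape
    # Group rows and columns into P-sized blocks, then flatten each block.
    patches = img.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
    return patches

img = np.zeros((8, 8, 3))          # toy 8x8 RGB image
print(patchify(img, 4).shape)      # → (4, 48): 4 patches of 4*4*3 values
```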
Models that process or generate multiple modalities, enabling vision-language tasks, speech, video understanding, etc.
Converting text into discrete units (tokens) for modeling; subword tokenizers balance vocabulary size and coverage.
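One common decoding strategy can be sketched as greedy longest-match against a vocabulary; the toy vocabulary below is purely illustrative, not from any real tokenizer.

```python
def tokenize(text, vocab):
    """Greedily split text into the longest matching vocabulary entries."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, shrinking until one matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

vocab = {"un", "break", "able"}
print(tokenize("unbreakable", vocab))  # → ['un', 'break', 'able']
```

Real subword tokenizers (BPE, WordPiece, unigram) learn the vocabulary from data and handle merges more carefully, but the longest-match idea captures the core trade-off between vocabulary size and coverage.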
Assigning labels per pixel (semantic) or per instance (instance segmentation) to map object boundaries.
Assigning category labels to images.
External sensing of surroundings (vision, audio, lidar).
A model that assigns probabilities to sequences of tokens; often trained by next-token prediction.
A high-capacity language model trained on massive corpora, exhibiting broad generalization and emergent behaviors.
Pixel-wise classification of image regions.
Combining signals from multiple modalities.
Architecture based on self-attention and feedforward layers; foundation of modern LLMs and many multimodal models.
Generating sequences one token at a time, conditioning on past tokens.
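The loop itself is simple; the sketch below uses a fixed bigram score table as a stand-in "model" (purely illustrative) and greedy decoding. A real LLM would score the next token from the full history instead of just the last token.

```python
# Toy next-token scores: maps the previous token to candidate scores.
bigram_scores = {
    "<s>": {"the": 2.0, "a": 1.0},
    "the": {"cat": 1.5, "dog": 1.0},
    "cat": {"sat": 2.0, "</s>": 0.5},
    "sat": {"</s>": 3.0},
}

def generate(start="<s>", max_len=10):
    """Greedy autoregressive decoding: repeatedly append the best next token."""
    seq = [start]
    while seq[-1] in bigram_scores and len(seq) < max_len:
        scores = bigram_scores[seq[-1]]
        seq.append(max(scores, key=scores.get))  # greedy choice
    return seq

print(generate())  # → ['<s>', 'the', 'cat', 'sat', '</s>']
```

Sampling variants (temperature, top-k, nucleus) replace the greedy `max` with a draw from the score distribution.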
The text (and possibly other modalities) given to an LLM to condition its output behavior.
Human or automated process of assigning target labels; quality, consistency, and clear guidelines matter heavily.
Expanding training data via transformations (flips, noise, paraphrases) to improve robustness.
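A minimal sketch of two of the transformations named above, on a fake 2x3 "image"; the shapes and noise scale are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, noise_std=0.1):
    """Horizontal flip plus additive Gaussian noise."""
    flipped = image[:, ::-1]                           # mirror left-right
    return flipped + rng.normal(0, noise_std, image.shape)

img = np.arange(6, dtype=float).reshape(2, 3)
out = augment(img)
print(out.shape)  # same shape as the input: (2, 3)
```

Each epoch sees a slightly different version of every example, which acts as a regularizer.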
Models that learn to generate samples resembling training data.
Attention between different modalities.
Recovering 3D structure from images.
Mechanism that computes context-aware mixtures of representations; scales well and captures long-range dependencies.
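Scaled dot-product attention for a single head can be written in a few lines; Q, K, V below are random stand-ins for the learned projections of the input sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # pairwise similarities
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # context-aware mixture of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one mixed representation per position
```

With Q taken from one modality and K, V from another, the same function computes the cross-attention mentioned above.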
Training objective where the model predicts the next token given previous tokens (causal modeling).
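The objective reduces to cross-entropy on a shifted sequence: the target at position t is the observed token at t+1. The logits and token IDs below are random, purely for illustration.

```python
import numpy as np

def next_token_loss(logits, tokens):
    """Mean causal cross-entropy: predict tokens[t+1] from logits[t]."""
    inputs, targets = logits[:-1], tokens[1:]               # shift by one
    logp = inputs - np.log(np.exp(inputs).sum(-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 10))   # 5 positions, vocabulary of 10
tokens = np.array([3, 1, 4, 1, 5])
loss = next_token_loss(logits, tokens)
print(float(loss) > 0)  # cross-entropy is non-negative: True
```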
Inputs crafted to cause model errors or unsafe behavior, often imperceptible in vision or subtle in text.
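The fast-gradient-sign idea is one way such inputs are crafted: perturb each input dimension in the direction that increases the loss. The sketch below applies it to a toy logistic-regression classifier; the weights, input, and step size are all illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fgsm(x, y, w, eps=0.25):
    """FGSM-style perturbation for logistic regression on input x."""
    p = sigmoid(w @ x)
    grad_x = (p - y) * w              # d(cross-entropy)/dx for this model
    return x + eps * np.sign(grad_x)  # step up the loss gradient

w = np.array([1.0, -2.0, 0.5])
x = np.array([0.5, -0.5, 1.0])        # confidently positive: w @ x = 2.0
x_adv = fgsm(x, y=1.0, w=w)
print(sigmoid(w @ x_adv) < sigmoid(w @ x))  # confidence drops: True
```

For deep networks the gradient comes from backpropagation rather than a closed form, but the perturbation rule is the same.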