Results for "human values"
System-level design for general intelligence.
Ensuring AI allows shutdown.
Tendency for agents to pursue resources regardless of their final goal.
Ensuring learned behavior matches intended objective.
Learned subsystem that optimizes its own objective.
Maintaining alignment under new conditions.
Tradeoff between safety and performance.
Signals indicating dangerous behavior.
Isolating AI systems.
Tendency to seek control and resources.
Designing AI to cooperate with humans and each other.
The learned numeric values of a model, adjusted during training to minimize a loss function.
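A minimal sketch of what "adjusted during training" means in practice: gradient descent nudges a parameter to reduce the loss. The one-weight model and all values below are illustrative.

```python
import numpy as np

# Toy illustration: a single weight of y = w * x, adjusted by gradient
# descent to minimize mean squared error on the data below.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # underlying relationship: y = 2x
w = 0.0                         # the learned value, initialized arbitrarily
lr = 0.05                       # learning rate
for _ in range(100):
    grad = np.mean(2 * (w * x - y) * x)   # d(MSE)/dw
    w -= lr * grad                        # step downhill on the loss
print(w)                        # converges near 2.0
```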
A function measuring prediction error (and sometimes calibration), guiding gradient-based optimization.
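As a concrete instance, a sketch of cross-entropy, the standard prediction-error loss for classifiers; the probabilities and labels are illustrative.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true labels.

    probs:  (n, k) predicted class probabilities, rows summing to 1
    labels: (n,)   integer class indices
    """
    eps = 1e-12  # guard against log(0)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
labels = np.array([0, 1])
print(cross_entropy(probs, labels))  # small: the model is confident and right
```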
Scalar summary of the ROC curve; measures ranking ability, not calibration.
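The "ranking ability" reading has a direct construction: AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative, so any monotone rescaling of the scores leaves it unchanged. A brute-force sketch with illustrative scores:

```python
import numpy as np

def auc(scores, labels):
    """ROC AUC as the probability that a random positive outranks
    a random negative (ties count as half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

scores = np.array([0.1, 0.4, 0.35, 0.8])
labels = np.array([0, 0, 1, 1])
print(auc(scores, labels))  # 0.75: 3 of 4 positive-negative pairs ranked correctly
```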
Average of squared residuals; common regression objective.
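Worked instance of the formula MSE = (1/n) Σ (y_i − ŷ_i)², with illustrative values:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5,  0.0, 2.0])
residuals = y_true - y_pred        # 0.5, -0.5, 0.0
mse = np.mean(residuals ** 2)      # (0.25 + 0.25 + 0.0) / 3
print(mse)                         # 0.1667
```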
Attention where queries/keys/values come from the same sequence, enabling token-to-token interactions.
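A single-head sketch of scaled dot-product self-attention, omitting masking, multiple heads, and the output projection; all weights are random placeholders.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Queries, keys, and values are all projections of the same
    sequence x of shape (seq, d), so every token attends to every token."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (seq, seq) interactions
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)      # softmax over keys
    return weights @ v                             # mix value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                        # 4 tokens, dim 8
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)         # (4, 8)
```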
Feature attribution method grounded in cooperative game theory for explaining predictions in tabular settings.
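The game-theoretic core can be shown directly: each feature's attribution is its weighted average marginal contribution over all coalitions of the other features. The shap library computes fast approximations; this exact brute-force version is only feasible for a few features, and the toy model and baseline below are illustrative.

```python
from itertools import combinations
from math import factorial
import numpy as np

def shapley_values(f, x, baseline):
    """Exact Shapley attributions for the prediction f(x).

    Features outside a coalition are set to `baseline`. Exponential
    cost, so only practical for a handful of features."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for s in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                idx = list(s)
                with_i = baseline.copy()
                with_i[idx + [i]] = x[idx + [i]]
                without = baseline.copy()
                without[idx] = x[idx]
                phi[i] += weight * (f(with_i) - f(without))
    return phi

f = lambda z: 2 * z[0] + z[1] * z[2]               # toy tabular model
x = np.array([1.0, 2.0, 3.0])
print(shapley_values(f, x, baseline=np.zeros(3)))  # [2, 3, 3]; sums to f(x) - f(0)
```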
Predicting future values from past observations.
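One minimal instance is an autoregressive model fit by least squares: the next value is predicted as a linear function of the previous p values. The series below is synthetic and illustrative.

```python
import numpy as np

def fit_ar(series, lags):
    """Least-squares fit of an AR(p) model: each row of X holds the
    previous `lags` values, y holds the value that followed."""
    X = np.column_stack([series[i:len(series) - lags + i] for i in range(lags)])
    y = series[lags:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
t = np.arange(200)
series = np.sin(0.3 * t) + 0.1 * rng.normal(size=200)
coef = fit_ar(series, lags=3)
forecast = series[-3:] @ coef    # one-step-ahead prediction
print(forecast)
```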
Expected return of taking an action in a state.
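A sketch of how such Q(s, a) estimates are learned in the tabular case, via the standard Q-learning update; the state/action sizes and constants are illustrative.

```python
import numpy as np

# Tabular Q-learning on a toy 2-state, 2-action problem.
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9   # learning rate, discount factor

def update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward the bootstrapped
    target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

update(s=0, a=1, r=1.0, s_next=1)
print(Q)
```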
Probability of data given parameters.
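A worked sketch for i.i.d. Gaussian data: holding the data fixed and varying the parameter μ shows the likelihood peaking near the sample mean. Values are illustrative.

```python
import numpy as np

def gaussian_log_likelihood(data, mu, sigma):
    """log p(data | mu, sigma) for i.i.d. Gaussian observations."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu) ** 2 / (2 * sigma**2))

data = np.array([1.8, 2.1, 2.4])
for mu in (0.0, 2.0, 4.0):
    print(mu, gaussian_log_likelihood(data, mu, sigma=1.0))
# Highest at mu = 2.0: the likelihood peaks near the sample mean, 2.1.
```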
Studying internal mechanisms or the influence of inputs on outputs (e.g., saliency maps, SHAP, attention analysis).
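A minimal input-influence sketch: finite-difference saliency approximates the gradient of the output with respect to each input feature. The model here is a toy stand-in; real saliency maps use autodiff.

```python
import numpy as np

def saliency(f, x, eps=1e-5):
    """Finite-difference saliency: sensitivity of the model output
    to each input feature at the point x."""
    grads = np.zeros_like(x)
    for i in range(len(x)):
        x_hi, x_lo = x.copy(), x.copy()
        x_hi[i] += eps
        x_lo[i] -= eps
        grads[i] = (f(x_hi) - f(x_lo)) / (2 * eps)
    return np.abs(grads)

f = lambda z: np.tanh(3 * z[0] + 0.1 * z[1])   # toy model
print(saliency(f, np.array([0.2, 0.5])))       # feature 0 dominates
```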
Ordering training samples from easier to harder to improve convergence or generalization.
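A sketch using token count as the difficulty proxy; in practice the proxy might instead be a weak model's loss on each sample. The data is illustrative.

```python
import numpy as np

def curriculum_order(samples, difficulty):
    """Order training samples easy-to-hard by a difficulty score."""
    return [samples[i] for i in np.argsort(difficulty)]

samples = ["a cat", "the quick brown fox jumps", "hi", "dogs bark loudly"]
difficulty = [len(s.split()) for s in samples]   # proxy: number of tokens
for s in curriculum_order(samples, difficulty):
    print(s)   # shortest (easiest) sentences first
```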
A model that assigns probabilities to sequences of tokens; often trained by next-token prediction.
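The smallest such model is a bigram counter: estimate next-token probabilities from counts and multiply them along a sequence. The corpus is a toy; real models replace the counts with a neural network trained on the same next-token objective.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1            # next-token counts per context

def next_token_prob(prev, nxt):
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

def sequence_prob(tokens):
    p = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        p *= next_token_prob(prev, nxt)   # chain rule over the sequence
    return p

print(next_token_prob("the", "cat"))          # 2/3
print(sequence_prob(["the", "cat", "sat"]))   # (2/3) * (1/2) = 1/3
```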
Tendency to trust automated suggestions even when incorrect; mitigated by UI design, training, and checks.
Converting spoken audio into text, often using encoder-decoder or transducer architectures.
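One piece small enough to sketch is the decoding step used by CTC-trained acoustic models, a common alternative to the encoder-decoder and transducer families named above: merge repeated frame labels, then drop blanks. The frame outputs and vocabulary below are hypothetical.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse frame-level predictions CTC-style: merge repeats,
    then drop the blank symbol."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

vocab = {1: "h", 2: "i"}
frames = [0, 1, 1, 0, 2, 2, 2, 0]   # hypothetical per-frame argmax output
print("".join(vocab[i] for i in ctc_greedy_decode(frames)))  # "hi"
```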
Extending agents with long-term memory stores.
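A minimal sketch of such a store, assuming embeddings are supplied externally: keep (vector, text) pairs and retrieve the closest by cosine similarity. Real agent memories typically add recency and importance weighting; the entries below are illustrative.

```python
import numpy as np

class VectorMemory:
    """Long-term memory as (embedding, text) pairs with cosine retrieval."""
    def __init__(self):
        self.keys, self.texts = [], []

    def add(self, embedding, text):
        self.keys.append(embedding / np.linalg.norm(embedding))
        self.texts.append(text)

    def retrieve(self, query, k=1):
        q = query / np.linalg.norm(query)
        sims = np.array(self.keys) @ q            # cosine similarities
        return [self.texts[i] for i in np.argsort(sims)[::-1][:k]]

mem = VectorMemory()
mem.add(np.array([1.0, 0.0]), "user prefers metric units")
mem.add(np.array([0.0, 1.0]), "meeting moved to Friday")
print(mem.retrieve(np.array([0.9, 0.1])))  # recalls the closer memory
```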
Generating speech audio from text, with control over prosody, speaker identity, and style.
Legal or policy requirement to explain AI decisions.
Combining signals from multiple modalities.
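The simplest scheme is feature-level (early) fusion: concatenate per-modality embeddings and let a shared layer mix them. Features and weights below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
image_feat = rng.normal(size=16)   # e.g., from a vision encoder
audio_feat = rng.normal(size=8)    # e.g., from an audio encoder

# Concatenate modality embeddings, then mix with a shared linear layer.
fused = np.concatenate([image_feat, audio_feat])
W = rng.normal(size=(4, fused.size))
print(W @ fused)                   # joint representation for downstream tasks
```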
Generating audio waveforms from spectrograms.
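Modern vocoders are neural (e.g., HiFi-GAN), but the classical Griffin-Lim algorithm shows the task: iteratively estimate phase to invert a magnitude spectrogram back to a waveform. A sketch using librosa, with a synthetic tone standing in for speech:

```python
import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440 * t)      # stand-in for real speech audio
S = np.abs(librosa.stft(y))                # magnitude spectrogram (phase discarded)
y_rec = librosa.griffinlim(S, n_iter=32)   # waveform recovered by phase iteration
print(y_rec.shape)
```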