Results for "hidden objectives"
Model exploits poorly specified objectives.
Model optimizes objectives misaligned with human values.
Agents have opposing objectives.
A hidden variable influences both cause and effect, biasing naive estimates of causal impact.
Hidden behavior activated by specific triggers, causing targeted mispredictions or undesired outputs.
A narrow hidden layer that forces the network to learn compact representations.
Extracting system prompts or hidden instructions.
Probabilistic energy-based neural network with hidden variables.
Models time evolution via hidden states.
Temporary reasoning space (often hidden from the user) where a model works through intermediate steps.
Probabilistic model for sequential data with latent states.
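
As an illustration of the last entry, here is a minimal Python sketch of a latent-state model for sequential data: the forward algorithm for a discrete hidden Markov model, which computes the probability of an observed sequence by summing over all hidden state paths. The function name forward_likelihood and the toy probability tables are assumptions made for this example and do not come from the listing.

```python
def forward_likelihood(observations, initial, transition, emission):
    """Return P(observations) for a discrete HMM using the forward algorithm.

    initial[i]        : probability of starting in latent state i
    transition[i][j]  : probability of moving from state i to state j
    emission[i][o]    : probability of emitting symbol o while in state i
    """
    n_states = len(initial)
    # alpha[i] = P(observations so far, current latent state = i)
    alpha = [initial[i] * emission[i][observations[0]] for i in range(n_states)]
    for obs in observations[1:]:
        # Propagate one step through the transitions, then weight by the emission.
        alpha = [
            emission[j][obs] * sum(alpha[i] * transition[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    # Marginalize over the final latent state.
    return sum(alpha)


# Toy two-state, two-symbol model (illustrative numbers only).
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.9, 0.1], [0.2, 0.8]]

likelihood = forward_likelihood([0, 1, 0], initial, transition, emission)
print(likelihood)
```

The latent states never appear in the output. Only the observed symbols are supplied, and the recursion sums out every possible hidden path in time linear in the sequence length.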