Domain: Evaluation & Benchmarking
Benchmark: A dataset plus metric suite for comparing models; results can be gamed or drift out of alignment with real-world goals.
Brier score: A proper scoring rule measuring the mean squared error between predicted probabilities and binary outcomes; lower is better.
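The definition above describes the Brier score; a minimal sketch (the `brier_score` helper name is illustrative, not a library function):

```python
# Brier score: mean squared error between predicted probabilities
# and binary outcomes (0 or 1). 0 is a perfect forecast.
def brier_score(probs, outcomes):
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Confident, correct forecasts score near 0; confident wrong ones near 1.
print(brier_score([0.9, 0.1], [1, 0]))  # ≈ 0.01
print(brier_score([0.9, 0.1], [0, 1]))  # ≈ 0.81
```

Because it is a proper scoring rule, a forecaster minimizes expected Brier score only by reporting honest probabilities.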
Experiment tracking: Logging hyperparameters, code versions, data snapshots, and results so that experiments can be reproduced and compared.
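A minimal sketch of such a log entry, assuming an append-only JSON-lines file; the field names and `make_run_record` helper are illustrative, not any specific tracking tool's schema:

```python
import hashlib
import json

# Illustrative run record: one self-describing dict per experiment.
def make_run_record(params, code_version, data_hash, metrics):
    return {
        "params": params,            # hyperparameters
        "code_version": code_version,  # e.g. a git commit hash
        "data_hash": data_hash,      # fingerprint of the data snapshot
        "metrics": metrics,          # final results
    }

record = make_run_record(
    params={"lr": 3e-4, "batch_size": 32},
    code_version="a1b2c3d",
    data_hash=hashlib.sha256(b"snapshot-v1").hexdigest()[:12],
    metrics={"val_accuracy": 0.91},
)
# Append one JSON line per run; the file becomes a comparable history.
print(json.dumps(record, sort_keys=True))
```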
Observability: The broader capability to infer a system's internal state from its telemetry; crucial for AI services and agents.
Perplexity: The exponential of the average negative log-likelihood; lower means a better predictive fit, though not necessarily better downstream utility.
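That exponential-of-average-NLL definition can be sketched directly; the `perplexity` helper name is an assumption for illustration:

```python
import math

# Perplexity over the probabilities a model assigned to the observed tokens:
# exp of the average negative log-likelihood.
def perplexity(token_probs):
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that spreads probability uniformly over 4 choices per token
# has perplexity 4: it is "as confused as" a 4-way coin flip.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ≈ 4.0
print(perplexity([1.0, 1.0]))                # 1.0, a perfectly certain model
```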
Precision-recall (PR) curve: Often more informative than ROC on imbalanced datasets because it focuses on positive-class performance.
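A minimal sketch of one point on such a curve; the `precision_recall` helper and the toy data are illustrative assumptions:

```python
# Precision and recall for the positive class at one decision threshold;
# sweeping the threshold traces out the PR curve.
def precision_recall(scores, labels, threshold):
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Imbalanced toy data: only 2 positives among 6 examples.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0, 0]
print(precision_recall(scores, labels, 0.5))  # precision 2/3, recall 1.0
```

Note that the many easy true negatives never enter either formula, which is why PR analysis stays informative when negatives dominate.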
Train/validation/test split: Separating data into training (fitting), validation (tuning), and test (final estimate) sets to avoid leakage and optimism bias.
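A minimal sketch of such a three-way split, assuming a simple shuffled index partition; the fractions, seed, and `split_indices` name are illustrative:

```python
import random

# Shuffle indices once with a fixed seed, then carve off disjoint
# test, validation, and training index sets.
def split_indices(n, val_frac=0.15, test_frac=0.15, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

train, val, test = split_indices(100)
print(len(train), len(val), len(test))  # 70 15 15
```

The test indices are chosen before any model sees the data and are touched only once, for the final estimate; tuning against them would reintroduce the optimism bias the split exists to prevent.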