DPO
Intermediate
A preference-based training method that optimizes policies directly from pairwise comparisons, without explicit RL loops.
Definition
Full Definition
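Direct Preference Optimization (DPO) is a preference-based training method that optimizes a policy directly from pairwise comparisons of responses (a preferred one and a rejected one), without fitting a separate reward model or running an explicit reinforcement learning loop.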
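To make "directly from pairwise comparisons" concrete, here is a minimal sketch of the per-pair DPO loss in its commonly cited form: the negative log-sigmoid of a scaled difference between the policy's and a frozen reference model's log-probability ratios. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not part of this entry; the inputs are assumed to be summed token log-probabilities of each response.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (illustrative sketch; names and beta are assumptions).

    Each argument is the log-probability a model assigns to a full response:
    "chosen" is the preferred response, "rejected" the dispreferred one.
    """
    # How much more (in log space) the policy favors each response
    # compared with the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin: positive when the policy favors the
    # chosen response more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference has zero margin, so the loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Raising the chosen response's log-probability lowers the loss.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

In practice this quantity would be averaged over a batch of preference pairs and minimized with an ordinary gradient-based optimizer, which is why no sampling or reward-model rollout step is needed.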