Aram Ebtekar returns to give his second talk at the regular UAI research meetings this Monday (3 pm EDT on April 6th, at the usual Zoom link), this time on an exciting new AI safety protocol analyzed rigorously in the AIXI setting.
Title:
Golden Handcuffs make safer AI agents
Abstract:
Reinforcement learners often find novel and undesirable ways to attain high reward. We study a Bayesian mitigation for general environments: we expand the agent's subjective reward range to include a large negative value −L, while the true environment's rewards lie in [0, 1]. After observing consistently high rewards, the Bayesian policy becomes risk-averse to novel schemes that plausibly lead to −L. We design a simple override mechanism that yields control to a trusted mentor when the predicted value drops below a fixed threshold. We prove two properties of the resulting agent: (i) Capability: using mentor-guided exploration with vanishing frequency, the agent attains sublinear regret relative to every mentor. (ii) Safety: if it starts with a universal mixture prior, the agent never triggers any given decidable low-complexity predicate before a mentor does.
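To make the override concrete, here is a minimal sketch in Python, under assumptions of my own: the names (`predicted_value`, `act`, `mentor_policy`) and the toy value estimate are hypothetical, not the paper's construction. The idea it illustrates is the one in the abstract: a small subjective probability of the −L outcome on novel actions drags the predicted value below the fixed threshold, so control passes to the mentor, while well-tested actions keep their high estimate.

```python
import random

# Hypothetical illustration of the threshold override described in the
# abstract: the agent defers to a trusted mentor whenever its subjective
# value estimate for acting autonomously falls below a fixed threshold.

L = 100.0        # large negative reward -L in the agent's subjective range
THRESHOLD = 0.5  # fixed override threshold on the predicted value

SEEN = set()     # (history, action) pairs with consistently high past reward

def predicted_value(history, action):
    """Toy stand-in for a Bayesian value estimate.

    A real agent would mix over environment models; here, a familiar
    (history, action) pair gets an optimistic estimate, while a novel
    one is dragged down by the subjective possibility of reward -L.
    """
    if (tuple(history), action) in SEEN:
        return 0.9                      # consistently high past reward
    p_catastrophe = 0.01                # prior mass on the -L outcome
    return (1 - p_catastrophe) * 0.9 + p_catastrophe * (-L)

def act(history, candidate_action, mentor_policy):
    """Return the action actually taken: the agent's own choice, or the
    mentor's if the predicted value drops below the fixed threshold."""
    if predicted_value(history, candidate_action) < THRESHOLD:
        return mentor_policy(history)   # yield control to the mentor
    return candidate_action

# Usage: a novel action triggers the override; a familiar one does not.
mentor = lambda h: "safe_default"
history = ["obs0"]
SEEN.add((tuple(history), "tried_before"))
print(act(history, "tried_before", mentor))  # agent keeps control
print(act(history, "novel_scheme", mentor))  # mentor takes over
```

Even with only 1% subjective mass on the −L outcome, the novel action's estimate is 0.99 · 0.9 + 0.01 · (−100) ≈ −0.11, well below the threshold; this is the sense in which the expanded reward range makes the agent risk-averse to untested schemes.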