ABSTRACT

Humans change their goals in several ways. We develop new preferences over time (e.g., learning to like a new food or to hate a new philosophy); we change our understanding of the world (e.g., deciding that our concept of “people” really should apply to all intelligent and motivated beings); and we discover ways of “hacking” our motivation systems (e.g., drugs) and our representations of the world (e.g., simulation video games, Buddhism). Artificial agents will change their goals in similar ways, to the extent that they share the relevant computational properties of human minds. Here we explore four separable ways that goals can change. Motivation drift and representation drift are accidental changes, while motivation hacking and representation hacking (each of which has been termed “wireheading”) are changes made deliberately by an agent. We conclude that, while each of these four sources of goal change can be mitigated, measures that guard against one type of goal change can make other types more likely. In addition, the most obvious remedies for goal changes introduce substantial design challenges that may make them impractical to implement.