ABSTRACT

The notion of “wireheading,” or direct stimulation of the brain's reward center, is a well-known concept in neuroscience. In this chapter, we examine the corresponding issue of reward (utility) function integrity in artificially intelligent machines. We survey the relevant literature and propose a number of potential solutions to ensure the integrity of our artificial assistants. Overall, we conclude that wireheading in rational self-improving optimizers above a certain capacity remains an unsolved problem, despite the opinion of many that such machines will choose not to wirehead. The related issue of literalness in goal setting also remains largely unsolved, and we suggest that the development of an unambiguous knowledge transfer language might be a step in the right direction.