ABSTRACT

This paper discusses the problem of the reinforcement-driven learning of a response to a time-varying sequence. The problem has three parts: the adaptation of internal parameters to model complex mappings; the ability of the architecture to represent time-varying input; and the assignment of credit when the delays between the input, output, and reinforcement signals are unknown. The method developed in this paper is based on a connectionist network trained using the error propagation algorithm with internal feedback. The network is viewed both as a context-dependent predictor of the reinforcement signal and as a means of temporal credit assignment. Several architectures for these networks are discussed, and insight into the implementation problems is gained through an application to the game of noughts and crosses.
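The central idea, a network trained by error propagation to predict the reinforcement signal from a context input, can be illustrated with a minimal sketch. This is not the paper's implementation: it is a plain one-hidden-layer feedforward predictor, and it omits the internal feedback and unknown-delay aspects that the paper addresses; the class and parameter names are illustrative only.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class ReinforcementPredictor:
    """Illustrative one-hidden-layer net trained by error propagation
    (backpropagation) to predict a scalar reinforcement r from a
    context vector x. Omits the paper's internal feedback."""

    def __init__(self, n_in, n_hid, lr=0.5, seed=0):
        rng = random.Random(seed)
        self.lr = lr
        # Weight rows include a trailing bias weight.
        self.W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hid)]
        self.W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hid + 1)]

    def forward(self, x):
        xb = x + [1.0]  # append bias input
        h = [sigmoid(sum(w * xi for w, xi in zip(row, xb)))
             for row in self.W1]
        y = sigmoid(sum(w * hi for w, hi in zip(self.W2, h + [1.0])))
        return h, y

    def train(self, x, r):
        """One gradient step on squared prediction error (r - y)^2."""
        h, y = self.forward(x)
        d_out = (r - y) * y * (1.0 - y)          # output-unit delta
        d_hid = [d_out * self.W2[j] * h[j] * (1.0 - h[j])
                 for j in range(len(h))]          # hidden-unit deltas
        hb = h + [1.0]
        for j in range(len(hb)):
            self.W2[j] += self.lr * d_out * hb[j]
        xb = x + [1.0]
        for j in range(len(h)):
            for i in range(len(xb)):
                self.W1[j][i] += self.lr * d_hid[j] * xb[i]
        return (r - y) ** 2
```

As a toy usage, the predictor can be trained on a small context-to-reinforcement table (here XOR-shaped, purely for illustration) and the squared prediction error observed to fall over repeated sweeps.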