ABSTRACT

Reinforcement learning (RL) emphasizes learning through interaction with a real or virtual environment, and feedback from that environment is essential for learning. RL can be used when the correct answer is difficult to define or when the task requires the agent to take a long sequence of steps. RL can take a model-based approach, such as a Markov decision process (MDP). An MDP is similar to a Markov chain but also involves actions and utilities. The resulting optimization problem can be solved by dynamic programming with either policy-based or value-based algorithms. Model-free RL approaches include Bayesian Q-learning. RL with Monte Carlo simulations can provide a rational basis for decision-making and help in optimizing a compound’s regulatory strategy and determining its commercial position and value. One challenge in applying an MDP to drug development is that the transition probabilities depend on model parameters (e.g., treatment effect) that are unknown.