This chapter develops an argument for acknowledging the ways in which computer vision sees the world when machine learning techniques are introduced into the study of images of conflict. Reflecting on the process of training a computer vision system to see violence in Twitter images from protests, we propose a conceptual framework for tracing how computer vision learns to see. The framework operates along three dimensions: (a) technology, that is, the problems current computer vision solutions face when working with a socio-culturally complex phenomenon such as the visual representation of conflict; (b) epistemology, at the intersection of humanistic and sociological inquiry and computer vision; and (c) humans, both as the subjects of computer vision and as the often invisible actors involved in training a computer vision system. Tracing how computer vision learns to see along the dimensions of technology, epistemology and humans reminds us to resist the oversimplification and classification that serve quantification. Instead, this framework encourages us to produce meaning by unfolding the complexity of computer vision’s ways of seeing images of conflict.