ABSTRACT

Critical Thinking (CT) is considered a desirable learning outcome at all levels of education, including Higher Education (HE). Experts agree that CT assessment should include both Multiple Choice and Constructed Response Task (CRT) item types (Hyytinen et al. 2015; Ku 2009; Liu et al. 2014) in order to capture the complexity of the CT construct. Nevertheless, CRTs are rarely used because of scoring costs and reliability issues. According to Liu, Frankel and Roohr (2014), automatic assessment of open-ended answers could be a viable solution to these concerns. In recent years, many attempts have been made to develop and validate tools for the automatic assessment of CT-related skills. Most of these studies applied Natural Language Processing (NLP) techniques to texts written in English, and few attempts have been made to generalize these techniques to other languages.

Therefore, this research aimed to understand which NLP features are most strongly associated with six CT sub-dimensions (Poce 2017) as assessed by human raters in essays written in Italian. The study used a corpus of pre- and post-test essays written in Italian by 200 students who attended a Master's degree university course in "Experimental Education and School Assessment". The essays were assessed both by human raters and by a computerized tool that automatically calculates different kinds of NLP features. Pearson correlations were calculated to explore the strength of the association between NLP features and the CT sub-skills assessed by human raters. The non-parametric Mann-Whitney U test was used to explore differences between pre- and post-test scores on CT sub-skills (as assessed by human raters) and NLP features. We found a strong correlation between the NLP feature "elaboration" and the CT sub-skills relevance (r = 0.89) and importance (r = 0.75). Elaboration was calculated as an inverted measure of verbatim copying. As expected, syntactic complexity correlates positively with Language skills as assessed by human raters, whereas repetition correlates negatively with the same skill. Similar pre-post trends were also observed between the scores provided by human raters and the NLP features. Limitations and future developments are discussed.
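The two statistical procedures mentioned above (Pearson correlation between NLP features and rater scores, and the Mann-Whitney U statistic for pre-post comparisons) can be sketched in pure Python. The data and function names below are illustrative only and are not taken from the study's corpus:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def mann_whitney_u(a, b):
    """U statistic for sample a vs sample b (ties contribute 0.5)."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical scores: an NLP "elaboration" feature vs human-rated relevance.
elaboration = [0.2, 0.5, 0.7, 0.9]
relevance = [1.0, 2.1, 2.9, 4.2]
r = pearson_r(elaboration, relevance)

# Hypothetical pre- vs post-test scores on one CT sub-skill.
pre = [2.0, 2.5, 3.0]
post = [3.0, 3.5, 4.0]
u = mann_whitney_u(post, pre)
```

In practice a statistics library (e.g. `scipy.stats.pearsonr` and `scipy.stats.mannwhitneyu`) would also supply p-values; the sketch above shows only the test statistics themselves.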