TY - JOUR
T1 - Validating automated sentiment analysis of online cognitive behavioral therapy patient texts
T2 - An exploratory study
AU - Provoost, Simon
AU - Ruwaard, Jeroen
AU - van Breda, Ward
AU - Riper, Heleen
AU - Bosse, Tibor
PY - 2019/5/14
Y1 - 2019/5/14
N2 - Introduction: Sentiment analysis may be a useful technique to derive a user's emotional state from free-text input, allowing for more empathic automated feedback in online cognitive behavioral therapy (iCBT) interventions for psychological disorders such as depression. As guided iCBT is considered more effective than unguided iCBT, such automated feedback may help close the gap between the two. The accuracy of automated sentiment analysis is domain-dependent, and it is unclear how well the technology applies to iCBT. This paper presents an empirical study in which automated sentiment analysis by an algorithm for the Dutch language is validated against human judgment. Methods: A total of 493 iCBT user texts were evaluated on overall sentiment and the presence of five specific emotions, both by an algorithm and by 52 psychology students who each evaluated 75 randomly selected texts, yielding about eight human evaluations per text. Inter-rater agreement (IRR) between the algorithm and humans, and among the humans themselves, was analyzed by calculating the intra-class correlation under a numerical interpretation of the data, and Cohen's kappa and Krippendorff's alpha under a categorical interpretation. Results: All analyses indicated moderate agreement between the algorithm and average human judgment with respect to evaluating overall sentiment, and low agreement for the specific emotions. Somewhat surprisingly, the same was the case for the IRR among human judges, which means that the algorithm performed about as well as a randomly selected human judge. Thus, taking average human judgment as the benchmark for the applicability of automated sentiment analysis, the technique can be considered for practical application.
Discussion/Conclusion: The low human-human agreement on the presence of emotions may be due to the nature of the texts; it may simply be difficult for humans to agree on the presence of the selected emotions, or perhaps trained therapists would have reached more consensus. Future research may focus on validating the algorithm against a more solid benchmark, on applying the algorithm in an application in which empathic feedback is provided, for example, by an embodied conversational agent, or on improving the algorithm for the iCBT domain with a bottom-up machine learning approach.
KW - Automated support
KW - Benchmarking and validation
KW - Cognitive behavioral therapy (CBT)
KW - Depression
KW - E-mental health
KW - Embodied conversational agent (ECA)
KW - Internet interventions
KW - Sentiment analysis and opinion mining
UR - http://www.scopus.com/inward/record.url?scp=85068465654&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85068465654&partnerID=8YFLogxK
UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85068465654&origin=inward
UR - https://www.ncbi.nlm.nih.gov/pubmed/31156504
U2 - 10.3389/fpsyg.2019.01065
DO - 10.3389/fpsyg.2019.01065
M3 - Article
C2 - 31156504
SN - 1664-1078
VL - 10
SP - 1
EP - 12
JO - Frontiers in Psychology
JF - Frontiers in Psychology
IS - MAY
M1 - 1065
ER -