Reinforcement learning for intensive care medicine: actionable clinical insights from novel approaches to reward shaping and off-policy model evaluation

Luca F. Roggeveen, Ali el Hassouni, Harm Jan de Grooth, Armand R.J. Girbes, Mark Hoogendoorn, Paul W.G. Elbers

Research output: Contribution to journalArticleAcademicpeer-review


Background: Reinforcement learning (RL) holds great promise for intensive care medicine given the abundant availability of data and frequent sequential decision-making. But despite the emergence of promising algorithms, RL driven bedside clinical decision support is still far from reality. Major challenges include trust and safety. To help address these issues, we introduce cross off-policy evaluation and policy restriction and show how detailed policy analysis may increase clinical interpretability. As an example, we apply these in the setting of RL to optimise ventilator settings in intubated covid-19 patients. Methods: With data from the Dutch ICU Data Warehouse and using an exhaustive hyperparameter grid search, we identified an optimal set of Dueling Double-Deep Q Network RL models. The state space comprised ventilator, medication, and clinical data. The action space focused on positive end-expiratory pressure (peep) and fraction of inspired oxygen (FiO2) concentration. We used gas exchange indices as interim rewards, and mortality and state duration as final rewards. We designed a novel evaluation method called cross off-policy evaluation (OPE) to assess the efficacy of models under varying weightings between the interim and terminal reward components. In addition, we implemented policy restriction to prevent potentially hazardous model actions. We introduce delta-Q to compare physician versus policy action quality and in-depth policy inspection using visualisations. Results: We created trajectories for 1118 intensive care unit (ICU) admissions and trained 69,120 models using 8 model architectures with 128 hyperparameter combinations. For each model, policy restrictions were applied. In the first evaluation step, 17,182/138,240 policies had good performance, but cross-OPE revealed suboptimal performance for 44% of those by varying the reward function used for evaluation. Clinical policy inspection facilitated assessment of action decisions for individual patients, including identification of action space regions that may benefit most from optimisation. Conclusion: Cross-OPE can serve as a robust evaluation framework for safe RL model implementation by identifying policies with good generalisability. Policy restriction helps prevent potentially unsafe model recommendations. Finally, the novel delta-Q metric can be used to operationalise RL models in clinical practice. Our findings offer a promising pathway towards application of RL in intensive care medicine and beyond.

Original languageEnglish
Article number32
JournalIntensive Care Medicine Experimental
Issue number1
Publication statusPublished - Dec 2024


  • Clinical policy inspection
  • Covid-19
  • Cross off-policy evaluation
  • Explainability
  • Mechanical ventilation
  • Peep
  • Policy filtering
  • Reinforcement learning
  • Safety
  • fo2

Cite this