Introduction. We wanted to measure adherence to the guideline for depression in disability assessments. The research questions we addressed were: How can we develop performance indicators (PIs) for adherence to the Dutch guideline for disability assessment of patients with depression, and how can we measure the quality of the scores? What is the inter-rater reliability of these PIs? What is the quality of the PI scores? Methods. PIs developed by the researchers were reviewed on various aspects by a panel of seven experts in several consulting rounds. After the PIs had been adjusted, senior insurance physicians (IPs) attended two training sessions and scored the PIs on 10 different simulated case reports. Two researchers developed proxy 'gold standard' scores for these 10 case reports. To assess the inter-rater reliability, we calculated the intra-class correlation (ICC) and 95% confidence interval (CI) of the PI scores; to assess the quality of the scores, we calculated the ICC and 95% CI of the PI scores compared with the proxy 'gold standard'. Results. Six specific and relevant PIs resulted from the consultation of the panel of experts. The PI scores for the 10 case reports, rated by the seven (of eight) senior IPs who completed both training sessions, showed that the PIs were not reliable at the level of an individual rater (ICC=0.543; 95% CI 0.426–0.642). However, reliability improved when the scores were calculated as an average of two raters (ICC=0.704). The ICC of the PI scores with the proxy 'gold standard' was 0.538 (95% CI 0.419–0.640), but the quality was higher when calculated as an average of two raters (ICC=0.700). Conclusion. The PIs for adherence to the guideline were sufficiently reliable, and the quality of their scores was adequate, if at least two well-trained raters were involved. The senior IPs evaluated the feasibility of the PIs as good, with a prerequisite of sufficient training. This method may be useful for measuring guideline adherence and the quality of disability assessments in general.
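The abstract does not state how the two-rater ICCs were derived from the single-rater values. As a minimal sketch, assuming the standard Spearman-Brown relation between a single-rater ICC and the reliability of the mean of k raters, the reported figures can be reproduced as follows. The function name `average_rater_icc` is hypothetical and chosen only for illustration.

```python
def average_rater_icc(icc_single: float, k: int) -> float:
    """Spearman-Brown step-up: reliability of the mean of k raters,
    given the single-rater ICC. Assumed here; the abstract does not
    name the formula the authors used."""
    if k < 1:
        raise ValueError("k must be a positive number of raters")
    return k * icc_single / (1 + (k - 1) * icc_single)


# Single-rater values reported in the abstract.
icc_reliability = 0.543    # agreement among the senior IPs
icc_quality = 0.538        # agreement with the proxy 'gold standard'

reliability_two = average_rater_icc(icc_reliability, 2)
quality_two = average_rater_icc(icc_quality, 2)

print(f"two-rater reliability: {reliability_two:.3f}")
print(f"two-rater quality:     {quality_two:.3f}")
```

Under this assumption, the two-rater values round to 0.704 and 0.700, in line with the averages reported above.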
© 2011 Informa UK, Ltd.