
Table 17.2: Evaluation Criteria and Standards

Each criterion below is presented with its definition and the standard used to evaluate it.

1. Appropriateness

Definition: The match of the instrument to the purpose or question under study. One must determine what information is required and what use will be made of the information gathered (Wade 1992).

Standard: Depends upon the specific purpose for which the measurement is intended.

2. Reliability

Definition: Refers to the reproducibility and internal consistency of the instrument.
  • Reproducibility addresses the degree to which a score is free from random error. Test-retest and inter-observer reliability both focus on this aspect of reliability and are commonly evaluated using correlation statistics, including the intra-class correlation coefficient (ICC), Pearson's or Spearman's coefficients, and kappa coefficients (weighted or unweighted).
  • Internal consistency assesses the homogeneity of the scale items. It is generally examined using split-half reliability or Cronbach's alpha statistics. Item-to-item and item-to-scale correlations are also accepted methods.

Standard: Test-retest or inter-observer reliability (ICC or kappa statistics; Andresen 2000; Hsueh et al. 2001; Wolfe et al. 1991):
  • Excellent: ≥ 0.75
  • Adequate: 0.40-0.74
  • Poor: < 0.40
Note: Fitzpatrick et al. (1998) recommend a minimum test-retest reliability of 0.90 if the measure is to be used to evaluate the ongoing progress of an individual in a treatment situation.
Internal consistency (split-half or Cronbach's α statistics; Andresen 2000):
  • Excellent: ≥ 0.80
  • Adequate: 0.70-0.79
  • Poor: < 0.70
Note: Fitzpatrick et al. (1998) caution that α values in excess of 0.90 may indicate item redundancy.
Adequate levels of inter-item and item-to-scale correlation coefficients (Hobart et al. 2001; Fitzpatrick et al. 1998):
  • inter-item: between 0.3 and 0.9
  • item-to-scale: between 0.2 and 0.9
A computational sketch of the two most commonly cited reliability statistics follows.
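
To make the standards above concrete, here is a minimal sketch in Python/NumPy of the two statistics the table cites most often: Cronbach's alpha for internal consistency and ICC(2,1) (two-way random effects, absolute agreement, single rater; Shrout & Fleiss 1979) for reproducibility. The function names and simulated scores are illustrative assumptions, not part of the table or its cited sources.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_subjects, n_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1) for an (n_subjects, n_raters) matrix, via the two-way
    ANOVA decomposition (subjects x raters, absolute agreement)."""
    n, k = ratings.shape
    grand = ratings.mean()
    ms_r = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # subjects
    ms_c = n * ((ratings.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # raters
    ss_e = ((ratings - grand) ** 2).sum() - (n - 1) * ms_r - (k - 1) * ms_c
    ms_e = ss_e / ((n - 1) * (k - 1))                                 # residual
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Hypothetical data: 30 patients, 5 items sharing a common true score.
rng = np.random.default_rng(0)
true_score = rng.normal(50, 10, size=(30, 1))
items = true_score + rng.normal(0, 3, size=(30, 5))
print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")  # >= 0.80 is excellent
print(f"ICC(2,1)         = {icc_2_1(items[:, :2]):.2f}")  # >= 0.75 is excellent
```

Note that the two statistics are graded against different cut-points (0.75/0.40 for ICC versus 0.80/0.70 for α), per Andresen (2000).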

3. Validity

Definition: Does the instrument measure what it purports to measure? Forms of validity include face, content, construct, and criterion validity. Concurrent, convergent or discriminative, and predictive validity are all considered forms of criterion validity. However, concurrent, convergent, and discriminative validity all depend on the existence of a "gold standard" to provide a basis for comparison. If no gold standard exists, they represent a form of construct validity in which the relationship to another measure is hypothesized (Finch et al. 2002).

Standard: Construct/convergent and concurrent correlations (Andresen 2000; McDowell & Newell 1996; Fitzpatrick et al. 1998; Cohen et al. 2000):
  • Excellent: ≥ 0.60; Adequate: 0.31-0.59; Poor: ≤ 0.30
ROC analysis, area under the curve (AUC; McDowell & Newell 1996):
  • Excellent: ≥ 0.90; Adequate: 0.70-0.89; Poor: < 0.70
There are no agreed-upon standards by which to judge sensitivity and specificity as a validity index (Riddle & Stratford 1999).
Predictive validity: According to Shukla et al. (2011), when using many of these instruments there is no "defined threshold score beyond which an accurate prediction can be made". A sketch of the AUC computation follows.
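
As an illustration of the ROC standard above, the AUC can be computed without tracing the curve itself, using the rank-sum (Mann-Whitney) identity: the AUC equals the probability that a randomly chosen case scores higher than a randomly chosen control, counting ties as half. The helper name and the simulated data below are hypothetical.

```python
import numpy as np

def roc_auc(scores: np.ndarray, is_case: np.ndarray) -> float:
    """AUC via the Mann-Whitney identity:
    AUC = P(score_case > score_control) + 0.5 * P(tie)."""
    pos = scores[is_case]
    neg = scores[~is_case]
    # Compare every case score against every control score.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical screening scores for 40 cases and 60 controls.
rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(60, 10, 40), rng.normal(45, 10, 60)])
is_case = np.concatenate([np.ones(40, bool), np.zeros(60, bool)])
auc = roc_auc(scores, is_case)
label = "excellent" if auc >= 0.90 else "adequate" if auc >= 0.70 else "poor"
print(f"AUC = {auc:.2f} -> {label} (McDowell & Newell 1996 cut-points)")
```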

4. Responsiveness

Definition: Sensitivity to change within patients over time, which may be indicative of therapeutic effects. Responsiveness is most commonly evaluated through correlation with other change scores, effect sizes, standardized response means, relative efficiency, sensitivity and specificity of change scores, and ROC analysis. Assessment of possible floor and ceiling effects is included, as these effects indicate limits to the range of detectable change beyond which no further improvement or deterioration can be noted.

Standard: Sensitivity to change (Andresen 2000; McDowell & Newell 1996; Fitzpatrick et al. 1998; Cohen et al. 2000):
  • Excellent: Evidence of change in the expected direction using methods such as standardized effect sizes (<0.5 = small; 0.5-0.8 = moderate; ≥0.8 = large), standardized response means, ROC analysis of change scores (area under the curve; see above), or relative efficiency.
  • Adequate: Evidence of moderate change, or of less change than expected; conflicting evidence.
  • Poor: Weak evidence based solely on p-values (statistical significance).
Floor/ceiling effects (Hobart et al. 2001):
  • Excellent: No floor or ceiling effects.
  • Adequate: Floor or ceiling effects present, but ≤20% of patients attain the minimum (floor) or maximum (ceiling) score.
  • Poor: >20% of patients attain the minimum or maximum score.
A sketch of these change indices follows.
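
The change indices above are simple ratios, sketched below under the assumption of paired baseline and follow-up scores on a 0-100 scale; the helper names and simulated data are illustrative only.

```python
import numpy as np

def effect_size(baseline: np.ndarray, followup: np.ndarray) -> float:
    """Standardized effect size: mean change / SD of baseline scores."""
    return (followup - baseline).mean() / baseline.std(ddof=1)

def srm(baseline: np.ndarray, followup: np.ndarray) -> float:
    """Standardized response mean: mean change / SD of the change scores."""
    change = followup - baseline
    return change.mean() / change.std(ddof=1)

def floor_ceiling(scores: np.ndarray, lo: float, hi: float) -> tuple[float, float]:
    """Percentage of patients at the scale minimum (floor) and maximum (ceiling)."""
    return 100 * (scores == lo).mean(), 100 * (scores == hi).mean()

# Hypothetical pre/post scores for 50 patients on a 0-100 scale.
rng = np.random.default_rng(2)
pre = np.clip(rng.normal(55, 15, 50).round(), 0, 100)
post = np.clip(pre + rng.normal(8, 10, 50).round(), 0, 100)
print(f"effect size = {effect_size(pre, post):.2f}, SRM = {srm(pre, post):.2f}")
floor_pct, ceil_pct = floor_ceiling(post, 0, 100)
print(f"floor = {floor_pct:.0f}%, ceiling = {ceil_pct:.0f}% (>20% is rated poor)")
```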

5. Precision

Definition: The number of gradations or distinctions within the measurement; for example, a yes/no response versus a 7-point Likert response set.

Standard: Depends on the precision required for the purpose of the measurement (e.g., classification, evaluation, prediction).

6. Interpretability

Definition: How meaningful are the scores? Are there consistent definitions and classifications for results? Are there norms available for comparison?

Jutai and Teasell (2003) point out that these practical issues should not be separated from consideration of the values that underlie the selection of outcome measures. A brief assessment of practicality will accompany each summary evaluation.

7. Acceptability

Definition: How acceptable the scale is in terms of completion by the patient. Does it represent a burden? Can the assessment be completed by proxy if necessary?

8. Feasibility

Definition: The extent of effort, burden, expense, and disruption to staff and clinical care arising from the administration of the instrument.