The last three decades of the 20th century witnessed a surge of interest in the development and use of standardized structured and semi-structured interviews for the diagnosis of mental disorders. This activity was the culmination of several decades of growing dissatisfaction with the outcomes of traditional unstructured interviews.
By the 1970s it was recognized that by using such
methods, clinicians commonly arrived at dissimilar diagnoses,
and rates of diagnostic agreement were no better than could be expected by chance (see
Beck, Ward, Mendelson, Mock, & Erbaugh, 1962; Spitzer & Fleiss, 1974). Clearly,
this state of affairs hampered advancement of knowledge about psychopathology.
Improving the reliability of psychiatric diagnoses became a research priority.
Structured and
semi-structured interviews are specifically designed to minimize the sources of variability that render diagnoses
unreliable. In traditional unstructured interviews, the clinician is entirely responsible for
determining what questions to ask and how the resulting information is to be used in
arriving at a diagnosis. Substantial inconsistency in outcomes is often the result, even when
explicit diagnostic criteria are available for reference.
Structured interviews address such issues by standardizing the content, format, and order of the questions asked, and by providing algorithms that translate the information obtained into diagnostic conclusions consistent with the diagnostic framework being employed.
The use of
structured and semi-structured interviews is now the standard in research settings. These strategies, administered in
various ways, are also the hallmark of empirically driven clinical practice.
For example, many empirically oriented clinicians administer select sections of
these interviews
to confirm suspected diagnoses or to rule out alternative diagnoses,
particularly if time is not available to administer the full instrument.
Across inpatient, outpatient, and research settings, standardized,
structured diagnostic interviews are rated positively by both respondents and
interviewers (Suppiger et al., 2009). Here we discuss essential issues in their
evaluation and implementation.
Essential Issues
Criteria for Selecting an Interview
Several factors need to be considered when choosing a structured or a semi-structured interview. These factors are related not only to characteristics of the interview itself—such as its demonstrated psychometric qualities, degree of structure (i.e., highly structured vs. semi-structured, allowing for additional inquiry) and breadth of diagnostic coverage—but also to the context in which the interview is to be used.
Some of the
potential considerations, many of them consistently identified in reviews of this literature
(e.g., Blanchard & Brown, 1998), pertain to the content, format, and coverage
of the diagnostic interview; the level of expertise required for its
administration; and psychometric characteristics and the availability of support and
guidelines for its use.
No single
instrument best fits the requirements of all clinicians and researchers: When
selecting an interview, health care workers must consider their specific needs, priorities, and resources. For example, it
might be tempting to consider broad
diagnostic coverage,
excellent reliability, and validity to be essential criteria in all
instances; however, each of these features can carry drawbacks, and they are sometimes mutually exclusive.
Broad diagnostic
coverage (i.e., number of disorders assessed for) often comes at the cost of in-depth information
about specific diagnoses—the classic “bandwidth versus fidelity” dilemma (Widiger
& Frances, 1987). Reliability, or the reproducibility of results, is enhanced by
increasing the degree of structure of the interview (i.e., minimizing the flexibility permitted
in inquiry and format of administration).
However, this
inflexibility has the potential to undermine the validity of the diagnosis. Customized
questions posed by an experienced clinician may clarify responses that would otherwise lead to
erroneous diagnostic conclusions. Such issues warrant consideration.
Understanding Psychometric Characteristics of Diagnostic Interviews
Psychometric qualities are a foremost consideration in judging the worth of any measurement instrument and are equally important to consider when critically evaluating the diagnoses generated by structured and semi-structured interviews.
Reliability
The “reliability” of a diagnostic interview refers to its “replicability,” or the stability of its diagnostic outcomes. As already discussed, the historically poor reliability of psychiatric diagnoses was a principal basis for the development of structured interview techniques, and this issue continues to be of foremost importance. In light of this, researchers and clinicians should keep in mind that reliability is not an integral feature of a measurement instrument; it is a product of the context in which it is estimated. Thus, reliability estimates generalize only to other applications of the interview conducted under comparable circumstances (e.g., administration format, training of interviewers, population). Each study should therefore attempt to establish some form of reliability within its particular constraints. The same caveat applies to the issue of validity.
Inconsistency in
diagnoses can arise from multiple sources (see Ward, Beck, Mendelson, Mock,
& Erbaugh, 1962, for a seminal discussion), and two of these are
particularly worth noting. “Information variance” derives from different amounts and types of
information being used by different clinicians to arrive at the diagnosis. “Criterion
variance” arises from the same information being assembled in different ways by different clinicians to arrive at a diagnosis, and from the use of different standards for deciding when diagnostic criteria are met.
Another source of diagnostic inconsistency is “patient variance,” or variations
within the respondent that result in inconsistent reporting or clinical
presentation.
Two strategies are
principally used to test the reliability of diagnostic interviews. “Inter-rater”
(or “joint”) reliability
is the most common
reliability measure used in this area; here, two or more independent evaluators
usually rate identical interview material obtained through either direct
observation or videotape of a single assessment; in this case, there is only
one set of responses to be interpreted and rated.
In contrast,
“test–retest reliability” involves the administration of a diagnostic interview
on two independent occasions, usually separated by a maximum of 2 weeks and
often conducted by different evaluators. The less commonly used of the two strategies, it provides a more stringent test of reliability, because variability may be introduced by inconsistencies in styles of inquiry or in respondents’ self-reports. For example, whereas some respondents may attempt to be overly
self-consistent, others may be primed by the initial interview and report novel
information at retest. There is also a growing body of evidence that discrepant
reporting at retest is due to systematic “attenuation”—that is, respondents’
increased tendency to say “no” to items endorsed in the initial interview,
perhaps due to their learning more about the nature and purpose of the
interview as they gain experience (Lucas et al., 1999).
Interpretation of
reports of test–retest reliability is sometimes made difficult due to
variations in the methods employed. For example, if supplemental questions are permitted
in the follow-up interview to resolve diagnostic ambiguities (e.g., Helzer et
al., 1985), the question arises as to whether the resulting data should be considered evidence of test–retest reliability or, rather, a form of validity, as discussed later.
Interpretability of results is also made challenging by a lack of consistency in the usage of terms. For example, some authors reserve the label “test–retest design” for reliability studies in which re-administration at retest is conducted by the same rater (see Segal & Falk, 1998), whereas others apply it even when different raters are used (see Rogers, 2001).
Whether test–retest or
joint interview designs are employed, the statistic most commonly used to
report the degree of reliability observed is Cohen’s kappa. Different kappa
statistics can be used in different circumstances, such as when several diagnostic
categories are possible, and when multiple raters’ assessments are being compared.
The kappa index is superior to a measure such as percentage of agreement,
because it corrects for chance levels of agreement; this correction can lead to
highly variable kappa values due to differing base rates, however. Essentially,
the lower the base rate (or higher, if the base rate is greater than 50%), the
lower the kappa, posing a problem for researchers interested in phenomena where
the base rates are generally low, such as psychiatric diagnoses. For this
reason, another statistic, Yule’s Y, is sometimes used because of its
greater stability with low to medium base rates (Spitznagel & Helzer,
1985).
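To make the base-rate issue concrete, the sketch below (not from the source chapter; all cell counts are hypothetical) computes Cohen’s kappa and Yule’s Y for two-rater agreement tables on a single diagnosis. Both illustrative tables show 90% raw agreement, but kappa drops sharply when the base rate of the diagnosis is low, whereas Yule’s Y is comparatively stable.

```python
from math import sqrt

# Hypothetical 2x2 agreement table for one diagnosis:
#   a = both raters say "present", d = both say "absent",
#   b, c = the two kinds of disagreement.

def cohens_kappa(a, b, c, d):
    """Chance-corrected agreement between two raters."""
    n = a + b + c + d
    observed = (a + d) / n
    # Expected chance agreement, based on each rater's marginal rates
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed - expected) / (1 - expected)

def yules_y(a, b, c, d):
    """Yule's Y, reported to be more stable at low to medium base rates."""
    return (sqrt(a * d) - sqrt(b * c)) / (sqrt(a * d) + sqrt(b * c))

# Both tables show 90% raw agreement, but different base rates of the diagnosis.
print(cohens_kappa(40, 5, 5, 50), yules_y(40, 5, 5, 50))  # ~0.80, ~0.80 (base rate near 50%)
print(cohens_kappa(5, 5, 5, 85), yules_y(5, 5, 5, 85))    # ~0.44, ~0.61 (low base rate)
```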
Intra-class correlation
coefficients (ICCs) are also sometimes
reported as an index of diagnostic reliability; these are calculated based on variance
in ratings accounted for by differences among clinicians and are best used with
large samples. Kappa coefficients range in value from –1.00 (perfect disagreement) to 1.00 (perfect agreement); a kappa of 0 indicates agreement
no better or worse than chance.
Conventional standards
for interpreting kappa values suggest that values greater than .75 indicate good reliability, those between .50 and .75 indicate fair reliability, and those below .50 denote poor reliability (Spitzer, Fleiss, & Endicott,
1978). However, there is some disagreement regarding these benchmarks. Landis
and Koch (1977) proposed that kappas within the range of .21 to .40 suggest
fair agreement. In summary, there are no definitive guidelines for the
interpretation of the kappa statistic; however, researchers usually consider
kappas of .40 to .50 as the lower limits of acceptability for structured
interviews.
The reliability of a
diagnostic interview is determined by many factors. These include
- The clarity and nature of the questions asked and how well they are understood by the respondent,
- The degree and consistency of training and experience of interviewers,
- The conditions in which the interview is conducted,
- The type of reliability assessed (e.g., test–retest, interrater),
- The range and complexity of disorders under investigation, and
- The base rate (or prevalence) of the diagnosis in the target population.
Validity
The validity of a diagnostic interview is closely bound to the validity of the diagnostic framework it operationalizes. If the way a disorder is conceptualized by, for example, the text revision of the fourth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV-TR; American Psychiatric Association, 2000) is problematic, a structured interview that loyally adheres to this framework will be invalid, no matter how psychometrically sound it is.
Thus, the matter of
“validity” encompasses much larger issues than simple psychometrics and
pertains to the very conventions adopted in framing and defining mental
disorders (see Widiger & Clark, 2000, for a discussion).
Much early work focused
on the validity of alternate diagnostic frameworks and criteria (e.g., Feighner
criteria, or Research Diagnostic Criteria [RDC] vs. DSM; see Feighner et al.,
1972; Spitzer, Endicott, & Robins, 1978), or how well they captured the
core characteristics of mental disorders. This research focus, though not its underlying
premises, has been rendered somewhat obsolete by the widespread adoption of DSM
as the predominant psychiatric nosology.
Most contemporary
research on the validity of structured interviews revolves around the issue of
how well they approximate the DSM standard. Even presupposing the validity of
the diagnostic framework used, determining the “validity” of a diagnostic
instrument, or how accurately it assesses the conditions it purports to assess,
poses a considerable challenge for researchers. Primarily, this is because
there is no infallible criterion index (i.e., “gold standard”) with which interview-generated
diagnoses can be compared.
The conventional strategy
for investigating the validity of a measurement instrument consists of
comparing its outcomes to those of another source, known to be a valid index of
the concept in question. In the case of diagnostic interviews, other sources of
information about diagnoses might include expert diagnosis and/or clinical
interview, chart review, or other diagnostic
interviews or indexes. Therein lies the problem.
Other diagnostic
instruments may themselves suffer from psychometric weaknesses, and reliance on
clinical diagnosis as an ultimate criterion seems misguided, begging the
question of why researchers began to use structured interviews in the first
place. Indeed, Robins, Helzer, Croughan, and Ratcliff (1981) referred to such
procedures as “bootstrapping,” or using one imprecise method to improve the
classificatory accuracy of another.
In light of these
issues, Spitzer (1983) proposed the LEAD standard—Longitudinal observation by
Experts using All available Data—as an optimal method to establish the
procedural validity of a diagnostic instrument. “Procedural validity” in this
case refers to the congruence between diagnoses generated by structured interview
versus expert clinicians.
The LEAD standard, also
known as a “best estimate” diagnosis, incorporates data collected
longitudinally from interviews, chart review, and other informants. Expert
clinicians then use all available data to come to a consensus diagnosis, which
serves as the criterion measure.
Unfortunately, this
rigorous method is time-consuming and expensive to apply, and has not been
widely adopted in validation research to date (see Booth, Kirchner, Hamilton,
Harrell, & Smith, 1998, for an exception).
There are three
principal categories of procedures for determining a test’s validity:
content-related, construct-related, and criterion-related. In contemporary
research on diagnostic interviews, the chief focus has been on the latter
category, with several forms of particular relevance. Although rarely seen
outside of the diagnostic assessment literature, the term “procedural validity”
is generally used to denote the degree of congruence between diagnoses
generated by structured interview versus expert clinicians. “Concurrent
validity” refers to the degree of correlation between scores on the interview
in question and scores on another established instrument administered
simultaneously. “Predictive validity” denotes the degree to which ratings on
the interview are associated with a specified criterion over a time interval
(e.g., diagnostic status of the individual or intervening course of the
disorder, at follow-up).
There is some
inconsistency in the use of this terminology, however. It is at times difficult
to
determine the comparability of validation results, because researchers have reported them using different terms. On a more basic level, it has been suggested that the very term “validity” is often erroneously used in this literature (Malgady, Rogler, & Tryon, 1992), in part because of reference to data better regarded as evidence of a diagnostic interview’s reliability.
Statistics commonly
reported in the context of validity research include the following:
(1) “Specificity,” or the percentage of non-cases of a disorder that have been identified correctly (i.e., poor specificity results in over-detection);
(2) “Sensitivity,” or the percentage of true cases of a disorder that have been identified correctly (i.e., poor sensitivity results in under-detection); specificity and sensitivity are calculated relative to the total number of non-cases and cases, respectively, as determined by the criterion measure;
(3) “Positive” and
“negative predictive values,” or the probability that individuals positive or negative for
a diagnosis, according to the instrument being validated, are similarly
identified according to the criterion; and
(4) “Hit rate,” or the number
of correct classifications relative to the total number of classifications
made. The kappa statistic is commonly reported as a general index of agreement.
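As a brief illustration (hypothetical counts, not drawn from the source chapter), the following sketch computes these statistics from a 2×2 table that cross-tabulates interview diagnoses against a criterion diagnosis (e.g., a LEAD or best-estimate diagnosis).

```python
def validity_statistics(tp, fp, fn, tn):
    """Validity indices from interview vs. criterion counts:
    tp/fn = criterion-positive cases the interview did/did not detect,
    fp/tn = criterion-negative cases the interview did/did not flag."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),                # true cases identified correctly
        "specificity": tn / (tn + fp),                # non-cases identified correctly
        "positive_predictive_value": tp / (tp + fp),  # interview positives confirmed by criterion
        "negative_predictive_value": tn / (tn + fn),  # interview negatives confirmed by criterion
        "hit_rate": (tp + tn) / total,                # overall proportion of correct classifications
    }

# Hypothetical validation sample: 30 true cases detected, 5 missed,
# 10 false positives, 155 correctly classified non-cases.
print(validity_statistics(tp=30, fp=10, fn=5, tn=155))
```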
In summary, an
understanding of the ways in which reliability and validity are defined and
evaluated in the literature on psychiatric diagnosis is essential when appraising
the relative merits of the many standardized interviews currently available.
References
Antony, M. M., & Barlow, D. H. (Eds.). (2010). Handbook of assessment and treatment planning for psychological disorders (2nd ed.), “Structured and Semi-Structured Diagnostic Interviews.” New York: The Guilford Press.