
Saturday, April 13, 2019

Structured and Semi-Structured Diagnostic Interviews

By: Laura J. Summerfeldt, Patricia H. Kloosterman, and Martin M. Antony


The last three decades of the 20th century witnessed a surge of interest in the development and use of standardized structured and semi-structured interviews for the diagnosis of mental disorders. This activity was the culmination of several decades of growing dissatisfaction with the outcomes of traditional unstructured interviews.

By the 1970s it was recognized that by using such methods, clinicians commonly arrived at dissimilar diagnoses, and rates of diagnostic agreement were no better than could be expected by chance (see Beck, Ward, Mendelson, Mock, & Erbaugh, 1962; Spitzer & Fleiss, 1974). Clearly, this state of affairs hampered advancement of knowledge about psychopathology. Improving the reliability of psychiatric diagnoses became a research priority.

Structured and semi-structured interviews are specifically designed to minimize the sources of variability that render diagnoses unreliable. In traditional unstructured interviews, the clinician is entirely responsible for determining what questions to ask and how the resulting information is to be used in arriving at a diagnosis. Substantial inconsistency in outcomes is often the result, even when explicit diagnostic criteria are available for reference.

Structured interviews address such issues by standardizing the content, format, and order of the questions to be asked, and by providing algorithms for deriving, from the information obtained, diagnostic conclusions that are in accordance with the diagnostic framework being employed.

The use of structured and semi-structured interviews is now the standard in research settings. These strategies, administered in various ways, are also the hallmark of empirically driven clinical practice. For example, many empirically oriented clinicians administer select sections of these interviews to confirm suspected diagnoses or to rule out alternative diagnoses, particularly if time is not available to administer the full instrument.

Across inpatient, outpatient, and research settings, standardized, structured diagnostic interviews are rated positively by both respondents and interviewers (Suppiger et al., 2009). Here we discuss essential issues in their evaluation and implementation.

Essential Issues

Criteria for Selecting an Interview

Several factors need to be considered when choosing a structured or a semi-structured interview. These factors are related not only to characteristics of the interview itself—such as its demonstrated psychometric qualities, degree of structure (i.e., highly structured vs. semi-structured, allowing for additional inquiry) and breadth of diagnostic coverage—but also to the context in which the interview is to be used.

Some of the potential considerations, many of them consistently identified in reviews of this literature (e.g., Blanchard & Brown, 1998), pertain to the content, format, and coverage of the diagnostic interview; the level of expertise required for its administration; and psychometric characteristics and the availability of support and guidelines for its use.

No single instrument best fits the requirements of all clinicians and researchers: when selecting an interview, clinicians and researchers must consider their specific needs, priorities, and resources. For example, it might be tempting to consider broad diagnostic coverage, excellent reliability, and validity to be essential criteria in all instances; however, each of these features has potential drawbacks, and they can sometimes be mutually exclusive.

Broad diagnostic coverage (i.e., number of disorders assessed for) often comes at the cost of in-depth information about specific diagnoses—the classic “bandwidth versus fidelity” dilemma (Widiger & Frances, 1987). Reliability, or the reproducibility of results, is enhanced by increasing the degree of structure of the interview (i.e., minimizing the flexibility permitted in inquiry and format of administration).

However, this inflexibility has the potential to undermine the validity of the diagnosis. Customized questions posed by an experienced clinician may clarify responses that would otherwise lead to erroneous diagnostic conclusions. Such issues warrant consideration.

Understanding Psychometric Characteristics of Diagnostic Interviews

Psychometric qualities are a foremost consideration in judging the worth of any measurement instrument and are equally important to consider when critically evaluating the diagnoses generated by structured and semi-structured interviews.

Reliability

The “reliability” of a diagnostic interview refers to its “replicability,” or the stability of its diagnostic outcomes. As already discussed, the historically poor reliability of psychiatric diagnoses was a principal basis for the development of structured interview techniques, and this issue continues to be of foremost importance.

In light of this, researchers and clinicians should keep in mind that reliability is not an integral feature of a measurement instrument; it is a product of the context in which it is evaluated. Thus, reliability estimates generalize only to other applications of the interview that occur under comparable circumstances (e.g., administration format, training of interviewers, population). Each study should attempt to establish some form of reliability within its particular constraints. The same caveat applies to the issue of validity.



Inconsistency in diagnoses can arise from multiple sources (see Ward, Beck, Mendelson, Mock, & Erbaugh, 1962, for a seminal discussion), and two of these are particularly worth noting. “Information variance” derives from different amounts and types of information being used by different clinicians to arrive at the diagnosis. “Criterion variance” arises from the same information being assembled in different ways by different clinicians to arrive at a diagnosis, and from the use of different standards for deciding when diagnostic criteria are met. Another source of diagnostic inconsistency is “patient variance,” or variations within the respondent that result in inconsistent reporting or clinical presentation.

Two strategies are principally used to test the reliability of diagnostic interviews. “Inter-rater” (or “joint”) reliability is the most common reliability measure used in this area; here, two or more independent evaluators usually rate identical interview material obtained through either direct observation or videotape of a single assessment; in this case, there is only one set of responses to be interpreted and rated.

In contrast, “test–retest reliability” involves the administration of a diagnostic interview on two independent occasions, usually separated by a maximum of 2 weeks and often conducted by different evaluators. The less commonly used of the two strategies, it is a more stringent test of reliability, because variability is potentially introduced by inconsistencies in styles of inquiry or in respondents’ self-reports. For example, whereas some respondents may attempt to be overly self-consistent, others may be primed by the initial interview and report novel information at retest. There is also a growing body of evidence that discrepant reporting at retest is due to systematic “attenuation,” that is, respondents’ increased tendency to say “no” to items endorsed in the initial interview, perhaps due to their learning more about the nature and purpose of the interview as they gain experience (Lucas et al., 1999).

Interpretation of reports of test–retest reliability is sometimes made difficult by variations in the methods employed. For example, if supplemental questions are permitted in the follow-up interview to resolve diagnostic ambiguities (e.g., Helzer et al., 1985), the question arises as to whether such data should be considered evidence of test–retest reliability or, rather, a form of validity, as discussed later.

Interpretability of results is also made challenging by a lack of consistency in the usage of these terms. For example, some authors describe reliability studies as having a test–retest design only if re-administration at retest is conducted by the same rater (see Segal & Falk, 1998), whereas others apply the term even when different raters are used (see Rogers, 2001).

Whether test–retest or joint interview designs are employed, the statistic most commonly used to report the degree of reliability observed is Cohen’s kappa. Different kappa statistics can be used in different circumstances, such as when several diagnostic categories are possible, and when multiple raters’ assessments are being compared. The kappa index is superior to a measure such as percentage of agreement, because it corrects for chance levels of agreement; this correction can lead to highly variable kappa values due to differing base rates, however. Essentially, the lower the base rate (or higher, if the base rate is greater than 50%), the lower the kappa, posing a problem for researchers interested in phenomena where the base rates are generally low, such as psychiatric diagnoses. For this reason, another statistic, Yule’s Y, is sometimes used because of its greater stability with low to medium base rates (Spitznagel & Helzer, 1985).
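
To make the base-rate issue concrete, the following sketch (a minimal Python illustration with invented counts) computes percentage agreement, Cohen’s kappa, and Yule’s Y for a hypothetical 2 × 2 table of two raters’ diagnostic decisions. With each rater diagnosing only 6 of 100 respondents as positive, raw agreement is high (.92) while kappa is far lower (about .29), and Yule’s Y falls in between.

    import math

    def agreement_indices(a, b, c, d):
        """Agreement indices for a 2 x 2 table of two raters' diagnostic decisions.

        a = both raters judge the disorder present, d = both judge it absent,
        b and c = the two kinds of disagreement (all counts invented for illustration).
        """
        n = a + b + c + d
        p_observed = (a + d) / n  # raw percentage agreement
        # Chance-expected agreement, given each rater's base rate of positive diagnoses
        p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
        kappa = (p_observed - p_chance) / (1 - p_chance)
        # Yule's Y, reported to be more stable than kappa at low to medium base rates
        yule_y = (math.sqrt(a * d) - math.sqrt(b * c)) / (math.sqrt(a * d) + math.sqrt(b * c))
        return p_observed, kappa, yule_y

    # Low base rate: each rater diagnoses only 6 of 100 respondents as positive.
    print(agreement_indices(a=2, b=4, c=4, d=90))  # approximately (0.92, 0.29, 0.54)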

Intra-class correlation coefficients (ICCs) are also sometimes reported as an index of diagnostic reliability; these are calculated based on variance in ratings accounted for by differences among clinicians and are best used with large samples. Kappa coefficients range in value from –1.00 (perfect disagreement) to 1.00 (perfect agreement); a kappa of 0 indicates agreement no better or worse than chance.
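
As an illustration of how an ICC can be obtained, the sketch below computes one simple form (a one-way random-effects model) from an invented matrix of dimensional ratings; this is only an example under stated assumptions, and dedicated statistical software would normally be used in practice.

    import numpy as np

    def icc_oneway(ratings):
        """One-way random-effects intra-class correlation.

        ratings: 2-D array with rows = cases and columns = raters
        (all values invented for illustration).
        """
        ratings = np.asarray(ratings, dtype=float)
        n, k = ratings.shape
        case_means = ratings.mean(axis=1)
        grand_mean = ratings.mean()
        # Mean squares between cases and within cases (rater differences count as error here)
        ms_between = k * ((case_means - grand_mean) ** 2).sum() / (n - 1)
        ms_within = ((ratings - case_means[:, None]) ** 2).sum() / (n * (k - 1))
        return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

    # Six cases rated for severity by three interviewers (hypothetical data).
    ratings = [[4, 5, 4], [2, 2, 3], [5, 5, 5], [1, 2, 1], [3, 3, 4], [4, 4, 4]]
    print(round(icc_oneway(ratings), 2))  # approximately 0.89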

Conventional standards for interpreting kappa values suggest that values greater than .75 indicate good reliability, those between .50 and .75 indicate fair reliability, and those below .50 denote poor reliability (Spitzer, Fleiss, & Endicott, 1978). However, there is some disagreement regarding these benchmarks. Landis and Koch (1977) proposed that kappas within the range of .21 to .40 suggest fair agreement. In summary, there are no definitive guidelines for the interpretation of the kappa statistic; however, researchers usually consider kappas of .40 to .50 as the lower limits of acceptability for structured interviews.

The reliability of a diagnostic interview is determined by many factors. These include
• The clarity and nature of the questions asked and how well they are understood by the respondent,
• The degree and consistency of training and experience of interviewers,
• The conditions in which the interview is conducted,
• The type of reliability assessed (e.g., test–retest, inter-rater),
• The range and complexity of disorders under investigation, and
• The base rate (or prevalence) of the diagnosis in the target population.

Validity

The validity of a diagnostic interview is closely bound to the validity of the diagnostic framework it operationalizes. If the way a disorder is conceptualized by, for example, the text revision of the fourth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV-TR; American Psychiatric Association, 2000) is problematic, a structured interview that loyally adheres to this framework will be invalid, no matter how psychometrically sound it is.

Thus, the matter of “validity” encompasses much larger issues than simple psychometrics and pertains to the very conventions adopted in framing and defining mental disorders (see Widiger & Clark, 2000, for a discussion).

Much early work focused on the validity of alternate diagnostic frameworks and criteria (e.g., Feighner criteria, or Research Diagnostic Criteria [RDC] vs. DSM; see Feighner et al., 1972; Spitzer, Endicott, & Robins, 1978), or how well they captured the core characteristics of mental disorders. This research focus, though not its underlying premises, has been rendered somewhat obsolete by the widespread adoption of DSM as the predominant psychiatric nosology.

Most contemporary research on the validity of structured interviews revolves around the issue of how well they approximate the DSM standard. Even presupposing the validity of the diagnostic framework used, determining the “validity” of a diagnostic instrument, or how accurately it assesses the conditions it purports to assess, poses a considerable challenge for researchers. Primarily, this is because there is no infallible criterion index (i.e., “gold standard”) with which interview-generated diagnoses can be compared.

The conventional strategy for investigating the validity of a measurement instrument consists of comparing its outcomes to those of another source known to be a valid index of the concept in question. In the case of diagnostic interviews, other sources of information about diagnoses might include expert diagnosis and/or clinical interview, chart review, or other diagnostic interviews or indexes. Therein lies the problem.

Other diagnostic instruments may themselves suffer from psychometric weaknesses, and reliance on clinical diagnosis as an ultimate criterion seems misguided, raising the question of why researchers began to use structured interviews in the first place. Indeed, Robins, Helzer, Croughan, and Ratcliff (1981) referred to such procedures as “bootstrapping,” or using one imprecise method to improve the classificatory accuracy of another.

In light of these issues, Spitzer (1983) proposed the LEAD standard—Longitudinal observation by Experts using All available Data—as an optimal method to establish the procedural validity of a diagnostic instrument. “Procedural validity” in this case refers to the congruence between diagnoses generated by structured interview versus expert clinicians.

The LEAD standard, also known as a “best estimate” diagnosis, incorporates data collected longitudinally from interviews, chart review, and other informants. Expert clinicians then use all available data to come to a consensus diagnosis, which serves as the criterion measure. Unfortunately, this rigorous method is time-consuming and expensive to apply, and has not been widely adopted in validation research to date (see Booth, Kirchner, Hamilton, Harrell, & Smith, 1998, for an exception).

There are three principal categories of procedures for determining a test’s validity: content-related, construct-related, and criterion-related. In contemporary research on diagnostic interviews, the chief focus has been on the latter category, with several forms of particular relevance. Although rarely seen outside of the diagnostic assessment literature, the term “procedural validity” is generally used to denote the degree of congruence between diagnoses generated by structured interview versus expert clinicians. “Concurrent validity” refers to the degree of correlation between scores on the interview in question and scores on another established instrument administered simultaneously. “Predictive validity” denotes the degree to which ratings on the interview are associated with a specified criterion over a time interval (e.g., diagnostic status of the individual or intervening course of the disorder, at follow-up).

There is some inconsistency in the use of this terminology, however. It is at times difficult to determine the comparability of validation results, because researchers have reported them using different terms. On a more basic level, it has been suggested that the very term “validity” is often erroneously used in this literature (Malgady, Rogler, & Tryon, 1992), in part because of reference to data better regarded as evidence of a diagnostic interview’s reliability.

Statistics commonly reported in the context of validity research include the following:
(1) “Specificity,” or the percentage of non-cases of a disorder that have been identified correctly (i.e., poor specificity results in over-detection);
(2) “Sensitivity,” or the percentage of true cases of a disorder that have been identified correctly (i.e., poor sensitivity results in under-detection); specificity and sensitivity figures are proportional to the total numbers of non-cases and cases, respectively, identified by the instrument;
(3) “Positive” and “negative predictive values,” or the probability that individuals positive or negative for a diagnosis, according to the instrument being validated, are similarly identified according to the criterion; and
(4) “Hit rate,” or the number of correct classifications relative to the total number of classifications made. The kappa statistic is also commonly reported as a general index of agreement.
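
As a concrete illustration of these indices, the brief sketch below (with invented counts) computes them from a hypothetical 2 × 2 table comparing interview-generated diagnoses against a criterion diagnosis such as a LEAD consensus.

    def validity_indices(tp, fp, fn, tn):
        """Classification indices for interview diagnoses compared against a criterion.

        tp: interview positive, criterion positive    fp: interview positive, criterion negative
        fn: interview negative, criterion positive    tn: interview negative, criterion negative
        (all counts invented for illustration)
        """
        sensitivity = tp / (tp + fn)                 # true cases correctly identified
        specificity = tn / (tn + fp)                 # true non-cases correctly identified
        ppv = tp / (tp + fp)                         # positive predictive value
        npv = tn / (tn + fn)                         # negative predictive value
        hit_rate = (tp + tn) / (tp + fp + fn + tn)   # correct classifications overall
        return sensitivity, specificity, ppv, npv, hit_rate

    # Hypothetical validation sample of 200 respondents, 40 of them criterion-positive.
    print(validity_indices(tp=32, fp=12, fn=8, tn=148))  # (0.80, 0.925, ~0.73, ~0.95, 0.90)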

In summary, an understanding of the ways in which reliability and validity are defined and evaluated in the literature on psychiatric diagnosis is essential when appraising the relative merits of the many standardized interviews currently available.

References

Summerfeldt, L. J., Kloosterman, P. H., & Antony, M. M. (2010). Structured and semi-structured diagnostic interviews. In M. M. Antony & D. H. Barlow (Eds.), Handbook of assessment and treatment planning for psychological disorders (2nd ed.). The Guilford Press.
