The format of the original Hamilton Rating Scale for Depression (HAM-D) was unstructured: only general instructions were provided for rating individual items. Over the years, a number of modified versions of the HAM-D have been proposed. They differ not only in the number of items, but also in modalities of administration. Structured versions, including item definitions, anchor points and semi-structured or structured interview questions, were developed. This comprehensive review was conducted to examine the clinimetric properties of the different versions of the HAM-D. The aim was to identify the HAM-D versions that best display the clinimetric properties of reliability, validity, and sensitivity to change. The search was conducted on MEDLINE, Scopus, Web of Science, and PubMed, and yielded a total of 35,473 citations, but only the most representative studies were included. The structured versions of the HAM-D were found to display the highest inter-rater and test-retest reliability. The Clinical Interview for Depression and the 6-item HAM-D showed the highest sensitivity in differentiating active treatment from placebo. The findings indicate that the HAM-D is a valid and sensitive clinimetric index, which should not be discarded in view of obsolete and not clinically relevant psychometric criteria. The HAM-D, however, requires an informed use: unstructured forms should be avoided and the type of HAM-D version that is selected should be specified in the registration of the study protocol and in the methods of the trial.
The Hamilton Rating Scale for Depression (HAM-D) is the most widely used clinician-rated scale for the assessment of depression severity in patients who have already been diagnosed with a depressive disorder. The first version of the HAM-D was published by Max Hamilton in 1960 and consisted of 21 items, the HAM-D21. The number of citations of this version of the scale exceeds 21,000 on Scopus. Hamilton himself recommended, however, using only the first 17 items of the HAM-D21, since the last 4 symptoms (i.e., diurnal variation, depersonalization/derealization, paranoid and obsessional/compulsive symptoms) were either not considered part of the disease, relatively uncommon, or not considered features related to depression severity [2, 3]. Over the years, a number of modifications of the scale have been proposed and several versions of the HAM-D have been developed and come into use [4, 5].
Different Versions of the HAM-D
The versions of the HAM-D developed over the years differ mainly in two respects: the number and content of items, and the modality of administration.
The versions of the HAM-D that are available differ in the number, sequence, and wording of items, as well as in the scoring procedure. The items of the most widely used versions of the HAM-D are reported in Table 1. Rosenthal and Klerman proposed the first modified version of the HAM-D, which included an additional item on worthlessness. Paykel et al. introduced substantial modifications with the development of the Clinical Interview for Depression (CID), an expanded version of the HAM-D encompassing 36 items [8, 9]. Kovacs et al. referred to a 24-item version of the HAM-D, the HAM-D24, which, in addition to worthlessness, also contained symptoms of helplessness and hopelessness. Gelenberg et al. used an expanded version of the HAM-D, consisting of 27 items, the HAM-D27, which covered symptoms of atypical depression. Similarly, Thase et al. introduced another expanded version of the HAM-D to evaluate reverse neurovegetative symptoms such as hypersomnia, increased appetite, and weight gain. Shortened versions of the HAM-D were also introduced, as pioneered by Bech and his research group [13‒15]. They demonstrated the validity of the 6-item version of the HAM-D, the HAM-D6, for the assessment of the core (central) symptoms of depression [13‒15]. Maier and Philipp reported similar results: they identified a 6-item version of the HAM-D, which had 5 items in common with the version proposed by Bech et al. [13‒15]. In a study from 1993, Gibbons et al. proposed another short version of the HAM-D, including 8 items, which were found to be a unidimensional measure of depression severity.
Unstructured and Structured Versions
Modalities of administration included unstructured versions with ratings only [2, 3, 11], devoid of interview guides and anchor points, and structured versions with at least anchor points, in some cases supplemented by semi-structured [8, 19, 20] or structured interview questions [21‒28].
The format of the original version of the HAM-D, the HAM-D21 , was unstructured: Max Hamilton provided only general instructions for rating the individual items of the HAM-D21. As he stated , the interview process “depends entirely on the skill of the interviewer in eliciting the necessary information”. In other terms, the unstructured versions of the HAM-D, particularly the HAM-D21 and the HAM-D17 [2, 3], exclusively rely on expertise and clinical judgment of raters.
The version developed by Bech et al. , the HAM-D23, explicitly provided item definitions and “operational criteria” (i.e., anchor points) for rating each item of the HAM-D. Anchor points (most of them ranging from 0 to 4) were introduced for detecting the presence and severity of depressive symptoms .
In the mid-sixties, Eugene S. Paykel started development of the CID  to provide a comprehensive assessment of depression. The CID includes item definitions, specific anchor points rated on 7-point scales, and a semi-structured interview, with specified initial questions for each item, which may be modified if circumstances necessitate, and further probing whenever a symptom is present [8, 9]. Morriss et al.  published a structured version of the HAM-D with semi-structured interview, which consisted of 17 items and included anchor points for rating the severity and frequency of depressive symptoms. Moreover, Timmerby et al.  introduced a structured version of the HAM-D6, which included anchor points and interview guides for evaluating the severity of core symptoms of depression.
Further versions of the HAM-D including structured interviews were also developed [21‒28] with the purpose of improving inter-rater reliability, as well as individual item reliability. In the appendix of a book on interpersonal psychotherapy, Klerman et al.  introduced the first structured version of the HAM-D, the Interview Format for the HAM-D21. Miller et al.  developed the Modified Hamilton Rating Scale for Depression (MHRSD), a 25-item structured interview guide, which included additional items for the assessment of cognitive and melancholic symptoms of depression and anchor points . Whisman et al.  published a 17-item version of the HAM-D with structured interview, the Diagnostic Interview Schedule (DIS)-HRSD, which they integrated with the National Institute of Mental Health DIS . Potts et al.  modified the original unstructured version of the HAM-D17  to be suitable for a structured interview, the SI-HDRS. Williams [22, 25] published the Structured Interview Guide for the HAM-D21, the SIGH-D, which consisted of a set of standard questions and anchor points for rating the frequency and severity of depressive symptoms. Some years later, Williams et al.  published the Structured Interview Guide for Seasonal Affective Disorders, the SIGH-SAD, a 29-item clinician rated scale, which was specifically developed for the assessment of symptoms of atypical depression (e.g., hyperphagia, hypersomnia). A revised version of the SIGH-D, in which 3 items were added and anchor points were provided for assessing symptoms of hopelessness, helplessness, and worthlessness, was developed by Moberg et al. . Williams et al.  also introduced the GRID-HAM-D, a structured interview guide, in which the severity and frequency of depressive symptoms are rated separately. The GRID-HAM-D scoring system also included the following modifications: the item content was further clarified and anchor point descriptions were provided with clinical examples at each severity level .
Reviews on the HAM-D
A number of reviews were conducted on the HAM-D [1, 20, 32‒48]. Hedlund and Vieweg  performed the first one in 1979. However, most of these reviews focused on the psychometric properties of the HAM-D [33, 38, 40, 42, 43, 48]. The authors of such reviews did not analyze the different versions of the HAM-D and, most importantly, they did not evaluate the clinimetric properties of these rating scales [38‒40, 46, 48]. To the best of our knowledge, only Williams  conducted a comprehensive review to evaluate the validity of the different versions of the HAM-D. However, also in this case, the author did not focus on the clinimetric features of these rating scales . Timmerby et al.  conducted a systematic review on the clinimetric properties of the HAM-D, in which they compared the HAM-D6 and the HAM-D17 without differentiating between unstructured and structured versions. There is, therefore, a need for an updated and comprehensive work, particularly to address the clinimetric properties of the various versions of the HAM-D.
From Psychometrics to Clinimetrics
It was Feinstein [50‒52] who coined the term “clinimetrics” to introduce an innovative approach, which has been defined as the science of clinical measurements . Such a clinically based evaluation method is particularly useful for testing a number of measurement properties (e.g., sensitivity, scalability, clinical validity), which do not find room in the traditional psychometric model [53‒55]. Homogeneity of components, as measured by statistical analyses such as Cronbach’s alpha coefficients and factor analyses, has been considered the most important requirement for a psychometric rating scale [53‒60]. In the psychometric model, redundant items (i.e., questions which are highly correlated to each other) are needed for ensuring such homogeneity of components . However, the same properties that give a scale a high score for homogeneity may obscure its clinical utility, particularly its ability to detect change [55, 61]. In the clinimetric approach, homogeneity of components is not needed and what matters is the sensitivity of the rating scale, that is, its ability to discriminate between active therapy and placebo, to differentiate patients from healthy controls, to discriminate between different groups of patients, to differentiate the severity of symptoms (e.g., certain symptoms may be more troublesome or incapacitating than others), to detect clinically relevant changes in drug or psychotherapy trials [53, 55, 62‒65]. Such clinimetric properties are particularly important when treatment effects are small and in the evaluation of sub-clinical symptoms .
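The point about redundant items and homogeneity can be made concrete with a small calculation. The following is a minimal sketch, not taken from any of the reviewed studies: the item scores are hypothetical, the function name `cronbach_alpha` is our own, and the formula is the standard one for Cronbach's alpha using sample variances. It shows that adding an item that merely duplicates an existing one raises alpha without adding clinical information.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of items, each a list of per-patient scores."""
    k = len(items)
    n = len(items[0])
    # Total score of each patient across all items.
    totals = [sum(item[p] for item in items) for p in range(n)]
    sum_item_variances = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))


# Hypothetical scores for 6 patients on 3 items (illustrative only).
depressed_mood = [0, 1, 2, 3, 4, 2]
guilt = [1, 0, 2, 2, 3, 1]
insomnia = [2, 0, 1, 2, 1, 2]

alpha_base = cronbach_alpha([depressed_mood, guilt, insomnia])

# A redundant item: an exact copy of an existing one.
alpha_redundant = cronbach_alpha(
    [depressed_mood, guilt, insomnia, list(depressed_mood)]
)

print(f"alpha without redundant item: {alpha_base:.2f}")
print(f"alpha with redundant item:    {alpha_redundant:.2f}")
```

In this toy example the duplicated item increases alpha even though it cannot improve the ability of the scale to detect change, which is the concern raised by the clinimetric perspective.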
The present comprehensive review of studies was conducted to describe and evaluate the clinimetric properties of different versions of the HAM-D. The major aim of this review was to identify the versions of the HAM-D that best display the clinimetric properties of reliability, validity, and sensitivity to change in the assessment of depression.
In view of the amount of literature on this topic, the present review cannot be systematic, but will analyze the most representative studies. Since the CID  presents substantial differences from the HAM-D and its clinimetric properties have already been analyzed in a previous review study , comparisons between the HAM-D scales and the CID will be briefly described in the results section, and then discussed.
A comprehensive search of the literature was conducted on the following databases: MEDLINE, Scopus, Web of Science, and PubMed. Each database was searched from inception to July 23, 2019. A manual search of the literature was also performed and reference lists of the retrieved articles were examined for further studies not yet identified. The following search terms were used: “Hamilton Rating Scale for Depression,” “HRSD,” “Hamilton Depression Rating Scale,” “HAM-D.” They were combined using the Boolean operator “OR.”
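The combination of search terms described above can be sketched as follows. This is only an illustration of the Boolean logic: the helper name `build_query` is hypothetical, and database-specific syntax (field tags, date limits) is not reported in the text and is therefore omitted.

```python
# Search terms as listed in the text.
SEARCH_TERMS = [
    "Hamilton Rating Scale for Depression",
    "HRSD",
    "Hamilton Depression Rating Scale",
    "HAM-D",
]


def build_query(terms):
    """Quote each term as a phrase and join the phrases with the OR operator."""
    return " OR ".join(f'"{term}"' for term in terms)


query = build_query(SEARCH_TERMS)
print(query)
```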
We selected and analyzed only those studies that focused on the clinimetric properties of the HAM-D. To be included in the review, studies had to meet the following criteria:
English-language article published in a peer-reviewed journal.
The study was published as a full-text article.
The article was an original study (e.g., research article, meta-analysis).
The study evaluated the clinimetric properties of the HAM-D or used a clinimetric approach to analyze the clinical utility of this rating scale.
Study Selection Procedure
The first 2 authors (D.C. and C.P.) independently performed the search, screened titles and abstracts, selected studies, evaluated the full-text of articles appearing potentially relevant, and extracted data from studies meeting the eligibility criteria. In case of disagreement, a consensus was reached through discussion with the last author (J.G.).
The initial search yielded a total of 35,473 articles, but only those studies that best displayed the clinimetric properties of the HAM-D were included in the review. Accordingly, the inter-rater and test-retest reliability, the discriminant validity, the sensitivity to change, the scalability, and the concurrent validity of the various versions of the HAM-D will be examined in detail.
Trajković et al.  conducted a meta-analytic study to examine the inter-rater reliability of the HAM-D. They found a pooled mean intra-class correlation coefficient (ICC) of 0.92, indicating an excellent level of inter-rater reliability. Unfortunately, they did not discriminate between the different versions of the HAM-D .
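For readers less familiar with the ICC, a minimal sketch of how such a coefficient may be computed from a patients-by-raters table is given below. The text does not state which form of the ICC the individual studies used; the sketch assumes the two-way random-effects, absolute-agreement, single-rater form, often labeled ICC(2,1). The function name `icc_2_1` and the ratings are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table of ratings: one row per patient, one column per rater."""
    n = len(ratings)        # number of patients
    k = len(ratings[0])     # number of raters
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical total scores given by 2 raters to 5 patients (illustrative only).
example_ratings = [[22, 23], [15, 14], [30, 28], [9, 10], [18, 18]]
example_icc = icc_2_1(example_ratings)
print(f"ICC(2,1) = {example_icc:.3f}")
```

Because this form measures absolute agreement, a rater who scores every patient systematically higher than a colleague lowers the coefficient even when the rank order of patients is identical.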
Hamilton examined the inter-rater reliability of his unstructured version of the HAM-D17. He found that the correlation between 2 raters who independently evaluated 10 depressed patients was 0.84. He used skilled interviewers who were experienced in the use of the HAM-D17. In a study performed in 1975, Bech et al. adopted the same procedure: 2 experienced psychiatrists independently scored the unstructured HAM-D17. The inter-rater reliability was found to be excellent, with a Spearman’s correlation coefficient of 0.94. O’Hara and Rehm reported similar results. They demonstrated that the skill level or expertise of the rater positively affected the inter-rater reliability of the HAM-D17: agreement was high for expert raters (ICC of 0.91) but lower for novice raters (ICC of 0.76). Many subsequent studies confirmed this trend. The inter-rater reliability of the unstructured versions of the HAM-D, particularly that of the HAM-D21 and of the HAM-D17, was largely influenced by the clinical experience of raters: the greater the expertise of raters, the higher the inter-rater reliability [67‒81].
Unstructured versus Structured Versions
Compared to the unstructured HAM-D17, the inter-rater reliability of those versions including anchor points and interview guides was significantly higher, regardless of whether interviewers were experienced clinicians or novice raters [19, 21, 22, 27, 82]. Potts et al. evaluated the inter-rater reliability of their structured version of the HAM-D17, the SI-HDRS, and found that the level of agreement between inexperienced raters was excellent (ICC of 0.92). Moberg et al. compared the unstructured HAM-D24 with their structured version of this rating scale. They found that the structured HAM-D24 produced significantly higher levels of inter-rater reliability than the unstructured one. Williams et al. conducted a similar study. They compared the inter-rater reliability of the GRID-HAM-D with that of the unstructured version of the HAM-D17 and found that the ICC for the unstructured HAM-D17 (0.78) was significantly lower (p = 0.001) than the ICC for the structured GRID-HAM-D (0.95). Tabuse et al. provided further support for the inter-rater reliability of the GRID-HAM-D. They showed that this structured version of the HAM-D displayed excellent inter-rater reliability for both inexperienced and experienced raters before and after training.
Structured Versions with Anchor Points Only
The majority of studies examined the inter-rater reliability of the structured version of the HAM-D, which used the anchor points introduced by Bech et al. . Kørner et al.  conducted one of the first studies in this regard. They found an ICC of 0.83, demonstrating a high level of agreement between raters . Koenig et al.  and Fuglum et al.  reported similar results. The former authors  found inter-rater correlations between 0.93 and 0.98, while the latter researchers  showed an ICC of 0.81, in both cases indicating a high level of inter-rater reliability. Bent-Hansen and Bech  performed a similar study and found an ICC of 0.95, indicating a high level of inter-rater reliability.
Structured Versions with Semi-Structured Interview
Paykel et al.  examined the inter-rater reliability of the CID and found a Pearson’s correlation coefficient of 0.81, indicating a satisfactory level of inter-rater agreement. The inter-rater reliability of the CID was evaluated in other studies, yielding similar results: the agreement between raters was high, with mean correlation coefficients ranging from 0.81  to 0.82 . Morriss et al.  evaluated the inter-rater reliability of a structured version of the HAM-D17 and found ICCs ranging from 0.89 to 0.96, indicating excellent inter-rater reliability.
Structured Versions with Structured Interview
Many studies examined the inter-rater reliability of the different structured versions of the HAM-D. Whisman et al.  showed that the inter-rater reliability of their structured version of the HAM-D, the DIS-HRSD, was satisfactory. Specifically, they found an ICC of 0.84, indicating a high level of agreement between raters . Akdemir et al.  examined the inter-rater reliability of the SIGH-D, the structured interview guide published by Williams . They found Pearson correlation coefficients ranging from 0.87 to 0.98, indicating a high level of agreement between raters . Rohan et al.  evaluated the inter-rater reliability of the SIGH-SAD, another structured interview developed by Williams and her research group . They found ICCs ranging from 0.92 to 0.96, indicating an excellent level of inter-rater reliability . As to the inter-rater reliabilities of individual items of the HAM-D structured versions, authors reported similar results [21, 22, 25, 27]. The inter-rater reliabilities of the individual items included in the structured versions of the HAM-D were significantly higher than those obtained in studies, in which anchor points and interview questions were not used [21, 22, 25, 27].
Test-retest reliability refers to the ability of an assessment instrument to produce the same results over time, while it is assumed that the clinical dimension under examination has remained unchanged [91, 92]. In clinimetrics, such a measurement property is not considered to be as important as other clinimetric properties since the rating scale is primarily intended to be used for detecting treatment changes [9, 55].
Cicchetti and Prusoff  were among the first authors to examine the test-retest reliability of an unstructured 22-item version of the HAM-D. They found poor to fair levels of test-retest reliability for most of the HAM-D items . Craig et al.  evaluated the test-retest reliability of the unstructured version of the HAM-D17 in a small sample of inpatients with schizophrenia. Using raters who had previously received training in the use of the HAM-D17, they showed high test-retest reliability for the HAM-D17 total score, with a correlation coefficient of 0.65 . However, analyzing the individual items of the HAM-D17, poor test-retest reliability was found for items on guilt feelings, somatic symptoms (gastrointestinal), genital symptoms, hypochondriasis, and weight loss .
Williams demonstrated that the use of her structured version of the HAM-D, the SIGH-D, significantly improved the test-retest reliability for most of the SIGH-D items. More specifically, compared to the study by Cicchetti and Prusoff, Williams showed that all but 3 (late insomnia, psychomotor retardation, and agitation) of the 21 SIGH-D items had better test-retest reliability. Using their structured version of the HAM-D17, the SI-HDRS, Potts et al. reported similar results: all but 4 (i.e., loss of weight, insight, psychomotor agitation, and psychomotor retardation) of the 17 SI-HDRS items had satisfactory test-retest reliability. Other studies demonstrated that the different structured versions of the HAM-D produced uniformly higher test-retest reliabilities than the unstructured ones. Akdemir et al. investigated the test-retest reliability of the SIGH-D and found a correlation coefficient of 0.85, indicating high test-retest reliability for the SIGH-D total scores. They also showed that the use of this structured version improved the test-retest reliability for all but one (i.e., loss of weight) of the 17 HAM-D items. Shankman and Klein examined the test-retest reliability of the MHRSD, the structured interview of the HAM-D developed by Miller et al. They found that the MHRSD had excellent test-retest reliability, with an ICC of 0.96. Williams et al. evaluated the test-retest reliability of their structured GRID-HAM-D. They showed a satisfactory level of test-retest reliability, with an ICC of 0.81.
According to the clinimetric approach, the validity of a rating scale is established using the global assessment of the experienced clinician as the main index of validity [55, 91, 94]. The aim is to determine whether the items included in the rating scale reflect the clinician’s judgment of severity of the clinical condition under assessment [91, 95].
Chipman and Paykel found that specific individual items of the CID correlated with the clinician’s global assessment of depression severity. They reported that patients rated as more severely depressed by clinicians were those with higher scores on the following items of the CID: psychomotor retardation, depressive delusions, agitation, guilt, initial insomnia, hopelessness, suicidal tendencies, verbal complaint of depressed feelings, observed appearance of depression, and less short-term reactivity of mood. Bech et al. conducted a similar study to evaluate the validity of the unstructured version of the HAM-D17. They demonstrated that only 6 of the 17 items of the HAM-D (those included in the HAM-D6) reflected the clinician’s evaluation of depression severity. More specifically, only the items of being depressed, having guilt feelings, experiencing lack of interest and fatigue, displaying psychomotor retardation, and suffering from psychic anxiety corresponded to symptoms used by experienced clinicians to formulate a global evaluation of depression severity. Therefore, the total score of the HAM-D6 was found to be strongly associated with the clinician’s global impression of depression severity. Further studies showed that the HAM-D6 captured core symptoms of depression more sensitively than the HAM-D17, which actually covers a mixture of anxiety and depressive symptoms, including side effects of pharmacological treatments such as nausea, weight gain, and sexual dysfunction [13‒15, 20, 67, 96‒100].
As to studies evaluating the ability to discriminate between different groups of patients suffering from the same illness, Carroll et al.  were among the first authors to demonstrate that the total score of the unstructured version of the HAM-D17 sensitively differentiated severely depressed inpatients from moderately and mildly depressed outpatients. In other words, they showed that in-patients with severe depression scored significantly higher on the HAM-D17 than the other 2 groups of depressed patients . Knesevich et al.  reported similar findings. Using the global judgment of experienced clinicians, they allocated a small sample of 26 depressed patients into 4 severity groups: none, mild, moderate, and severe . Then, they showed that the total score of the unstructured version of the HAM-D17 sensitively differentiated between the 4 different levels of depression severity . Thase et al.  showed that the unstructured version of the HAM-D17 sensitively differentiated patients with endogenous depression from those with nonendogenous depression. They found that patients with endogenous depression scored significantly higher on the HAM-D17 than patients with nonendogenous depression . Zheng et al.  used the Global Assessment Scale developed by Endicott et al.  to evaluate depression severity and then they tested the discriminant validity of the unstructured version of the HAM-D17. They showed that the HAM-D17 total score sensitively discriminated between different levels of depression severity . They indeed demonstrated that patients with higher HAM-D17 scores were likely to be more severely disabled according to the Global Assessment Scale .
Depressed Patients versus Healthy Controls
As to studies examining the ability of the HAM-D to differentiate patients from healthy subjects, Ganchrow et al.  showed that the unstructured HAM-D17 sensitively discriminated patients with depression from controls. Fava et al.  conducted a similar study, in which they demonstrated that only 17 items of the unstructured version of the HAM-D21 (the first 16 questions and the item on diurnal variation of mood) sensitively discriminated depressed patients from healthy controls. Rehm and O’Hara  reported similar results. The total score of the unstructured version of the HAM-D17 sensitively differentiated depressed patients from healthy controls . However, they showed that 4 items (i.e., agitation, gastrointestinal symptoms, loss of insight, and weight loss) failed to discriminate between depressed patients and controls .
Different Groups of Patients
As to studies analyzing the ability of the various versions of the HAM-D to discriminate between different groups of patients, Rush et al.  found that the total score of the unstructured version of the HAM-D17 sensitively differentiated patients with major depression from those with other psychiatric diagnoses (e.g., bipolar disorder, schizophrenia, generalized anxiety disorder, panic disorder).
Carneiro et al.  showed that only 4 items of the unstructured HAM-D17 (i.e., insomnia late, general somatic symptoms, hypochondriasis, and insight) sensitively discriminated depressed patients from bipolar I patients.
In the clinimetric approach, to be considered clinically meaningful, cut-off scores should be tested using the global judgment of the experienced clinician as the gold standard [91, 108].
Using the judgment of the experienced clinician as the main index of validity, Zimmerman et al. established score ranges for the HAM-D17 reflecting different levels of depression severity. They found that a score ranging from 8 to 16 corresponded to mild depression, while a score ranging from 17 to 23 reflected moderate depression. They also showed that a score ≥24 on the HAM-D17 was indicative of severe depression. Using the same approach (i.e., global impression of the experienced clinician as the main index of validity), Kyle et al. established cut-off scores for remission in the different versions of the HAM-D. The cut-off scores found to be clinically valid indicators of remission were the following: <5 for the HAM-D6, <8 for the HAM-D17, <9 for the HAM-D21, and <10 for the HAM-D24. Bobo et al. applied the same clinimetric approach but, in this case, they focused on the clinician’s global impression of improvement to identify cut-off scores indicative of a clinically significant level of change in the unstructured HAM-D17 and HAM-D6. They found that a clinician’s global evaluation of improvement was associated with a reduction of 11 points in the HAM-D17 scores, corresponding to a percent reduction of 50–57%. As to the HAM-D6, the clinician’s global impression of improvement was found to be associated with an absolute reduction of 7 points, corresponding to a percent reduction of 57–63%.
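The score ranges and remission cut-offs reported above lend themselves to a simple lookup. The following sketch encodes only the values given in the text; the function names are hypothetical, and the label for HAM-D17 scores below 8 is our own placeholder, since the text does not name that range. The baseline score in the usage example is also hypothetical.

```python
# Remission cut-offs as reported in the text (remission = score below cut-off).
REMISSION_CUTOFFS = {"HAM-D6": 5, "HAM-D17": 8, "HAM-D21": 9, "HAM-D24": 10}


def ham_d17_severity(score):
    """Map a HAM-D17 total score to the severity ranges reported in the text."""
    if score < 0:
        raise ValueError("score must be non-negative")
    if score >= 24:
        return "severe"
    if score >= 17:
        return "moderate"
    if score >= 8:
        return "mild"
    # Placeholder label: the text does not name the range below 8.
    return "below mild range"


def in_remission(version, score):
    """True if the score is below the remission cut-off for the given version."""
    return score < REMISSION_CUTOFFS[version]


def percent_reduction(baseline, endpoint):
    """Percent reduction from baseline to endpoint total score."""
    return 100.0 * (baseline - endpoint) / baseline


# Hypothetical patient: HAM-D17 of 20 at baseline, 9 at endpoint (11-point drop).
print(ham_d17_severity(20), "->", ham_d17_severity(9))
print(in_remission("HAM-D17", 9))
print(percent_reduction(20, 9))
```

In this hypothetical case the 11-point reduction amounts to 55% of the baseline score, within the 50–57% range reported for the HAM-D17, yet the endpoint score of 9 remains above the remission cut-off.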
Discriminant Validity of the HAM-D17 Compared to the CID
A number of studies compared the discriminant validity of the HAM-D17 with that of the CID in general practice (GP). In a study by Freeling et al. [110, 111], patients whose major depression had gone unrecognized by their physicians appeared to be significantly less severely depressed on the CID, but not on the HAM-D17, than those whose depression had been recognized. On both the HAM-D17 and the CID patients with unrecognized depression showed less evidence of overt depressed mood. On the HAM-D17 they showed greater lack of insight, whereas on the CID they were less obviously depressed at interview based on their appearance. Patients with unrecognized depression had also higher scores on the CID reactivity to social environment and distinct quality of mood. When depressed patients receiving a new prescription of an antidepressant in GP were compared with those given other treatments, and with antidepressant-treated psychiatric outpatients , the mean HAM-D17 and CID depression scores were considerably higher in the outpatients than in the 2 GP samples. Significant differences were found also between the 2 GP samples on both scales, with higher scores for antidepressant-treated GP patients compared to those receiving other treatments. The CID provided a detailed description of specific symptom patterns for each subgroup. Differences between GP female patients with recognized and unrecognized depression in their symptom ratings were found for 2 individual items of the CID (i.e., tiredness and distinct quality of depressed mood), but not on the HAM-D17 .
Further, significantly slower improvement 2 weeks after admission was detected with the CID in depressed inpatients with comorbid personality disorders compared to those without, even though differences did not reach significance on the HAM-D17 . In another study , both the HAM-D17 and the CID sensitively discriminated between acutely and remitted depressed patients.
Sensitivity to Change
The differentiation between an active drug and placebo [54, 63, 64, 116, 117] or between a specific psychotherapeutic treatment and attention placebo or clinical management (CM)  is particularly important when treatment effects are small and in the setting of subclinical symptoms. Kellner and Sheffield  used the term “sensitivity” to describe this clinimetric property.
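One common way to express how well a scale separates active treatment from placebo is a standardized mean difference between the two arms. The text does not state that the studies reviewed here used this index, so the following is only a sketch under that assumption; the function name and all change scores are hypothetical and were chosen purely for illustration.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(active, placebo):
    """Difference in mean change between arms divided by the pooled SD."""
    n1, n2 = len(active), len(placebo)
    pooled_variance = (
        (n1 - 1) * variance(active) + (n2 - 1) * variance(placebo)
    ) / (n1 + n2 - 2)
    return (mean(active) - mean(placebo)) / sqrt(pooled_variance)


# Hypothetical reductions in total score from baseline (illustrative only).
short_scale_active = [6, 7, 5, 8, 6]
short_scale_placebo = [3, 4, 2, 5, 3]
long_scale_active = [10, 14, 8, 15, 9]
long_scale_placebo = [8, 12, 6, 13, 9]

smd_short = standardized_mean_difference(short_scale_active, short_scale_placebo)
smd_long = standardized_mean_difference(long_scale_active, long_scale_placebo)

print(f"short scale: {smd_short:.2f}")
print(f"long scale:  {smd_long:.2f}")
```

In these made-up data the shorter scale yields the larger standardized difference because its scores vary less within each arm. The example says nothing about actual trial results; it only shows how two scales applied to the same patients may differ in their ability to separate the arms.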
Comparison with Observer-Rated Scales
Montgomery and Åsberg  were among the first authors who criticized the unstructured version of the HAM-D17  for being poorly sensitive in detecting change during treatment. They developed a new rating scale, the Montgomery-Åsberg Depression Rating Scale (MADRS) , which was specifically designed to be more sensitive than the HAM-D17. Actually, when Khan et al.  analyzed records of 208 depressed patients, who participated in 8 randomized, placebo-controlled, double-blind clinical trials between 1996 and 2000, they found that the MADRS was just as sensitive as the HAM-D17 in differentiating between antidepressants and placebo. Studies also showed that the HAM-D6, the short version of the HAM-D introduced by Bech et al. [13‒15], was more sensitive than the MADRS [119‒125]. The HAM-D6, but not the MADRS, was sensitive to the superior antidepressant efficacy of mirtazapine over trazodone . In another study testing the antidepressive effects of hypericum, Lecrubier et al.  showed that the HAM-D6, but not the MADRS, sensitively discriminated between active drug and placebo. Similarly, Liebowitz et al.  found that the HAM-D6, but not the MADRS, detected the antidepressant superiority of desvenlafaxine over placebo. The sensitivity to change of the HAM-D6 has been tested in a number of other clinical trials [121, 123, 126‒132].
Helmreich et al.  compared the sensitivity of the unstructured HAM-D17 with that of the 28-item clinician version of the Inventory of Depressive Symptomatology, the IDS-C28, a relatively new rating scale, which was developed by Rush and his research group [106, 134] for evaluating signs and symptoms of depression. Using data from 340 patients in a 10-week randomized, placebo-controlled trial comparing the effectiveness of sertraline and cognitive-behavioral therapy, they found that the IDS-C28 was more sensitive than the HAM-D17 in detecting small changes in depression symptomatology over the treatment course . Liu et al.  compared the sensitivity of the unstructured HAM-D17 with that of the 16-item clinician version of the Quick Inventory of Depressive Symptomatology, the QIDS, another rating scale, which was developed by Rush and his research group [136, 137]. Evaluating depression at baseline and 6 weeks later, they found that the QIDS and the HAM-D17 were equally sensitive to change of depressive symptoms .
Sensitivity of HAM-D6 Compared to that of HAM-D17
Østergaard et al. found that the HAM-D6 was more sensitive than the HAM-D17 in capturing the antidepressant effects of citalopram. In another study, Bech et al. found that the HAM-D6, but not the HAM-D17, was sensitive to the superior antidepressant efficacy of bupropion over buspirone. Many other studies reported similar results: the HAM-D6, but not the HAM-D17, was found to sensitively differentiate between different antidepressant effects [120‒122, 124‒126, 140‒146].
Differentiating Active Treatment from Placebo
Studies showed that the HAM-D6, but not the HAM-D17, sensitively discriminated between active drug and placebo [139, 147‒149]. In Chouinard et al. , brofaromine was statistically (p < 0.050) superior to placebo on the HAM-D6, but did not significantly differ from placebo on the HAM-D17. Fabre et al.  obtained similar results. Using the HAM-D6, they showed that sertraline was significantly superior to placebo at all 3 doses (i.e., 50, 100, and 200 mg daily) . This finding was not replicated when they used the HAM-D17 . In a subsequent study  the HAM-D6, but not the HAM-D17, sensitively discriminated active treatment from placebo. More specifically, using the HAM-D6, Feiger et al.  found that selegiline was statistically (p < 0.01) superior to placebo in decreasing symptoms of major depression.
Comparison with Self-Rating Scales
Studies compared the sensitivity of the various versions of the HAM-D with that of self-reported questionnaires. Carroll et al.  were among the first authors who compared the sensitivity of the unstructured version of the HAM-D17 with that of the Zung Self-Rating Depression Scale. They showed that the HAM-D17, but not the Zung Self-Rating Depression Scale, sensitively discriminated severely depressed inpatients from outpatients with moderate or mild depression . Edwards et al.  conducted a meta-analysis, in which they compared the sensitivity of the unstructured version of the HAM-D17 with that of the Beck Depression Inventory (BDI), one of the most widely used questionnaires for the assessment of self-reported symptoms of depression . They found that the HAM-D17 was more sensitive to change than the BDI . Other studies reported similar results: the HAM-D17, but not the BDI and the Zung Self-Rating Depression Scale, was highly sensitive to change [32, 152].
Comparison with the CID
The HAM-D17 and the CID were used in a placebo-controlled trial of amitriptyline among depressed patients in general practice [153, 154]. Several individual items of the CID showed the superiority of amitriptyline over placebo over 6 weeks of treatment, whereas only 4 items of the HAM-D17 (i.e., depressed mood, guilt, early insomnia, and late insomnia) displayed significant drug-placebo differences.
In a study on relapse prevention with cognitive therapy (CT) in residual depression [156, 157], similarly lower, but non-significant, HAM-D17 and CID scores were found at 1-year follow-up in the CT + CM group compared to the CM only group. Differences between groups were most marked at the end of treatment and over the next 6 months, and were not fully lost until 3 1/2 years after the end of CT. Significant time by group interactions were found at 1-year follow-up for the CID depression score and 2 individual items (i.e., guilt and self-esteem, and hopelessness and pessimism), but not for the HAM-D17.
In the clinimetric approach, it is particularly important to test whether the items of the rating scale reflect one single dimension of severity [20, 55, 91, 159]. More specifically, item response theory models (i.e., Rasch and Mokken analyses) are required to determine the level of scalability of the rating scale [20, 55, 91, 159]. The Rasch analysis is the parametric version of item response theory models, while the Mokken analysis is the corresponding non-parametric version [91, 161]. In the Rasch model, the scalability is evaluated by using a number of fit indices such as the differential item functioning (the difficulty level of each item), the invariant item ordering (the expected score for an “easy” item is always higher than the expected score for a “hard” item) and the local independence of items (the probability of a positive score on an item should not depend on a positive score on any other item) [91, 160, 162]. When the data meet these criteria, the Rasch model assumes that both the respondent’s ability (e.g., his/her level of depression) and the degree of clinical information (e.g., the severity of depression) measured by the item are evaluated on the same scale [91, 160, 162]. Clinically, it means demonstrating that all items of the rating scale are unidimensional, that is, they measure different symptoms but of the same clinical dimension [55, 91, 163]. In the Mokken analysis, the scalability is evaluated by using Loevinger’s coefficient. Such a clinimetric coefficient is an expression of the extent to which each item of the rating scale covers a specific (i.e., unique) level of symptom severity of an underlying clinical dimension [55, 91]. According to Mokken, a Loevinger’s coefficient ≥0.40 indicates not only that items are not redundant but also that the total score of the rating scale is a statistically sufficient and clinically valid measure of the severity of the clinical condition under assessment.
The clinimetric property of scalability is thus important to differentiate the various versions of the HAM-D.
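As an illustration of how a scalability coefficient of this kind can be obtained, the following minimal sketch computes a scale-level Loevinger’s coefficient (H) as the sum of the observed inter-item covariances divided by the sum of the maximum covariances attainable given the item marginals. It rests on simplifying assumptions that are not part of the studies reviewed here: items are dichotomized (0 = symptom absent, 1 = symptom present), whereas HAM-D items are polytomous, and the function name and the two small data sets are hypothetical.

```python
from itertools import combinations


def loevinger_h(responses):
    """Scale-level Loevinger's H for dichotomous (0/1) item responses.

    responses: list of rows, one per patient; each row is a list of 0/1 scores.
    """
    n = len(responses)
    k = len(responses[0])
    # Proportion of patients scoring 1 on each item (item "popularity")
    p = [sum(row[i] for row in responses) / n for i in range(k)]
    cov_sum = 0.0
    cov_max_sum = 0.0
    for i, j in combinations(range(k), 2):
        p_ij = sum(row[i] * row[j] for row in responses) / n
        cov_sum += p_ij - p[i] * p[j]                 # observed covariance
        cov_max_sum += min(p[i], p[j]) - p[i] * p[j]  # maximum given marginals
    if cov_max_sum == 0:
        raise ValueError("H is undefined when an item has no variance")
    return cov_sum / cov_max_sum


# Hypothetical data: a perfect cumulative (Guttman) pattern of 3 items ...
perfect = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
# ... and the same data plus one patient who endorses only the "hardest" item
with_error = perfect + [[0, 0, 1]]

print(round(loevinger_h(perfect), 2))
print(round(loevinger_h(with_error), 2))
```

In the first data set every patient who endorses a “hard” item also endorses the “easier” ones, so the coefficient reaches its maximum of 1; a single response pattern that violates this ordering is enough to pull the coefficient below the acceptance threshold cited above.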
Scalability of the Different Versions of the HAM-D
Bech and his research group conducted a number of studies in which they analyzed the scalability of the various versions of the HAM-D [14, 20, 96, 98, 100, 165‒172]. In these studies, including a recent paper by da Silva et al. , the HAM-D6 was found to have an excellent level of scalability with Mokken coefficients ranging from 0.42  to 0.65 . Studies using the Rasch analysis further confirmed the scalability of the unstructured version of the HAM-D6 [96, 165, 170]. The other HAM-D versions, which were found to have an acceptable level of scalability, were the unstructured HAM-D24  and the CID . Using the Rasch analysis, Bech et al.  demonstrated that the CID contains valid subscales for the assessment of affective disorders such as depression, anxiety, and apathy.
Conflicting results were obtained for the scalability of the unstructured version of the HAM-D17. Mokken coefficients ranging from 0.24  to 0.35  were found, indicating that the HAM-D17 is a multidimensional rating scale. Similarly, Bobes et al.  and Kyle et al.  questioned the scalability of the unstructured version of the HAM-D21. They found Mokken coefficients ranging from 0.29  to 0.30 , indicating that the original version of the HAM-D, the HAM-D21, is a multidimensional rating scale.
A high correlation is often regarded as evidence that 2 rating scales measure the same clinical factor. However, a high correlation does not indicate similar clinical validity [53, 91, 108]. Two rating scales may have a common content, which ensures a high positive correlation, but they may display differential sensitivity [53, 64]. The concurrent validity of the different versions of the HAM-D has been widely examined, especially by using self-reported questionnaires.
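The distinction between correlation and sensitivity can be made concrete with a small numerical sketch. The scores below are invented for illustration only: scale B shares the content of scale A but adds variation that is unrelated to treatment. Expressing sensitivity as a standardized drug-placebo difference (Cohen’s d) is an assumption of this example, not a method taken from the studies cited.

```python
from math import sqrt
from statistics import mean, variance


def pearson_r(x, y):
    """Product-moment correlation between two lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cohens_d(placebo, drug):
    """Standardized mean difference (placebo minus drug) using the pooled SD."""
    n1, n2 = len(placebo), len(drug)
    pooled_var = ((n1 - 1) * variance(placebo)
                  + (n2 - 1) * variance(drug)) / (n1 + n2 - 2)
    return (mean(placebo) - mean(drug)) / sqrt(pooled_var)


# Hypothetical end-of-treatment scores on scale A
a_placebo = [14, 16, 18, 20, 22, 24]
a_drug = [8, 10, 12, 14, 16, 18]

# Scale B: same content plus treatment-unrelated variation (hypothetical)
noise = [-3, 3, -3, 3, -3, 3]
b_placebo = [a + e for a, e in zip(a_placebo, noise)]
b_drug = [a + e for a, e in zip(a_drug, noise)]

r_ab = pearson_r(a_placebo + a_drug, b_placebo + b_drug)
d_a = cohens_d(a_placebo, a_drug)
d_b = cohens_d(b_placebo, b_drug)
print(f"r(A, B) = {r_ab:.2f}, d(A) = {d_a:.2f}, d(B) = {d_b:.2f}")
```

With these invented data the two scales correlate highly (r of about 0.87), yet the drug-placebo difference is considerably larger on scale A (d of about 1.6) than on scale B (d of about 1.1).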
Different Versions of the HAM-D Compared to Self-Rating Scales
Prusoff et al. were among the first authors to test the concurrent validity of the unstructured version of the HAM-D17. Using a sample of 200 depressed patients, they found that the HAM-D17 had low correlations with self-rating scales at baseline (i.e., during the acute episode of illness), while the correlations were significantly higher at follow-up, when patients improved with treatment. The authors concluded that, compared to self-rating scales, the HAM-D17 was a better measure of depression severity. Focusing on a sample of 40 depressed outpatients and 40 healthy controls, Fava et al. examined the concurrent validity of the unstructured version of the HAM-D21 using 2 self-rating scales, the Symptom Rating Test and the Symptom Questionnaire (SQ) [176, 177]. They found similar product-moment correlations (ranging from 0.65 to 0.72) between the HAM-D21 and both the Symptom Rating Test and the SQ. They concluded that low correlations, such as those found by Prusoff et al., might be due to the specific self-rating scales that were administered rather than to substantial differences between clinician-rated and patient-reported scales. Gottlieb et al. compared the HAM-D21 with the Zung Self-Rating Depression Scale and showed that in the sub-group of patients with mild Alzheimer’s disease (AD) there was a statistically significant correlation between the 2 rating scales (r = 0.49). However, in the sub-group of patients with severe AD, there was no correlation between the HAM-D21 and the Zung Scale. The authors concluded that self-reported questionnaires are of questionable clinical utility, particularly in patients with advanced AD. Studies also evaluated the concurrent validity of the different versions of the HAM-D using the BDI [23, 73, 89, 179‒183]. Rehm and O’Hara found that there was a statistically significant correlation between the total score of the unstructured version of the HAM-D17 and the BDI (r = 0.73). Whisman et al. found that the DIS-HRSD and the unstructured version of the HAM-D17 exhibited similar correlations to the BDI. Akdemir et al. tested the concurrent validity of the SIGH-D, the structured interview version of the HAM-D, which was published by Williams. Using the total sample of 94 depressed patients, they showed that there was a significant but moderate (r = 0.48) correlation between the SIGH-D and the BDI. However, when they analyzed only the sub-group of patients with severe depression, they found that this correlation was no longer statistically significant. The authors concluded that the SIGH-D was clinically superior to the BDI, particularly in the assessment of depression severity.
Concurrent Validity of the HAM-D17 Compared to the CID
Authors compared the concurrent validity of the HAM-D17 with that of the CID [184‒186], and moderately high correlations were found for the CID depression score, whereas the CID anxiety score correlated to some extent with the HAM-D17, reflecting the inclusion of several anxiety items in the latter. Correlations of individual items of the HAM-D17 and the CID were also examined , and were high, except for depressed feelings, which in the CID are rated on intensity on questioning, whereas in the HAM-D17 they are rated for the degree to which they dominate the verbal and nonverbal content of the interview, including observed appearance. Highly significant correlations between the HAM-D17 and the CID were found also in another study of patients with major depressive disorder .
Bagby et al. and Zimmerman et al. concluded in their reviews that it was time to discard the HAM-D and to embrace a new gold standard for the assessment of depression. Considering the HAM-D a flawed measure, they [38, 40] proposed to adopt psychometrically superior rating scales such as the MADRS and the clinician version of the IDS [106, 134]. The MADRS was specifically developed to be more sensitive to change than the HAM-D. However, the available literature runs counter to this assumption, particularly when the HAM-D6 is used [119, 122, 124]. In fact, it has been shown that the HAM-D6, but not the MADRS, sensitively discriminated between active treatment and placebo [122, 124]. Similar considerations apply also to the clinician version of the IDS. In a clinimetric reanalysis of the STAR*D study, Bech et al. found that the IDS was a poorly sensitive and multidimensional measure of depression severity.
In their reviews, Bagby et al.  and Zimmerman et al.  criticized other aspects of the HAM-D. They considered the differential item weight as one of the most important limitations of the HAM-D [38, 40]. The evidence that certain HAM-D items contributed more to the total score than others clashed with the psychometric assumption of homogeneity of items [38, 40]. In the psychometric model, to be included in a rating scale, all items have to display the same clinical weight [54, 57, 59, 188]. In the clinimetric approach, however, the differential item weighting is not a disadvantage, but a basic requirement for rating scales [55, 91]. Accordingly, not all items carry the same clinical weight, and major and minor symptoms can be differentiated [54, 55, 59, 91]. Using the HAM-D items, Bech et al.  found that symptoms such as depressed mood, loss of interest, and tiredness occurred in both the mildly and the more severely depressed patients, while symptoms such as guilt feelings and psychomotor retardation were only present in the more severely depressed patients. In other words, they demonstrated the clinical utility of differential weighting of HAM-D items, which can be used to sensitively distinguish major depression from moderate to mild depression . Using the HAM-D, Bech et al.  also showed that symptoms of depressed mood, loss of interest, and tiredness were more prevalent and occurred before the onset of more severe symptoms of depression.
Another criticism raised by Bagby et al.  and Zimmerman et al.  was that the HAM-D items did not cover the DSM diagnostic criteria for major depression. As noted by Zimmerman et al. , such an incomplete coverage of the psychiatric classification criteria for depression significantly limited the utility of the HAM-D as a diagnostic measure.
The introduction of diagnostic criteria for the identification of psychiatric syndromes has considerably decreased the variance due to different assessors and the use of inferential criteria rather than direct observation. The diagnostic criteria are particularly helpful in setting a threshold for conditions worthy of clinical attention. Accordingly, the diagnostic criteria for a major depressive disorder identify a syndrome, which may be responsive to antidepressant drugs. At least 5 of a set of 9 symptoms should be present (and 1 should be either depressed mood or loss of interest). However, according to the psychometric model, all items are weighted equally, unlike in clinical medicine, where major and minor symptoms are often differentiated (e.g., Jones criteria for rheumatic fever). As a result, a patient with severe and pervasive anhedonia, incapacitating fatigue, and difficulties concentrating, which make him or her unable to work, would not be diagnosed as suffering from a major depressive disorder, despite the clinical intuition of potential benefit from pharmacotherapy. The diagnosis could, however, be made in a patient who barely meets the criteria for 5 symptoms. The hidden conceptual model is psychometric: severity is determined by the number of symptoms, not by their intensity or quality, to the same extent that a score in a depression self-rating scale depends on the number of symptoms that are scored as positive [54, 59].
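The contrast between counting symptoms and rating their intensity can be sketched in a few lines. The example below is illustrative only: it encodes just the symptom-count rule described above (at least 5 of 9 symptoms, 1 of which must be depressed mood or loss of interest) and omits the duration, impairment, and exclusion requirements of the full diagnostic criteria. The symptom labels, the 0 to 4 intensity ratings, and the two patient profiles are hypothetical.

```python
CORE_SYMPTOMS = {"depressed_mood", "loss_of_interest"}
OTHER_SYMPTOMS = {
    "appetite_or_weight_change",
    "sleep_disturbance",
    "psychomotor_change",
    "fatigue",
    "worthlessness_or_guilt",
    "poor_concentration",
    "suicidal_ideation",
}
ALL_SYMPTOMS = CORE_SYMPTOMS | OTHER_SYMPTOMS


def meets_symptom_count(ratings, min_symptoms=5):
    """Symptom-count rule only: every symptom rated above 0 counts the same."""
    present = {s for s, r in ratings.items() if s in ALL_SYMPTOMS and r > 0}
    return len(present) >= min_symptoms and bool(present & CORE_SYMPTOMS)


def total_intensity(ratings):
    """Sum of intensity ratings (hypothetical 0-4 rating per symptom)."""
    return sum(ratings.values())


# Patient with 3 severe, incapacitating symptoms (hypothetical ratings)
severe_few = {"loss_of_interest": 4, "fatigue": 4, "poor_concentration": 4}

# Patient who barely meets 5 symptoms, all mild (hypothetical ratings)
mild_many = {
    "depressed_mood": 1,
    "sleep_disturbance": 1,
    "fatigue": 1,
    "poor_concentration": 1,
    "appetite_or_weight_change": 1,
}

for name, patient in [("severe_few", severe_few), ("mild_many", mild_many)]:
    print(name, meets_symptom_count(patient), total_intensity(patient))
```

Under the counting rule only the second patient reaches the threshold, although the summed intensity of the first patient is more than twice as high.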
Bech [91, 190, 191] also showed that rating scales only covering DSM and ICD diagnostic criteria were found to be poorly sensitive to sub-threshold symptoms of depression. The Hamilton scales, particularly the HAM-D6 and the CID subscales, are not diagnostic instruments but rating scales measuring depression severity [1‒3]. As has been stated by Corruble and Hardy  in this regard: “…the scale should not be compared to DSM-IV criteria because the 2 measures have different objectives; that is, the Hamilton depression scale assesses depression severity in depressed patients, and the DSM-IV defines a diagnosis of major depression.” Nevertheless, such rating scales can be used in combination with current psychiatric classification systems (e.g., DSM-5). Reaching a concordance between diagnostic criteria and the dimensional evaluation of rating scales is the ultimate goal to achieve for providing a comprehensive assessment of the patient’s clinical condition [55, 190, 191]. Therefore, clinician-reported scales, such as the HAM-D6 and the CID, and DSM or ICD diagnostic criteria are actually not conflicting but convergent evaluation methods , which should be used jointly to perform adequate clinical assessment of depression.
Bagby et al.  also criticized the reliability of the HAM-D. Particularly, they concluded that the inter-rater and test-retest reliability coefficients were weak for most of the HAM-D items. Actually, reliability was found to be excellent when the structured versions of the HAM-D were used [7, 22, 84‒86, 89]. These findings are in line with a recent meta-analysis  demonstrating excellent inter-rater reliability for most of the HAM-D individual items.
As noted by Carroll , the endurance of the various versions of the HAM-D is remarkable and their utility in the clinical process of assessment of depression severity has been clearly demonstrated.
The findings of this review indicate that the HAM-D is a valid and sensitive clinimetric index, which should not be discarded in view of obsolete and not clinically relevant psychometric criteria. It is, however, important to note that the various versions of the HAM-D, including the CID, entail different clinimetric properties. Simple reference to the unstructured versions of the HAM-D [2, 3] is no longer acceptable. The choice of the most adequate version depends on a number of clinical factors: raters’ clinical experience and their level of training and expertise in the use of the HAM-D, study aims and design, and clinical characteristics of the population under examination. In other words, there is no single version of the HAM-D that best applies to all clinical situations; investigators are asked to select the version that best fits the aims of the study and to report appropriate reference details in the registration and publication of the study protocol. Further, in performing meta-analyses, assembling data that derive from different versions of the HAM-D may add to other variables to yield substantial heterogeneity, a major drawback of the meta-analytic method [194, 195]. Similar considerations may apply to the use of the HAM-D in network analysis [196, 197].
A few indications emerge from the literature. The total score of the HAM-D17, and not that of the HAM-D21, should be used for sensitively discriminating between different levels of depression severity. If the aim of the investigation is to discriminate between active treatment and placebo, the HAM-D6 and the CID should be considered. The CID, in particular, appears to be indicated when differences are expected to be small and/or when dealing with mild or subclinical symptoms [9, 198]. In fact, the use of 7-point Likert scales, as reported by Bech et al. , makes the CID particularly suitable to sensitively capture the relatively milder or subsyndromal symptoms of depression. By contrast, there appears to be no valid justification for using the MADRS, which was found to be less sensitive than the HAM-D6 or as sensitive as the HAM-D17 to changes with treatment [20, 37, 119, 122, 124].
Regardless of the number of items, unstructured versions of the HAM-D [2, 3] should be avoided, while the use of structured versions in clinical trials appears to be mandatory to ensure inter-rater reliability. In view of its clinimetric properties [9, 174], the CID should be considered as the best structured version of the HAM-D and it can be regarded as the gold standard for the assessment of depression severity. The use of the structured versions of the HAM-D can be supplemented by other indices based on the clinimetric principle of incremental validity [163, 189]. Accordingly, each distinct aspect of psychological measurement should deliver a unique increase in information in order to qualify for inclusion and this should be applied to the selection of instruments in a clinical trial. Several highly redundant scales are often used under the misguided assumption that nothing will be missed. On the contrary, violation of the concept of incremental validity only leads to conflicting results. Structured versions of the HAM-D, in particular the CID, may thus be used in conjunction with self-rating scales for depression and anxiety that are likely to improve incremental validity. Kellner’s SQ, in view of the data that are available and its high sensitivity to change [176, 177], appears to be very suitable for this purpose. Transdiagnostic clinimetric indices that may assess other dimensions, such as psychological well-being and euthymia [163, 199‒201] or mental pain [202, 203], may also add valuable information.
All authors have no conflicts of interest to declare.
There are no funding sources to declare.
All authors conceived the project. D.C. and C.P. performed the searches and collected the data. All authors analyzed the data. All authors drafted and revised the manuscript.