Objective: We aimed to investigate the validity of the Bethesda System for Reporting Thyroid Cytopathology (TBSRTC) through meta-analysis. Study Design: All publications between January 1, 2008 and September 1, 2011 that studied TBSRTC and had available histological follow-up data were retrieved. To calculate the sensitivity, specificity and diagnostic accuracy, the cases diagnosed as follicular neoplasm, suspicious for malignancy and malignant which were histopathologically confirmed as malignant were defined as true-positive. True-negative included benign cases confirmed as benign on histopathology. The nondiagnostic category was excluded from the statistical calculation. The correlations between the 6 diagnostic categories were investigated. Results: The publications review resulted in a case cohort of 25,445 thyroid fine-needle aspirations, 6,362 (25%) of which underwent surgical excision; this group constituted the basis of the study. The sensitivity, specificity and diagnostic accuracy were 97, 50.7 and 68.8%, respectively. The positive predictive value and negative predictive value were 55.9 and 96.3%, respectively. The rates of false negatives and false positives were low: 3 and 0.5%, respectively. Conclusions: The results of meta-analysis showed high overall accuracy, indicating that TBSRTC represents a reliable and valid reporting system for thyroid cytology.
Thyroid nodules are common and can be detected by ultrasound in up to 60% of the general population [1,2,3]. Data from the Surveillance, Epidemiology and End Results (SEER) registry show an increasing prevalence of differentiated thyroid cancer worldwide [4,5], most likely due to the increased detection of small papillary carcinomas. For the initial evaluation of patients with such nodules, thyroid fine-needle aspiration (FNA) has proven to be a rapid, cost-effective, safe and reliable method of investigation. Moreover, this procedure is the most appropriate diagnostic tool to distinguish patients who require clinical management from those who require surgical excision. The increasing prevalence of thyroid cancer and improvements in the technology and resolution of ultrasound machines have led to an increasing number of cytological diagnostic procedures.
In 2007, a conference was held in Bethesda, Md., with one of its objectives being to standardize the diagnostic terminology for reporting thyroid cytopathology results. The recommendations resulting from this conference led to the formation of The Bethesda System for Reporting Thyroid Cytopathology (TBSRTC). This classification scheme has achieved its purpose of standardizing thyroid cytopathology reporting, as evidenced by several publications. However, the category of follicular lesion/atypia of undetermined significance remains the most controversial category because of heterogeneity in its use among institutions and in follow-up. This is because it is nearly impossible to establish distinct morphologic criteria for diagnosing atypia among cytopathologists [9,10,11,12,13,14,15,16]. It has been shown that the best way to implement TBSRTC in a pathology laboratory and to convince clinicians of its validity and robustness is to compare the outcome data prior to and after the implementation of TBSRTC. A comparison of the outcomes obtained by several institutions showed notable differences in the percentage of patients referred to surgery and the risk of malignancy for different diagnostic categories (DCs).
After 4 years of TBSRTC, the question is whether this classification scheme has served its purpose of refining the cytopathologic interpretation of thyroid FNA and the management of thyroid nodules. In this investigation, we performed a comprehensive meta-analysis of thyroid FNA studies conducted using TBSRTC to assess the variability in the use of the 6 DCs between different institutions, with the aim of investigating the validity of this reporting system in view of histological outcomes.
Materials and Methods
The primary sources of the reviewed publications were the PubMed database (http://preview.ncbi.nlm.nih.gov/pubmed) and manual searches of the reference lists of the selected studies. The search covered literature published from January 2008 to September 2011. The inclusion criteria for the meta-analysis were publications written in the English language which classified thyroid FNA specimens according to TBSRTC and included surgical follow-up. Eight studies met the inclusion criteria and were utilized (table 1) [17,18,19,20,21,22,23,24]. The data collected from the University of Pennsylvania Medical Center, Philadelphia and the Massachusetts General Hospital, Boston, have already been reported in a different study and have been split for this meta-analysis. The data points for all of the DCs and histopathologic follow-up were extracted and entered into an Excel datasheet for statistical analysis.
For the purpose of this report, the 6 DCs used in the TBSRTC were defined as: DC I = nondiagnostic, DC II = benign, DC III = atypia/follicular lesion of undetermined significance (AUS/FLUS), DC IV = follicular neoplasm/suspicion for a follicular neoplasm (FN/SFN), DC V = suspicious for malignancy and DC VI = malignant (table 2) [8,25,26].
The follow-up data included only reported histopathologic follow-up. Any incidentally detected lesions in the surgical resection specimens that were not the target of the thyroid FNA were excluded from the statistical analysis. One author included microcarcinomas in the histopathological follow-up.
Statistical analysis was performed on the final Excel spreadsheet containing the data extracted from each publication employing both a log-linear model (likelihood ratio) and a χ2 model with symmetric measures of association. Two-sided p values of <0.05 were considered statistically significant. The phi correlation was calculated to measure the strength of the association between categorical data; values ranged from +1 to –1, with +1 denoting a strong positive association, –1 indicating a strong negative association and 0 demonstrating no association.
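As an illustration of the phi correlation described above, the following minimal sketch computes phi for a single 2 × 2 table, assuming that a pairwise comparison of two DCs against histology (malignant vs. benign) can be laid out this way. The function names and the counts are hypothetical and are not taken from the study, which used SAS rather than Python.

```python
import math


def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi for a 2 x 2 table laid out as:

                  malignant   benign
    category A        a          b
    category B        c          d
    """
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denominator


def chi_square_from_phi(phi: float, n: int) -> float:
    """Pearson chi-square (no continuity correction) for a 2 x 2 table: n * phi^2."""
    return n * phi ** 2


# Hypothetical counts for illustration only (NOT data from the meta-analysis).
phi = phi_coefficient(40, 10, 10, 40)
chi2 = chi_square_from_phi(phi, 40 + 10 + 10 + 40)
print(f"phi = {phi:.3f}, chi-square = {chi2:.1f}")
```

A value near +1 or –1 indicates a strong association between category and histological outcome, whereas a value near 0 indicates none, which matches the interpretation given in the text.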
The sensitivity, specificity and diagnostic accuracy were calculated considering thyroid FNA as a ‘screening test’. FNA specimens interpreted as benign (DC II) were considered negative results, and those in the remaining categories (DC IV, DC V and DC VI) were considered positive results because they led to a recommendation of surgery. True-negative cases were benign on FNA and confirmed as benign on histopathology; true-positive cases were diagnosed as follicular neoplasm, suspicious for malignancy or malignant and confirmed as malignant. The false-positive category included cases that were diagnosed as follicular neoplasm, suspicious for malignancy or malignant but were confirmed as benign. The false-negative cases included those diagnosed as benign on FNA but confirmed as malignant upon surgical excision.
The use of the term ‘positive’ is for statistical purposes only and does not indicate ‘malignant.’ The DC I category was excluded from the statistical analysis because these diagnoses usually led to a repeat FNA rather than to surgical excision. Considering that almost 40% of cases in the AUS/FLUS category underwent surgery (see results, table 3), we performed an additional statistical analysis including the AUS/FLUS category as true-positive samples in the calculation of sensitivity, positive predictive value (PPV) and diagnostic accuracy.
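The definitions above map onto the standard 2 × 2 formulas. The sketch below is a minimal illustration, assuming the four cell counts have already been tallied from the cytohistological correlation; the function name and the example counts are hypothetical and do not reproduce the study data.

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard screening-test indicators from true/false positive/negative counts."""
    return {
        # Malignant on histology that were called DC IV-VI on FNA
        "sensitivity": tp / (tp + fn),
        # Benign on histology that were called DC II on FNA
        "specificity": tn / (tn + fp),
        # Positive FNA results confirmed as malignant
        "ppv": tp / (tp + fp),
        # Benign FNA results confirmed as benign
        "npv": tn / (tn + fn),
        # All correctly classified cases over all cases analyzed
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts for illustration only (NOT data from the meta-analysis).
metrics = diagnostic_metrics(tp=90, fp=60, tn=140, fn=10)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

Adding the AUS/FLUS cases to the positive group, as in the additional analysis, would change only the tp and fp inputs; the formulas stay the same.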
SAS statistical software (version 9.1; SAS Institute Inc., Cary, N.C., USA) was used for the statistical analyses.
The 8 published studies used for our meta-analysis together with the percentage of FNA cases placed into each DC from all publications included are listed in table 1. The 6 DCs used in the TBSRTC are described in table 2.
A total of 6,362 (25%) of the 25,445 thyroid FNAs included in the meta-analysis were followed by surgery, and the cytohistological correlations from these cases are presented in table 3. For each of the 6 individual DCs in TBSRTC, we observed the following:
DC I: FNAs placed in this category ranged from 1.8% to 23.6% with an overall value of 12.9%, when considering all the studies in the meta-analysis (table 3). Surgical resection was performed in 16.2% of cases, and the risk of malignancy was 16.8%.
DC II: The cases in this category ranged from 39% to 73.8% with an overall value of 59.3%, and a cumulative malignancy rate of 3.7%.
DC III: FNA cases in this category ranged from 3% to 27.2% with an overall value of 9.6% and an overall rate of malignancy of 15.9%.
DC IV: Cases in this category ranged from 1.2% to 25.3% with an overall value of 10.1%. More than two-thirds of these (approx. 70%) underwent surgery, with a risk of malignancy of 26.1%.
DC V: FNA cases in this category ranged from 1.4% to 6.3% with an overall value of 2.7%. The mean risk of malignancy in this category was 75.2%.
DC VI: FNA cases in this category ranged from 2% to 16.2% with an overall value of 5.4% and a risk of malignancy of 98.6%.
There was a robust correlation between the DCs and the histological follow-up, except between the DCs AUS/FLUS and nondiagnostic (table 4).
The main diagnostic indicators in the TBSRTC are presented in table 5. The negative predictive value (NPV) and PPV for the benign and malignant categories were 96.3 and 98.6%, respectively. The sensitivity, specificity and diagnostic accuracy were 97, 50.7 and 68.8%, respectively, with a false-negative rate of 3% and a false-positive rate of 0.5%. The sensitivity and diagnostic accuracy were 97.2 and 60.2%, respectively, when considering the AUS/FLUS category as true-positive samples.
As evidenced by the extensive body of literature, a tiered diagnostic classification scheme, such as the 6-tiered one proposed by TBSRTC or those proposed by other associations, is an effective approach for the diagnosis and management of thyroid nodules [8,27,28]. The advantage of TBSRTC is the standardization of the reporting of thyroid cytology, which, prior to 2007, consisted of nonreproducible classification schemes that in some cases included either too few or too many DCs. The 6 DCs of TBSRTC arose from a probabilistic approach: the probability that a thyroid lesion placed into a specific category would show histological evidence of malignancy. The advantage of this approach is that each of the 6 DCs can be associated with an implied risk of malignancy that translates into a recommendation for clinical management. Since its inception, many authors have published their experience with TBSRTC; therefore, we believe it is time to take a closer look at the combined clinical applicability of this classification scheme.
We approached this meta-analysis with the following questions. How has TBSRTC impacted the clinical management of thyroid nodules, and how has the variability in reporting among different institutions as well as the average risk of malignancy for each category changed since the implementation of TBSRTC? We acknowledge that the comparison between pre- and post-TBSRTC has significant limitations; however, the utility of the information gained from the analysis as presented here has a practical value in reflecting the current state of TBSRTC and its validity/applicability in various practices.
The percentage of cases that underwent surgery varied greatly between different institutions, ranging from 11.8% to 45.1% with an average rate of 25% (6,362 out of 25,445). This variability may reflect the different follow-up times in the studies we used rather than true differences in surgical follow-up. More than two-thirds of the cases for which the TBSRTC indicated surgical management (a DC ≥FN/SFN) were resected. Interestingly, the rate of surgery was almost identical (i.e. 69.7, 73.7 and 74.0%) in the FN/SFN, suspicious for malignancy and malignant categories, respectively.
The validity of using 6 DCs is supported by the strong correlation between each DC and the histological outcomes in predicting benignity versus malignancy. The only difference that was not statistically significant was that between the nondiagnostic specimens and the AUS/FLUS DC. Interestingly, AUS/FLUS and nondiagnostic cases had similar risks of malignancy, 15.9 and 16.8%, respectively.
The average percentage of cases in the nondiagnostic category was 12.9%, ranging from 1.8 to 23.6%. This shows that even when the criteria for a satisfactory specimen are so-called ‘well established’ for solid or complex nodules, there is noticeable variability among institutions in classifying thyroid FNA cases as nondiagnostic. Thus, the range of nondiagnostic cases among different institutions is still as high as it was in the pre-TBSRTC era. One explanation for this high nondiagnostic rate, previously suggested by Renshaw, is that published studies such as those used in this meta-analysis are primarily based upon the experience of academic/large reference centers, and individual thyroid FNA cases sent to these centers for aspiration may represent cases that are technically challenging and from which it is difficult to procure an adequate specimen.
If we calculate the rate of positive cases as the number of cases diagnosed as malignant divided by all of the cases aspirated (i.e. 1,378/25,445), we obtain a malignancy rate of 5.4% that falls exactly in line with the inverse correlation of the nondiagnostic rate and positive rate described by Renshaw, thus supporting the findings of our study. Overall, 16.2% of cases classified as nondiagnostic underwent surgical excision, with a malignancy rate of approximately 17%. This is much higher than that proposed by the TBSRTC (1–4%). Interestingly, the malignancy rate of the nondiagnostic category is even slightly higher than the one observed for the AUS/FLUS category, even though more than twice the proportion of cases classified as AUS/FLUS underwent surgical excision. It is not known whether the thyroid FNA cases that underwent surgery were classified as nondiagnostic on the initial or the repeat FNA. This finding from the meta-analysis suggests that a nondiagnostic aspirate obtained by an experienced operator from a sonographically suspicious nodule should be managed cautiously due to the appreciable risk of malignancy [31,32].
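The two cohort-level rates quoted in this report (the proportion of aspirates called malignant and the proportion followed by surgery) can be reproduced from the counts given in the text. The short sketch below does only that arithmetic; the helper name `percent` is an illustrative choice and not part of the original analysis.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100 * part / whole, 1)


# Counts quoted in the text of this report.
TOTAL_FNAS = 25_445        # all thyroid FNAs in the meta-analysis
MALIGNANT_ON_FNA = 1_378   # FNAs diagnosed as malignant (DC VI)
RESECTED = 6_362           # FNAs followed by surgical excision

malignant_rate = percent(MALIGNANT_ON_FNA, TOTAL_FNAS)
surgery_rate = percent(RESECTED, TOTAL_FNAS)
print(f"malignant on FNA: {malignant_rate}%  surgical follow-up: {surgery_rate}%")
```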
Cases classified as benign showed a malignancy rate of 3.7%, which is slightly higher than that recommended by TBSRTC guidelines (0–3%); however, it is within the 0–5% range reported by the American Thyroid Association guidelines. The studies by Theoharis et al. and Her-Juing Wu et al. showed the widest variation in the percentage of benign cases (73.8 vs. 39%). Interestingly, a similarly wide variation was also observed for the AUS/FLUS category in these reports (3 vs. 27.2%). This variability could be explained by differences in patient population, nodule selection criteria or subjective classification issues.
The overall percentage of AUS/FLUS cases in this analysis was 9.6%, with a 15.9% risk of malignancy. As evidenced by many pre- and post-TBSRTC publications on this topic, the cases classified as AUS/FLUS represent the ‘gray zone’ in thyroid cytopathology for many reasons. Some authors have recommended a further subdivision of the AUS/FLUS category to better reflect the malignancy risk. In our meta-analysis, 39.2% of cases in this category underwent surgical resection, although we were unable to determine whether the surgery was performed after the first or repeat FNA diagnosis of AUS/FLUS. Because of this high percentage of AUS/FLUS cases undergoing surgery, we might ask whether this category is perceived as a ‘positive test’ rather than as an ‘indeterminate result’ that requires repeat FNA, as recommended by TBSRTC. When the AUS/FLUS category was treated as true-positive in the statistical analysis, the sensitivity of thyroid FNA changed little, from 97 to 97.2% (table 5). In contrast, the PPV and diagnostic accuracy fell from 55.9 to 46.9% and from 68.8 to 60.2%, respectively (table 5). The question arises of how one can prevent the overuse of the AUS/FLUS category in clinical practice. Krane et al. proposed using the AUS/FLUS:malignant ratio as a performance measure for the reporting of thyroid FNA in TBSRTC, similar to the ASCUS:LSIL (atypical squamous cells of undetermined significance:low-grade squamous intraepithelial lesion) ratio used in gynecological cytology. These authors also recommended that this ratio should be between 1 and 3. In our analysis the AUS/FLUS:malignant ratio was 1.8. It has been shown that adjunct molecular studies might be beneficial in further triaging of these cases. At present, these molecular tests, which consist of gene-mutation panels or RNA analysis, are being offered at both institutional and commercial levels with either a high NPV or PPV [34,35].
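The AUS/FLUS:malignant ratio discussed above can be approximated from the overall category percentages reported in this analysis (9.6% AUS/FLUS and 5.4% malignant). The sketch below assumes that the ratio of percentages stands in for the ratio of raw case counts, which holds when both percentages share the same denominator; the function name and the range constants are illustrative.

```python
def aus_to_malignant_ratio(aus_pct: float, malignant_pct: float) -> float:
    """Ratio of AUS/FLUS diagnoses to malignant diagnoses (same denominator assumed)."""
    if malignant_pct <= 0:
        raise ValueError("malignant percentage must be positive")
    return aus_pct / malignant_pct


# Overall percentages reported in the text for DC III and DC VI.
ratio = aus_to_malignant_ratio(aus_pct=9.6, malignant_pct=5.4)

# Range recommended by Krane et al. as cited in the text.
LOWER, UPPER = 1.0, 3.0
within_range = LOWER <= ratio <= UPPER
print(f"AUS/FLUS:malignant ratio = {ratio:.1f}, within 1-3: {within_range}")
```

The same function could be applied to each contributing institution's percentages to see which laboratories fall outside the proposed range, provided the per-study figures from table 1 are available.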
The FN/SFN DC also showed a very wide range of 1.2 to 25.3% (mean 10.1%) with a risk of malignancy of 26.1%. Interestingly, this finding is compatible with rates given for thyroid FNAs prior to TBSRTC. Approximately 70% of patients in this category underwent surgical excision of the thyroid nodule, attesting to the implications of the use of the term ‘neoplasm’. The mean risk of malignancy for cases classified as suspicious for malignancy and malignant was 75.2 and 98.6%, respectively. These rates are also consistent with those presented in pre-TBSRTC publications.
As evidenced by its high sensitivity and high NPV, TBSRTC has proven to be an effective and robust thyroid FNA classification scheme to guide the clinical management of patients with thyroid nodules. The findings of this analysis reflect a growing trend: many institutions, both nationally and worldwide, are adopting this reporting system to provide their clinicians with cytopathology reports that are clear and comprehensible and that permit comparisons and performance evaluations on a larger scale.