Abstract
Introduction: Clinical notes in electronic health records offer valuable insight into the symptom profiles and trajectories of patients with severe mental illness (SMI). However, systematically extracting symptoms at scale remains a challenge, especially in languages other than English. We developed a light, accurate, and interpretable natural language processing (NLP) algorithm to extract psychiatric phenotypes from Spanish clinical notes. Methods: We selected a set of 136 core psychiatric phenotypes and annotated 4,000 clinical note sections (e.g., Chief Complaint, Plan; called “documents”) and 240 complete visit notes (called “entries”) from two psychiatric hospitals in Colombia: Hospital Mental de Antioquia (HOMO) and Clínica San Juan de Dios Manizales (CSJDM). For phenotypes meeting frequency and inter-annotator reliability thresholds, we developed three NLP algorithms (HOMO, CSJDM, and COMBINED) for phenotype extraction and context labeling (e.g., negation, family history, uncertainty). We evaluated performance at the document and entry levels, as well as across hospitals. Results: Document-level performance at both hospitals was high (average F1 scores of 0.84 and 0.85). Moreover, on phenotypes meeting our document-level performance threshold of F1 ≥0.7, entry-level performance was high as well (average F1 of 0.75 and 0.78), as was the cross-hospital transportability of the algorithms (F1 of 0.75 HOMO-to-CSJDM and 0.77 CSJDM-to-HOMO). The COMBINED algorithm improved overall recall, without significantly decreasing precision (F1 of 0.78 and 0.77 on HOMO and CSJDM, respectively). The application of our algorithm for 50 high-performing phenotypes to the notes of 9,737 SMI patients highlighted the transdiagnostic nature of many core SMI phenotypes; 44/50 phenotypes were recorded in over 10% of patients across diagnoses. 
Multiple correspondence analysis further revealed variation in symptom space across diagnoses; while major depressive disorder and schizophrenia form distinct clusters, patients with bipolar disorder span the entire phenotypic spectrum. Conclusion: Our tool enables the systematic investigation of psychiatric symptoms from psychiatric notes, facilitating large-scale investigations in Spanish-speaking populations.
Introduction
Clinical notes within electronic health records (EHRs) capture detailed descriptions of patients’ symptoms, behaviors, treatment responses, and other information unavailable in structured format [1‒4]. In the absence of biomarkers for severe mental illness (SMI), a patient’s condition is most accurately described in free-text narratives, making these essential for diagnosis and treatment [5], and a valuable resource for biomedical research. Despite the potential for these clinical notes to serve as low-cost resources for deep, high-throughput phenotyping [6‒11], the availability of such tools remains limited, especially for non-English applications.
Recent studies have demonstrated the potential of clinical natural language processing (cNLP) to advance psychiatric research by enabling uniform phenotyping irrespective of conventional diagnostic categories, aligning with initiatives like the Research Domain Criteria [12] or the Hierarchical Taxonomy of Psychopathology [13], which promote transdiagnostic approaches to phenotyping. This alignment between computational methods and theoretical frameworks may promote the discovery of biological mechanisms or treatment targets that transcend traditional diagnostic boundaries. For example, McCoy et al. [14] developed a cNLP method to extract five Research Domain Criteria domains from clinical notes and further identified genetic associations with these phenotypes within an EHR-linked biobank in the USA. Similarly, Jackson et al. [15] developed the Clinical Record Interactive Search (CRIS) NLP system to extract transdiagnostic symptoms of SMI within the South London and Maudsley Hospital in the UK, a tool used for early detection of individuals at high risk of psychosis. Additionally, cNLP tools are being developed to improve patient care, with recent studies focusing on detecting risk for psychosis [16], predicting suicidal behavior and suicide attempts [4, 17, 18], and identifying mental health crises [19]. Despite these advancements, the development of cNLP tools for diverse global contexts remains limited.
Most cNLP tools are developed in high-resource, English-speaking environments and are validated using exclusively English-language clinical notes. While Spanish-language cNLP tools have been developed to extract phenotypes in certain clinical domains [20‒22], Spanish-language tools for the extraction of psychiatric phenotypes from EHR databases remain scarce, despite Spanish being the second most common native language in the world. Broad adoption of cNLP tools is not only hampered by language differences; even within the same language and country, differences across healthcare systems and cultural contexts can limit their transferability [23‒26]. This issue, while commonly acknowledged, is rarely addressed through multisite evaluations [23, 27, 28], partly because robust cNLP tools require substantial resources to be developed, validated, and deployed [29, 30]. While recent cNLP advances have favored transformer-based approaches for psychiatric phenotype extraction from English text [31, 32], pattern-based NLP remains a practical and effective alternative in many contexts. Pattern-based approaches enable the processing of millions of EHR documents with minimal infrastructure and without the need for specialized hardware (i.e., graphics processing units), which are largely unavailable in resource-constrained settings. Additionally, rule-based systems produce transparent, easily auditable outputs that are particularly valuable in clinical environments. Our aim with this study was to develop an accessible and scalable cNLP tool to be adopted in low-to-medium resource, Spanish-speaking settings, thereby accelerating adoption of EHRs in global mental health research.
Here, we describe the development of a cNLP tool to comprehensively extract psychiatric phenotypes from Spanish clinical notes and apply our algorithm to a cohort of over 9,000 SMI patients, analyzing symptom prevalence across diagnostic categories and demonstrating its potential to enable large-scale psychiatric research. Our approach relies on rule-based techniques [33, 34] to create a fast, light, accurate, and interpretable system for phenotype extraction, which can be realistically applied to full EHR databases comprising millions of notes without requiring the use of graphics processing units. We defined a comprehensive list of phenotypes through complementary strategies, including literature-based and data-driven ones, as well as through a hierarchical ontology in which “narrow” phenotypes aggregate into higher-order, “broad,” ones. To the best of our knowledge, this is the first tool that enables the accurate, rapid, and comprehensive extraction of psychiatric phenotypes from Spanish clinical text. We evaluated the transferability of our approach by developing and validating our algorithms across two psychiatric medical centers in Colombia: the Hospital Mental de Antioquia (HOMO) and the Clínica San Juan de Dios Manizales (CSJDM).
Methods
Our study comprised five stages (Fig. 1): (1) selection of phenotypes to extract; (2) phenotype annotation, which included identifying phenotype-rich sections of the clinical notes (hereafter referred to as “documents”), selecting patients for review of complete clinical notes (hereafter referred to as “entries”), independent annotation of documents, selection of document test sets, and consensus annotation of the test sets (documents and entries); (3) algorithm development from rule-based patterns; (4) algorithm evaluation at three levels (document level, entry level, and across hospitals); and (5) application to phenotyping of a large SMI cohort. All analyses were done in Python (v3.10.8) and R (v4.2.2).
Schematic depiction of the development and evaluation of the NLP algorithms. a Document selection: 2,000 documents and 120 full notes (“entries”) corresponding to SMI patients were selected from the EHR database of each hospital. b Development set documents were independently annotated by two clinicians; for the test sets, a consensus annotation was generated for a subset of the documents (n = 309 and n = 358 in HOMO and CSJDM, respectively) and the 120 entries. c The annotations on the development set documents (n = 1,691 and n = 1,642) were used to manually generate patterns for medSpaCy. d The algorithms were tested on the consensus annotation of documents and entries. Three sets of patterns were used for entry-level testing: those coming from the annotations of the same hospital, those from the other hospital, and a combination of both (“COMBINED”). e Phenotype filtering through annotation and evaluation steps in HOMO (green) and CSJDM (blue).
Setting
We accessed clinical notes from two psychiatric hospitals in Colombia: HOMO (370 beds), the only public psychiatric hospital in the department of Antioquia (population 7 million), and CSJDM (220 beds), the primary mental health clinic for the department of Caldas (population 1 million), which treats patients regardless of insurance status. As part of three genetic studies (Paisa Project, Misión Origen, and PUMAS) [35‒37], our team has enrolled patients at HOMO since 2020 and at CSJDM since 2017. Eligible participants were at least 18 years of age and had a diagnosis of SMI (online suppl. note 1; for all online suppl. material, see https://doi.org/10.1159/000546480). For this study, we accessed the records of consenting participants at HOMO spanning 2013–2022 and a de-identified version of the entire EHR database at CSJDM spanning 2016–2022. The composition of clinical notes from both hospitals is described in online supplementary note 2.
Selection of Phenotypes
To compile 136 candidate phenotypes to extract from clinical notes (online suppl. Table 1), we first translated 54 phenotypes from the CRIS NLP Applications Library [15]. Then, we identified 28 phenotypes through the review of 20 random CSJDM entries by Colombia- and US-based psychiatrists and clinical researchers. An additional 46 phenotypes were identified during the annotation of the first 100 documents. Finally, the phenotype hierarchy introduced eight additional parent phenotypes (e.g., “psychomotor alterations” as a parent of “psychomotor delay,” online suppl. Table 2). All but these eight parent phenotypes (i.e., 128 phenotypes) were directly annotated in the annotation stage.
Phenotype Annotation
Selection of Documents. Clinical notes at both hospitals are divided into distinct fields (online suppl. note 2), including physical and mental status exams, and “SOAP” components: subjective, objective, assessment, and plan. To increase the likelihood of identifying phenotypes for annotation, we focused on the five fields with the highest concentration of phenotypes as identified using regular expression matching (online suppl. note 3). We randomly selected 400 documents (30 characters or longer) from each field, resulting in a total of 2,000 documents per hospital – corresponding to 1,828 patients in HOMO and 1,821 in CSJDM (Fig. 1a).
Independent Annotation of Documents. Two expert research clinicians independently annotated 128 phenotypes in 2,000 documents from each clinic using WebAnno 3.6.5 [38, 39] (Fig. 1b). To maximize alignment between annotators, the first 20 documents were reviewed and discussed jointly to resolve phenotyping ambiguities. Additional reviews were done at the completion of 50, 100, and 500 document annotations. MedSpaCy’s English ConText algorithm [40] uses cue phrases to detect instances of negation, uncertainty, hypothetical scenarios, and historical or third-party references. We annotated corresponding cue phrases for these categories to adapt this algorithm for Spanish.
Selection of Documents for Test Set. We split the 2,000 annotated documents from each hospital into development and test sets. Since a random split would likely result in an imbalance of low-frequency phenotypes, we ensured an adequate balance by requiring that at least 50% of the instances of each phenotype appear in the development set and at least 20% appear in the test set (online suppl. note 4). This resulted in 1,691 (development) and 309 (test) documents at HOMO, and 1,642 and 358 documents at CSJDM (Fig. 1b).
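The balance requirement can be written as a check on any candidate split. The following is a minimal sketch, assuming each document is represented simply as the list of phenotype labels annotated in it. The function names and the data layout are illustrative assumptions; the actual selection procedure is described in online suppl. note 4.

```python
from collections import Counter


def phenotype_counts(docs):
    """Count annotated instances of each phenotype across a list of documents."""
    counts = Counter()
    for labels in docs:
        counts.update(labels)
    return counts


def split_is_balanced(dev_docs, test_docs, min_dev=0.5, min_test=0.2):
    """Return True if every phenotype has at least `min_dev` of its instances
    in the development set and at least `min_test` in the test set."""
    dev = phenotype_counts(dev_docs)
    test = phenotype_counts(test_docs)
    for phenotype in set(dev) | set(test):
        total = dev[phenotype] + test[phenotype]
        if dev[phenotype] / total < min_dev or test[phenotype] / total < min_test:
            return False
    return True


# Toy example (hypothetical labels): both phenotypes satisfy the thresholds.
dev_example = [["anxiety", "insomnia"], ["anxiety"], ["anxiety"], ["insomnia"]]
test_example = [["anxiety"], ["insomnia"]]
balanced = split_is_balanced(dev_example, test_example)
```

A random split can be drawn repeatedly and accepted only once this check passes, which is one simple way to protect low-frequency phenotypes from ending up entirely on one side of the split.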
Selection of Patients for Chart Review (Entries). To evaluate our NLP framework at the entry level (i.e., complete notes), we randomly selected 120 patients from each hospital (240 total), 40 from each of the three diagnostic categories: major depressive disorder (MDD), bipolar disorder (BD), and schizophrenia (SCZ; Fig. 1a). Diagnostic labels were obtained from the EHR as the primary diagnosis at recruitment. To capture heightened symptom severity, entries were prioritized in the following order: (1) the most recent hospital intake note, (2) the most recent emergency department note, or (3) the most recent outpatient note.
Independent Annotation of Patient Entries. Two annotators independently identified and annotated phenotypes in each entry. Furthermore, to evaluate the ConText cues, we categorized phenotypes as “non-asserted” if they were negated, uncertain, hypothetical, historical, or referred to someone other than the patient, and “asserted” otherwise.
Consensus Annotation of Test Sets (Documents and Entries). To ensure a consistent and accurate baseline for evaluation, we performed consensus annotations on the document- and entry-level test sets, where annotators jointly resolved any discrepancies.
Support and Inter-Annotator Agreement (IAA). For each phenotype in the development set, we calculated Cohen’s kappa [41] and support (union of instances identified by either annotator). Parent phenotypes included all hierarchical descendants. Pattern development was only done for phenotypes with support ≥10 and kappa ≥0.6.
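For readers who wish to reproduce this filter, the sketch below computes Cohen’s kappa and support for one phenotype. It assumes that agreement is scored per document as binary presence/absence and that support is counted as the number of documents flagged by either annotator; both are simplifying assumptions, since the unit of agreement is not specified here. The function names are illustrative.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of binary labels (0/1)."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # proportion of positives for annotator A
    p_b = sum(b) / n  # proportion of positives for annotator B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Kappa is undefined when both annotators use one identical label
        # throughout; this sketch treats that case as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def eligible_for_patterns(a, b, min_support=10, min_kappa=0.6):
    """Apply the support >= 10 and kappa >= 0.6 thresholds to one phenotype."""
    support = sum(1 for x, y in zip(a, b) if x or y)  # union of both annotators
    return support >= min_support and cohen_kappa(a, b) >= min_kappa


# Toy example: 10 documents, annotators disagree on two of them.
annotator_a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
annotator_b = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
kappa = cohen_kappa(annotator_a, annotator_b)
```

In this toy example the observed agreement is 0.80 and the chance agreement is 0.52, so kappa is 0.28/0.48, below the 0.6 threshold.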
Algorithm Development
Our NLP pipeline was built using spaCy (v3.5.4) [42] and its clinical extension, medSpaCy (v1.1.2) [33]. spaCy combines machine-learning and pattern-based methods for constructing high-performance NLP pipelines, while medSpaCy provides components for cNLP operations, including clinical sentence tokenization (using the RuSH algorithm [38]), named entity recognition (NER), and context analysis (using the ConText algorithm [40]). The heart of our algorithm lies in its NER patterns. We developed these manually based on the development sets from each hospital, resulting in two distinct algorithms (“HOMO” and “CSJDM”; Fig. 1c; online suppl. Table 3). Each NER pattern represents a concept’s syntactic or phrasal variant, accounting for lexical and morphological variations, verb conjugations, lemma derivations, and common typos. We also generated a common set of RuSH patterns (sentence tokenizers) based on medSpaCy’s defaults and Spanish ConText patterns derived from the annotated cues.
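To make the interplay of NER and ConText patterns concrete, the sketch below imitates the idea using only Python’s standard library. It is not the medSpaCy implementation: regular expressions stand in for token-based NER patterns, and a fixed window of preceding words stands in for ConText’s scope rules. The patterns, the cue lists, and the function name are illustrative assumptions rather than entries from the published pattern set.

```python
import re

# Hypothetical target patterns: phenotype -> regex covering a few surface variants.
TARGET_PATTERNS = {
    "apathy": re.compile(r"\bapat[ií]a\b|\btodo me da igual\b", re.IGNORECASE),
    "hallucinations": re.compile(r"\balucinaci[oó]n(?:es)?\b", re.IGNORECASE),
}

# Hypothetical cue phrases that modify a phenotype mention that follows them.
CONTEXT_CUES = {
    "negated": re.compile(r"\b(?:no|niega|sin)\b", re.IGNORECASE),
    "uncertain": re.compile(r"\b(?:posible|probable)\b", re.IGNORECASE),
}


def extract(sentence, window=4):
    """Return (phenotype, status) pairs for one sentence.

    status is 'asserted' unless a cue occurs within `window` words
    before the mention, in which case it is the cue's category.
    """
    results = []
    for phenotype, pattern in TARGET_PATTERNS.items():
        for match in pattern.finditer(sentence):
            preceding = " ".join(sentence[:match.start()].split()[-window:])
            status = "asserted"
            for category, cue in CONTEXT_CUES.items():
                if cue.search(preceding):
                    status = category
                    break
            results.append((phenotype, status))
    return results


negated_example = extract("Paciente niega alucinaciones")
mixed_example = extract("Presenta apatía, sin alucinaciones")
```

Because each rule is a readable pattern tied to a single phenotype, a reviewer can trace any extracted mention back to the rule that produced it, which is the auditability property motivating the rule-based design.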
Algorithm Evaluation
Document-Level Evaluation. We evaluated each algorithm’s phenotype recognition performance on its respective test set (Fig. 1d), excluding phenotypes with fewer than three instances in the test set and testing each remaining phenotype individually to prevent ambiguity due to overlapping spans. For each phenotype, we measured precision, recall, and F1 score, and analyzed the correlation between these performance metrics and factors like support and IAA. High-performing phenotypes (F1 ≥0.70) were further evaluated at the entry level (Fig. 1e).
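The per-phenotype metrics follow the standard definitions, sketched below under the assumption that gold and predicted annotations for one phenotype are compared as sets of identifiers (here, hypothetical document IDs). Averages are taken across phenotypes, so a mean F1 need not equal the harmonic mean of the mean precision and mean recall.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall, and F1 for one phenotype, given sets of identifiers."""
    tp = len(gold & predicted)   # correctly extracted
    fp = len(predicted - gold)   # extracted but not annotated
    fn = len(gold - predicted)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_average(per_phenotype):
    """Average each metric across phenotypes (unweighted mean)."""
    n = len(per_phenotype)
    return tuple(sum(m[i] for m in per_phenotype) / n for i in range(3))


# Toy example with hypothetical document IDs.
scores_a = precision_recall_f1({1, 2, 3, 4, 5}, {2, 3, 4, 5, 6})
scores_b = precision_recall_f1({1, 2, 3, 4}, {1})
mean_p, mean_r, mean_f1 = macro_average([scores_a, scores_b])
```

In the toy example the first phenotype scores 0.8/0.8/0.8 and the second 1.0/0.25/0.4, giving means of 0.90/0.525/0.60, whereas the harmonic mean of 0.90 and 0.525 would be about 0.66.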
Patient Entry-Level Evaluation. We assessed each algorithm’s ability to detect asserted phenotypes within full patient entries, evaluating the NER and ConText components together. As above, each algorithm was tested on its respective test set, excluding phenotypes with fewer than three instances in the test set and testing each remaining phenotype individually.
Cross-Hospital Evaluation. We assessed our algorithms’ transportability between hospitals through cross-testing and combined testing (Fig. 1d). For cross-testing, we evaluated HOMO’s algorithm on CSJDM’s entry-level test set (HOMO-to-CSJDM) and vice versa (CSJDM-to-HOMO). For combined testing, we merged the NER patterns from both hospitals into a single algorithm (hereafter referred to as “COMBINED”) and evaluated it on each entry-level test set.
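Conceptually, building the COMBINED algorithm amounts to taking the union of the two hospitals’ pattern sets for each phenotype. The sketch below illustrates this with patterns represented as plain strings; the representation, the de-duplication step, and the function name are assumptions made for illustration, and the published pattern files may be organized differently.

```python
def merge_patterns(*pattern_sets):
    """Union of per-phenotype pattern lists, preserving first-seen order
    and dropping exact duplicates."""
    combined = {}
    for patterns in pattern_sets:
        for phenotype, rules in patterns.items():
            bucket = combined.setdefault(phenotype, [])
            for rule in rules:
                if rule not in bucket:
                    bucket.append(rule)
    return combined


# Hypothetical pattern sets for two hospitals.
homo_patterns = {"anxiety": ["ansiedad", "ansioso"], "insomnia": ["insomnio"]}
csjdm_patterns = {"anxiety": ["ansiedad", "angustia"], "apathy": ["apatía"]}
combined_patterns = merge_patterns(homo_patterns, csjdm_patterns)
```

A union can only add matches, which is consistent with the expectation that merging mainly raises recall, while any loss of precision depends on how specific the added patterns are.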
Application to Phenotyping
Application to EHR of SMI Patients. We applied the COMBINED algorithm to intake and progress notes of SMI patients at CSJDM, recorded between 2016 and 2022 (online suppl. note 5). Individuals were labeled as MDD, BD, or SCZ, based on their most recent SMI ICD-10 code. We included only phenotypes achieving F1 ≥0.70 in the evaluation of COMBINED-to-CSJDM, though different thresholds could be used depending on the analysis goals. We analyzed patients with two or more intake or progress notes [34], defining phenotypes as present if mentioned at least once as asserted in any note. Using logistic regression, we tested for associations between phenotypes and diagnoses (online suppl. note 6) after accounting for note count and adjusting p values for multiple comparisons (Bonferroni) [34].
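The patient-level presence definition and the Bonferroni adjustment can be sketched as follows. The input layout (tuples of patient ID, phenotype, and an asserted flag, plus a per-patient note count) and the function names are illustrative assumptions; the logistic regression itself is omitted and is described in online suppl. note 6.

```python
def phenotype_presence(mentions, note_counts, min_notes=2):
    """Map each eligible patient to the set of phenotypes asserted at least once.

    mentions: iterable of (patient_id, phenotype, asserted) tuples
    note_counts: dict of patient_id -> number of intake or progress notes
    """
    present = {pid: set() for pid, n in note_counts.items() if n >= min_notes}
    for pid, phenotype, asserted in mentions:
        if asserted and pid in present:
            present[pid].add(phenotype)
    return present


def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy example with hypothetical patients.
note_counts_example = {"p1": 3, "p2": 1}
mentions_example = [
    ("p1", "anxiety", True),
    ("p1", "insomnia", False),  # negated or otherwise non-asserted mention
    ("p2", "anxiety", True),    # excluded: patient has fewer than two notes
]
presence = phenotype_presence(mentions_example, note_counts_example)
adjusted = bonferroni([0.01, 0.04, 0.5])
```

Because a phenotype counts as present after a single asserted mention, patients with more notes have more opportunities to be flagged, which is why note count is included as a covariate in the association tests.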
Phenotype Analysis of SMI Patients. We used multiple correspondence analysis (MCA) [43] to investigate phenotype co-occurrence within and between SMI diagnoses. This projected the patients’ phenotypic profiles onto a lower dimensional space for visualization. For clarity of presentation, we randomly selected 1,000 patients from each diagnostic category for plotting.
Results
Annotation and IAA
Our annotation of 136 phenotypes across 4,000 documents yielded 11,900 unique mentions at HOMO and 15,862 at CSJDM (the full set of annotations can be found in online suppl. Table 3, and observations regarding the consensus results in online suppl. note 7). After excluding phenotypes with insufficient support (44 at HOMO and 40 at CSJDM) or low reliability (eight in each hospital), we retained 84 phenotypes at HOMO and 88 at CSJDM for NER pattern development. These phenotypes showed high IAA, with average kappa values of 0.88 (HOMO) and 0.87 (CSJDM).
Generation of Patterns for NER, Sentence Tokenizer, and ConText
We developed 601 NER patterns for 84 phenotypes at HOMO and 695 patterns for 88 phenotypes at CSJDM (online suppl. Fig. 2), optimizing precision and recall on the development set (Fig. 2; online suppl. Fig. 3). Recall in the test set improved with increasing size of the development set, typically plateauing after the annotation of 500–1,000 documents (Fig. 3; online suppl. Fig. 4). Additionally, we created 56 RuSH patterns for sentence tokenization and 103 ConText patterns (24 negation, 27 uncertainty, 18 hypothetical, 24 historical, and 9 third-party references). All patterns are available at https://github.com/clarafrydman/Spanish_Psych_Phenotyping for broad use and further development.
Process of deriving phenotype-specific patterns from annotations for two phenotypes, referential thinking (a) and apathy (b), in the CSJDM algorithm, alongside saturation curves showing the changes in recall and precision in the development set for every new pattern (e.g., “referential” pattern) added to a concept (i.e., “referential thinking”). Some patterns included both LEMMA matching (base word forms, e.g., running → run) and FUZZY matching (allowing for minor spelling variations). Terms shown in Spanish are translated as follows: ideas delirantes de referencialidad (“delusions of reference”); apatía (“apathy”); todo me da igual (“everything is the same to me,” or “I don’t care about anything”); no me importa la vida (“I don’t care about life”); pérdida del interés (“loss of interest”).
Saturation curves showing change in recall for the phenotypes referential thinking (a) and apathy (b) at the document-level evaluation in CSJDM, as a function of number of documents annotated as the development set. Solid line and shaded area correspond to mean and 95% CI, respectively, of five random samples.
Evaluation of the Algorithms
Our NER patterns achieved high document-level performance. Phenotypes at HOMO (test set n = 309) had mean precision/recall/F1 of 0.90/0.82/0.84, while phenotypes at CSJDM (test set n = 358) scored 0.91/0.82/0.85 (online suppl. Table 5). In both settings, phenotype performance (F1) was positively correlated with frequency in the test set (HOMO: r = 0.33, p = 0.002; CSJDM: r = 0.25, p = 0.016), IAA (HOMO: r = 0.40, p = 0.0003; CSJDM: r = 0.54, p = 4.93e−8), and annotation-to-pattern ratio (HOMO: r = 0.31, p = 0.005; CSJDM: r = 0.45, p = 1.02e−5) (online suppl. Fig. 5), indicating better performance for common phenotypes with consistent phrasing. After excluding 14 and 15 phenotypes with F1 <0.70, we retained 70 and 73 phenotypes at HOMO and CSJDM, respectively, for reliable extraction from documents. Of these, 56 were shared between the two hospitals (online suppl. Table 5; Fig. 1e).
At the entry level, both algorithms maintained strong performance when combining NER and ConText patterns. After removing phenotypes with insufficient support across 120 entries, we evaluated 69 phenotypes at HOMO and 70 at CSJDM. HOMO achieved mean precision/recall/F1 of 0.86/0.71/0.75, while CSJDM scored 0.88/0.74/0.78 (online suppl. Table 6). This within-hospital evaluation identified 50 high-performing phenotypes in each hospital (F1 ≥0.70), of which 30 were shared across hospitals (Fig. 1e). Some phenotypes showed marked cross-hospital differences – for example, “low energy” achieved an F1 score of 0.95 at CSJDM but 0.44 at HOMO, while “thought disorder” achieved an F1 score of 0.78 at HOMO but 0.49 at CSJDM. Online supplementary Table 9 and online supplementary Figures 6 and 7 detail each phenotype’s status at each evaluation stage.
Cross-Hospital Evaluation
Cross-hospital evaluation demonstrated high portability of all algorithms, with most phenotypes (70–76%) maintaining high performance (Fig. 4; online suppl. Fig. 8; online suppl. Table 6). On the HOMO entries (n = 120), the mean F1 was 0.77 for the CSJDM algorithm (69 phenotypes) and 0.78 for the COMBINED algorithm (69 phenotypes). On the CSJDM entries (n = 120), the mean F1 was 0.75 for the HOMO algorithm (64 phenotypes) and 0.77 for the COMBINED algorithm (70 phenotypes). The COMBINED algorithm outperformed the hospital-specific algorithms on their native test sets for a subset of phenotypes, surpassing HOMO in 35/69 phenotypes and CSJDM in 17/70 phenotypes. Furthermore, the COMBINED algorithm improved recall by >0.1 for 22 phenotypes at HOMO and 8 at CSJDM, likely due to these phenotypes’ lower support in their development sets. These results demonstrate both successful cross-hospital transferability and the superior performance of the COMBINED algorithm (online suppl. Table 6).
Support and F1 scores of each phenotype evaluated for all algorithms on the CSJDM entry-level test set (n = 120). a F1 scores for the CSJDM (dark blue), HOMO (green), and COMBINED (light blue) algorithms, on the CSJDM test set. Only the 53 phenotypes passing upstream quality criteria in CSJDM and HOMO are displayed: a minimum of ten annotations in the HOMO and CSJDM development sets as well as kappa values of at least 0.60, an F1 score of at least 0.70 for the document-level evaluation in both CSJDM and HOMO, and support ≥3 in the entry-level CSJDM test set. Phenotypes are ordered by support levels in the CSJDM test set. b Support for each phenotype in the CSJDM entry-level test set (blue, left axis, n = 120) and development set (orange, right axis, n = 1,642).
Eleven phenotypes consistently achieved very high performance (F1 ≥0.85) across algorithms and test sets, including “delusions,” “anxiety,” “worthlessness,” “hallucinatory behavior,” and “substance use.” In contrast, “poor insight” and “compromised judgment” consistently performed poorly (F1 ≤0.22 and ≤0.16, respectively) despite having high support (66–89; Fig. 4; online suppl. Fig. 8).
Application of the NLP Algorithm to the SMI Population
We used the COMBINED algorithm to analyze clinical notes of 9,737 SMI patients at CSJDM (4,618 MDD, 3,656 BD, 1,463 SCZ) with at least two intake or progress notes between 2016 and 2022 – comprising 25,599 intake and 315,190 progress notes. Of 50 phenotypes that achieved F1 ≥0.70 in the COMBINED-to-CSJDM evaluation (Fig. 5), most proved transdiagnostic: 44 were recorded in over 10% of patients of each diagnosis, and 21 in over one-third of them, including “substance use,” “anxiety,” “suicidal ideation,” and “insomnia” (online suppl. Table 7). At the same time, most phenotypes also showed diagnosis-specific associations (online suppl. Table 8). For example, MDD-enriched phenotypes compared to BD included “depressed mood” (OR = 4.96, p value = 7.13e−110), “worthlessness” (OR = 4.4, p value = 1.46e−159), and “suicidal ideation” (OR = 3.45, p value = 9.54e−113). SCZ-enriched phenotypes compared to BD included “altered affective modulation” (OR = 2.75, p value = 7.33e−27), “hallucinations” (OR = 2.72, p value = 1.73e−40), and “persecutory ideation” (OR = 2.68, p value = 4.68e−39).
Percentage of SMI patients with each of 50 phenotypes in CSJDM. Presence of the phenotype was based on detecting at least one asserted instance by the COMBINED algorithm in intake or progress notes of N = 9,737 individuals with at least two intake or progress notes.
Phenotype-Based MCA of CSJDM SMI Population
MCA of symptom profiles revealed distinct clusters of MDD and SCZ patients, with BD patients overlapping both groups – reflecting the common combination of psychotic and mood symptoms in BD (Fig. 6a). The factor plot showed variance driven by two main dimensions: the first being severity (presence/absence of symptoms) and the second being the spectrum between mood symptoms (e.g., “anhedonia,” “guilt,” “hopelessness”) and psychotic symptoms (e.g., “hallucinatory behavior,” “incoherence,” and “persecutory ideation”) (Fig. 6b).
MCA plot of SMI patients with at least two intake or progress notes at CSJDM (N = 9,737). MCA was based on binary presence/absence of 50 extracted phenotypes. a Plot of 1,000 randomly selected patients from each SMI category. b Factor map of MCA variables colored by symptom domain.
Discussion
In this study, we developed and validated a Spanish-language cNLP algorithm for extracting detailed psychiatric symptoms from clinical notes, enabling uniform analysis of symptom patterns across diagnostic categories. This work stands out in multiple ways. First, we targeted an extensive list of phenotypes, designed by combining top-down (literature-based) and bottom-up (data-driven) approaches. Second, we developed a rigorous evaluation workflow, testing both phenotype-level reliability and algorithm performance from the document level to the full-entry level. Third, we demonstrated cross-hospital transportability, highlighting the potential for broad adoption across clinical settings. Fourth, our systematic, transdiagnostic phenotyping of SMI at scale revealed extensive symptom overlap across diagnostic categories. Fifth, we address gaps in global mental health research by making our tool and annotations publicly available to foster research in Spanish-language mental health [44].
Several key advantages enable the broad adoption and future expansion of our tool. First, our method is lightweight and fast, making it particularly suitable for large-volume EHR datasets even in resource-constrained settings, addressing critical needs in global mental health informatics [29]. Our approach can be readily deployed within existing EHR infrastructure and used locally by clinicians or researchers without the need for specialized hardware. It supports both real-time symptom information summarization and large-scale retrospective analysis. Potential clinical applications include extracting relevant psychiatric symptoms and psychosocial history concepts from a patient’s longitudinal record to assist clinicians with diagnosis, risk assessments, and treatment plans. For researchers, the extracted phenotypes can facilitate cohort assembly and longitudinal analyses of disease trajectories, and can serve as predictors in clinical outcome studies. Second, our method is transparent and modular, allowing users to understand and modify individual rules to ensure compatibility with local documentation practices and terminology. We showed that incorporating data from multiple sources generally improved the algorithms’ performance, suggesting that expansion to additional hospitals could further enhance the tool’s applicability. Our finding that most phenotypes reach peak recall after a few hundred annotated documents could guide others aiming to refine our approach and optimize resource allocation in similar projects. While we used specific thresholds to filter phenotypes in our analyses, we are publishing the NER patterns for all phenotypes, allowing users to determine their own criteria and select relevant phenotypes as needed.
Most phenotypes meeting thresholds for support and IAA were reliably extracted at the document level (70/84 at HOMO and 73/88 at CSJDM). Unsurprisingly, phenotypes with lower IAA and greater phrasing variability proved more challenging for our pattern-based approach. Additionally, while most phenotypes evaluated at the entry level in HOMO and CSJDM (50/69 and 50/70) showed good performance, some (“poor insight,” “compromised judgment”) performed poorly across all settings, despite strong test-set support. One likely reason is that the note sections dedicated to these phenotypes (i.e., “insight” and “judgment and reasoning”) were not used for document annotation and typically contained brief, single-word summaries (e.g., “adequate,” “poor”), complicating pattern-based extraction and context detection. In practice, these fields resemble structured data more than narrative text; therefore, these concepts may be reliably extracted using simple heuristics or direct field-level parsing, rather than full-document NLP. Future implementations will analyze these semi-structured sections using simpler extraction methods rather than general-purpose NLP techniques.
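The field-level parsing suggested above could look roughly like the following Python sketch, which maps a short field value to a coarse status and then to a phenotype flag. This is not the authors' implementation: the Spanish field names (`introspeccion`, `juicio`), the term lists, and the function names are hypothetical and would need to be adapted to each hospital's note template.

```python
import re
import unicodedata

# Hypothetical vocabularies; illustrative assumptions, not the published patterns.
IMPAIRED_TERMS = {"pobre", "nulo", "nula", "ausente", "comprometido", "inadecuado"}
PARTIAL_TERMS = {"parcial"}
INTACT_TERMS = {"adecuado", "adecuada", "conservado", "normal"}

# Hypothetical mapping from section name to the phenotype it documents.
FIELD_TO_PHENOTYPE = {
    "introspeccion": "poor insight",
    "juicio": "compromised judgment",
}


def normalize(text):
    """Lowercase and strip accents so 'Introspección' matches 'introspeccion'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def classify_field(value):
    """Map a brief field value to 'impaired', 'partial', 'intact', or 'unknown'."""
    tokens = set(re.findall(r"[a-z]+", normalize(value)))
    if tokens & IMPAIRED_TERMS:
        return "impaired"
    if tokens & PARTIAL_TERMS:
        return "partial"
    if tokens & INTACT_TERMS:
        return "intact"
    return "unknown"


def extract_field_phenotypes(fields):
    """Return {phenotype: True/False} for the recognized semi-structured fields."""
    found = {}
    for name, value in fields.items():
        phenotype = FIELD_TO_PHENOTYPE.get(normalize(name).strip())
        if phenotype is not None:
            found[phenotype] = classify_field(value) == "impaired"
    return found


example = extract_field_phenotypes({"Introspección": "pobre", "Juicio": "adecuado"})
```

Because each field holds only one or two words, a lookup of this kind avoids the negation and context detection that full-document patterns require; values outside the vocabulary fall through to "unknown" and can be reviewed manually.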
Our cNLP tool was effective at extracting diagnosis-specific and transdiagnostic phenotypes, providing a comprehensive picture of SMI. For example, we observed high frequencies of “suicidal ideation” and “anxiety” across multiple diagnoses. Similarly, our finding that individuals with BD exhibit a phenotypic profile that bridges MDD and SCZ is consistent with its clinical definition as encompassing both mood episodes and psychosis [45]. These transdiagnostic profiles can support efforts in precision psychiatry by identifying patients who, while spanning different diagnoses, share clinical features and may benefit from similar treatments. They may also help clinicians detect significant shifts in a patient’s overall symptom profile over time. More broadly, this high-throughput phenotyping approach, driven by clinical features rather than diagnostic categories, can serve as a framework for future research into genetic associations, disease trajectories, and treatment response across psychiatric conditions. Furthermore, extracted phenotypes include observational features missing from standard assessment tools but often documented in clinical notes, such as poor personal hygiene [8, 44], highlighting the wealth of phenotypic information embedded within clinical narratives. By enabling comprehensive phenotyping, our approach paves the way for a deeper understanding of the relationship between symptoms and diagnostic categories.
Limitations and Future Work
We note several limitations to our approach and directions for future work that span three categories: coverage, methodological constraints, and temporal considerations. Regarding phenotype coverage, our initial annotation sets failed to capture several phenotypes considered core features of SMI, such as “decreased need for sleep” and “flight of ideas,” while others such as “grandiosity” only met frequency criteria in one of the hospitals. This lack of representation likely arises from the inherent imprecision of psychiatric phenotyping and from variations in culture, documentation practices, and training, all of which affect how symptoms are documented. For example, “grandiosity” could be described as a “delusion.” Regional variation in clinical training could also explain differences in performance between hospitals (e.g., “low energy” and “thought disorder”). Another contributor to low annotation support may be our document selection strategy, which prioritized sections by phenotype density rather than stratifying by diagnosis, potentially excluding diagnosis-specific phenotypes that are rare in the overall SMI population. This could be addressed in future work through stratified sampling of documents across diagnoses.
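The stratified sampling proposed above can be sketched in a few lines of Python. This is an assumption-laden illustration rather than a described procedure: the `diagnosis` key, the per-diagnosis quota, and the toy document list are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(documents, per_diagnosis, seed=0):
    """Draw up to per_diagnosis documents from each diagnosis group.

    documents is a list of dicts with a "diagnosis" key (hypothetical schema).
    Groups smaller than the quota are taken whole.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_diagnosis = defaultdict(list)
    for doc in documents:
        by_diagnosis[doc["diagnosis"]].append(doc)

    sample = []
    for diagnosis in sorted(by_diagnosis):
        group = by_diagnosis[diagnosis]
        k = min(per_diagnosis, len(group))
        sample.extend(rng.sample(group, k))
    return sample


# Toy corpus, invented for illustration: 10 MDD, 5 BD, and 3 SCZ documents.
toy_docs = (
    [{"id": f"mdd-{i}", "diagnosis": "MDD"} for i in range(10)]
    + [{"id": f"bd-{i}", "diagnosis": "BD"} for i in range(5)]
    + [{"id": f"scz-{i}", "diagnosis": "SCZ"} for i in range(3)]
)
selected = stratified_sample(toy_docs, per_diagnosis=4)
per_group = Counter(doc["diagnosis"] for doc in selected)
```

An equal quota per diagnosis deliberately over-represents rarer diagnoses relative to their population frequency, which is the point when the goal is to gather enough annotations for diagnosis-specific phenotypes.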
Methodologically, ultrarare phenotypes remain difficult to extract and validate with this approach, suggesting the need for larger annotation sets or alternative enrichment strategies. The improved recall of the COMBINED algorithm indicates that incorporating additional sites into the development process could enhance performance for such phenotypes. The emergence of LLMs offers promising opportunities to address these limitations while expanding our framework’s scope. Pattern-based approaches are well suited for direct application to full EHR databases, offering advantages over deep-learning-based models in terms of efficiency, hardware requirements, cost, speed [46, 47], interpretability, and overall performance [47‒49]. However, they are limited by their rigidity, and the use of LLMs for psychiatric phenotyping [50‒53] could improve flexibility and generalizability. Moreover, deep learning models may show greater recall [32] and better performance in the detection of rare or complex phenotypes, capturing new and diverse expressions through transfer learning without relying on extensive annotations. Thus, in future work, we aim to use the annotations generated here to fine-tune or prompt LLMs for phenotyping. Another direction for future work is the expansion of our methodology to capture social determinants of health, which are rarely documented in structured EHR data [54].
Finally, like other healthcare machine-learning applications [55, 56], our tool may be affected by EHR data drifts [57], including changes in documentation practices and regulations, requiring periodic updates to maintain effectiveness. Despite numerous factors complicating phenotype extraction, including a dataset spanning multiple years and recorded by dozens of clinicians across different states, our approach robustly extracted many core features of SMI.
Conclusion
Our study introduces a robust, Spanish-language clinical NLP algorithm for comprehensive psychiatric phenotyping. The transdiagnostic phenotypes extracted through these methods can be used across clinical and research applications, such as modeling of disease progression, monitoring treatment response, and enabling large-scale genetic investigations of the cause and course of SMI.
Statement of Ethics
This study was performed in accordance with the Declaration of Helsinki. This human study was approved by the Institutional Review Board (IRB), Medical Institutional Review Board 3, at UCLA, the Comité de Ética del Instituto de Investigaciones Médicas at Universidad de Antioquia (UdeA), the Comité de Ética en Investigación de la E.S.E at HOMO, and the Comité de Bioética de Clínica San Juan de Dios at CSJDM – Approval Nos. IRB#20-000149, IRB#20-001537, and IRB#16-002084. All adult participants provided written informed consent to participate in studies IRB#20-000149, IRB#20-001537, and IRB#16-002084; in addition, a waiver of consent and IRB approval by Comité de Bioética de Clínica San Juan de Dios at CSJDM was granted to access all EHR data. All privacy rights of human subjects have been observed.
Conflict of Interest Statement
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Funding Sources
This work was supported by the National Institute of Mental Health (Grant Nos. R01 MH123157 [to L.M.O.L., C.L.-J., and N.B.F.], R01 MH113078 [to C.L.-J. and N.B.F.], R00 MH116115 [to L.M.O.L.], U01MH125042 [to L.M.O.L., C.L.-J., and N.B.F.]); the Fulbright Commission in Colombia under the Fulbright-Minciencias grant (to J.F.D.L.H.); and the UCLA Cota-Robles Fellowship (to C.F.-G.).
Author Contributions
Juan F. De La Hoz and Clara Frydman-Gani: conceptualization, methodology, formal analysis, and writing – original draft; Alejandro Arias: data curation, methodology, and formal analysis; Maria Perez Vallejo, John Daniel Londoño Martínez, and Mauricio Castaño: formal analysis and methodology; Laura Mena and Ariel Seroussi: formal analysis; Susan K. Service: methodology; Ana M. Diaz-Zuluaga and Ana M. Ramirez-Diaz: data curation; Johanna Valencia-Echeverry: project administration; Victor I. Reus: conceptualization; Alex A.T. Bui: conceptualization and methodology; Nelson B. Freimer and Loes M. Olde Loohuis: conceptualization, methodology, supervision, and writing – original draft; Carlos Lopez-Jaramillo: data curation and supervision. All authors: writing – review and editing.
Additional Information
Juan F. De La Hoz and Clara Frydman-Gani contributed equally to this work.
Data Availability Statement
The full annotated corpus used to develop these algorithms consists of clinical notes containing protected patient data and cannot be shared publicly. The specific spans annotated for all concepts do not contain private health information and can be found in online supplementary Table 3. The full set of patterns developed as part of this work is publicly available for application and further development, and can be found at https://github.com/clarafrydman/Spanish_Psych_Phenotyping. Further inquiries can be directed to the corresponding author.