Introduction: Clinical notes in electronic health records offer valuable insight into the symptom profiles and trajectories of patients with severe mental illness (SMI). However, systematically extracting symptoms at scale remains a challenge, especially in languages other than English. We developed a light, accurate, and interpretable natural language processing (NLP) algorithm to extract psychiatric phenotypes from Spanish clinical notes.

Methods: We selected a set of 136 core psychiatric phenotypes and annotated 4,000 clinical note sections (e.g., Chief Complaint, Plan; called “documents”) and 240 complete visit notes (called “entries”) from two psychiatric hospitals in Colombia: Hospital Mental de Antioquia (HOMO) and Clínica San Juan de Dios Manizales (CSJDM). For phenotypes meeting frequency and inter-annotator reliability thresholds, we developed three NLP algorithms (HOMO, CSJDM, and COMBINED) for phenotype extraction and context labeling (e.g., negation, family history, uncertainty). We evaluated performance at the document and entry levels, as well as across hospitals.

Results: Document-level performance at both hospitals was high (average F1 scores of 0.84 and 0.85). Moreover, on phenotypes meeting our document-level performance threshold of F1 ≥0.7, entry-level performance was high as well (average F1 of 0.75 and 0.78), as was the cross-hospital transportability of the algorithms (F1 of 0.75 HOMO-to-CSJDM and 0.77 CSJDM-to-HOMO). The COMBINED algorithm improved overall recall, without significantly decreasing precision (F1 of 0.78 and 0.77 on HOMO and CSJDM, respectively). The application of our algorithm for 50 high-performing phenotypes to the notes of 9,737 SMI patients highlighted the transdiagnostic nature of many core SMI phenotypes; 44/50 phenotypes were recorded in over 10% of patients across diagnoses. Multiple correspondence analysis further revealed variation in symptom space across diagnoses; while major depressive disorder and schizophrenia form distinct clusters, patients with bipolar disorder span the entire phenotypic spectrum.

Conclusion: Our tool enables the systematic investigation of psychiatric symptoms from psychiatric notes, facilitating large-scale investigations in Spanish-speaking populations.

Clinical notes within electronic health records (EHRs) capture detailed descriptions of patients’ symptoms, behaviors, treatment responses, and other information unavailable in structured format [1‒4]. In the absence of biomarkers for severe mental illness (SMI), a patient’s condition is most accurately described in free-text narratives, making these essential for diagnosis and treatment [5], and a valuable resource for biomedical research. Despite the potential for these clinical notes to serve as low-cost resources for deep, high-throughput phenotyping [6‒11], the availability of such tools remains limited, especially for non-English applications.

Recent studies have demonstrated the potential of clinical natural language processing (cNLP) to advance psychiatric research by enabling uniform phenotyping irrespective of conventional diagnostic categories, aligning with initiatives like the Research Domain Criteria [12] or the Hierarchical Taxonomy of Psychopathology [13], which promote transdiagnostic approaches to phenotyping. This alignment between computational methods and theoretical frameworks may promote the discovery of biological mechanisms or treatment targets that transcend traditional diagnostic boundaries. For example, McCoy et al. [14] developed a cNLP method to extract five Research Domain Criteria domains from clinical notes and further identified genetic associations with these phenotypes within an EHR-linked biobank in the USA. Similarly, Jackson et al. [15] developed the Clinical Record Interactive Search (CRIS) NLP system to extract transdiagnostic symptoms of SMI within the South London and Maudsley Hospital in the UK, a tool used for early detection of individuals at high risk of psychosis. Additionally, cNLP tools are being developed to improve patient care, with recent studies focusing on detecting risk for psychosis [16], predicting suicidal behavior and suicide attempts [4, 17, 18], and identifying mental health crises [19]. Despite these advancements, the development of cNLP tools for diverse global contexts remains limited.

Most cNLP tools are developed in high-resource, English-speaking environments and are validated using exclusively English-language clinical notes. While Spanish-language cNLP tools have been developed to extract phenotypes in certain clinical domains [20‒22], Spanish-language tools for the extraction of psychiatric phenotypes from EHR databases remain scarce, despite Spanish being the second most common native language in the world. Broad adoption of cNLP tools is not only hampered by language differences; even within the same language and country, differences across healthcare systems and cultural contexts can limit their transferability [23‒26]. This issue, while commonly acknowledged, is rarely addressed through multisite evaluations [23, 27, 28], partly because robust cNLP tools require substantial resources to be developed, validated, and deployed [29, 30]. While recent cNLP advances have favored transformer-based approaches for psychiatric phenotype extraction from English text [31, 32], pattern-based NLP remains a practical and effective alternative in many contexts. Pattern-based approaches enable the processing of millions of EHR documents with minimal infrastructure and without the need for specialized hardware (i.e., graphics processing units), which are largely unavailable in resource-constrained settings. Additionally, rule-based systems produce transparent, easily auditable outputs that are particularly valuable in clinical environments. Our aim with this study was to develop an accessible and scalable cNLP tool to be adopted in low-to-medium resource, Spanish-speaking settings, thereby accelerating adoption of EHRs in global mental health research.

Here, we describe the development of a cNLP tool to comprehensively extract psychiatric phenotypes from Spanish clinical notes and apply our algorithm to a cohort of over 9,000 SMI patients, analyzing symptom prevalence across diagnostic categories and demonstrating its potential to enable large-scale psychiatric research. Our approach relies on rule-based techniques [33, 34] to create a fast, light, accurate, and interpretable system for phenotype extraction, which can be realistically applied to full EHR databases comprising millions of notes without requiring the use of graphics processing units. We defined a comprehensive list of phenotypes through complementary strategies, including literature-based and data-driven ones, as well as through a hierarchical ontology in which “narrow” phenotypes aggregate into higher-order, “broad,” ones. To the best of our knowledge, this is the first tool that enables the accurate, rapid, and comprehensive extraction of psychiatric phenotypes from Spanish clinical text. We evaluated the transferability of our approach by developing and validating our algorithms across two psychiatric medical centers in Colombia: the Hospital Mental de Antioquia (HOMO) and the Clínica San Juan de Dios Manizales (CSJDM).

Our study has five stages (Fig. 1): (1) selection of phenotypes to extract; (2) phenotype annotation, which includes identifying phenotype-rich sections of the clinical notes (hereafter referred to as “documents”), selecting patients for review of complete clinical notes (hereafter referred to as “entries”), independent annotation of documents, selection of document test sets, and consensus annotation of the test sets (documents and entries); (3) algorithm development from rule-based patterns; (4) algorithm evaluation at three levels (document level, entry level, and cross-hospital); and (5) application to phenotyping of a large SMI cohort. All analyses were done in Python (v3.10.8) and R (v4.2.2).

Fig. 1.

Schematic depiction of the development and evaluation of the NLP algorithms. a Document selection: 2,000 documents and 120 full notes (“entries”) corresponding to SMI patients were selected from the EHR database of each hospital. b Development set documents were independently annotated by two clinicians; for the test sets, a consensus annotation was generated for a subset of the documents (n = 309 and n = 358 in HOMO and CSJDM, respectively) and the 120 entries. c The annotations on the development set documents (n = 1,691 and n = 1,642) were used to manually generate patterns for medSpaCy. d The algorithms were tested on the consensus annotation of documents and entries. Three sets of patterns were used for entry-level testing: those derived from the annotations of the same hospital, those from the other hospital, and a combination of both (“COMBINED”). e Phenotype filtering through annotation and evaluation steps in HOMO (green) and CSJDM (blue).


Setting

We accessed clinical notes from two psychiatric hospitals in Colombia: HOMO (370 beds), the only public psychiatric hospital in the department of Antioquia (population 7 million), and CSJDM (220 beds), the primary mental health clinic for the department of Caldas (population 1 million), which treats patients regardless of insurance status. As part of three genetic studies (Paisa Project, Misión Origen, and PUMAS) [35‒37], our team has enrolled patients at HOMO since 2020 and at CSJDM since 2017. Eligible participants were at least 18 years of age and had a diagnosis of SMI (online suppl. note 1; for all online suppl. material, see https://doi.org/10.1159/000546480). For this study, we accessed the records of consenting participants at HOMO spanning 2013–2022 and a de-identified version of the entire EHR database at CSJDM spanning 2016–2022. The composition of clinical notes from both hospitals is described in online supplementary note 2.

Selection of Phenotypes

To compile 136 candidate phenotypes to extract from clinical notes (online suppl. Table 1), we first translated 54 phenotypes from the CRIS NLP Applications Library [15]. Then, we identified 28 phenotypes through the review of 20 random CSJDM entries by Colombia- and US-based psychiatrists and clinical researchers. An additional 46 phenotypes were identified during the annotation of the first 100 documents. Finally, the phenotype hierarchy introduced eight additional parent phenotypes (e.g., “psychomotor alterations” as a parent of “psychomotor delay,” online suppl. Table 2). All but these eight parent phenotypes (i.e., 128 phenotypes) were directly annotated in the annotation stage.
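
The roll-up from narrow to broad phenotypes can be illustrated with a minimal Python sketch. It is not the published implementation: the function name is ours, and the small hierarchy below is a hypothetical fragment in which only the “psychomotor delay” to “psychomotor alterations” link comes from the text (the second child is an assumed example).

```python
# Hypothetical fragment of the hierarchy, stored as child -> parent.
# Only the "psychomotor delay" link is taken from the text; the second
# child is an assumed example added for illustration.
PARENT_OF = {
    "psychomotor delay": "psychomotor alterations",
    "psychomotor agitation": "psychomotor alterations",
}


def with_ancestors(phenotypes, parent_of=PARENT_OF):
    """Return the annotated phenotypes plus every ancestor in the hierarchy."""
    expanded = set(phenotypes)
    for phenotype in phenotypes:
        current = phenotype
        # Walk up the tree until a phenotype without a parent is reached.
        while current in parent_of:
            current = parent_of[current]
            expanded.add(current)
    return expanded


annotated = {"psychomotor delay", "anxiety"}
print(sorted(with_ancestors(annotated)))
```

Under this scheme a parent phenotype never needs its own annotation: it is counted whenever any of its descendants is annotated.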

Phenotype Annotation

Selection of Documents. Clinical notes at both hospitals are divided into distinct fields (online suppl. note 2), including physical and mental status exams, and “SOAP” components: subjective, objective, assessment, and plan. To increase the likelihood of identifying phenotypes for annotation, we focused on the five fields with the highest concentration of phenotypes as identified using regular expression matching (online suppl. note 3). We randomly selected 400 documents (30 characters or longer) from each field, resulting in a total of 2,000 documents per hospital – corresponding to 1,828 patients in HOMO and 1,821 in CSJDM (Fig. 1a).

Independent Annotation of Documents. Two expert research clinicians independently annotated 128 phenotypes in 2,000 documents from each clinic using WebAnno 3.6.5 [38, 39] (Fig. 1b). To maximize alignment between annotators, the first 20 documents were reviewed and discussed jointly to resolve phenotyping ambiguities. Additional reviews were done at the completion of 50, 100, and 500 document annotations. The English ConText algorithm in medSpaCy [40] uses cue phrases to detect instances of negation, uncertainty, hypothetical scenarios, and historical or third-party references. To adapt this algorithm for Spanish, we also annotated the corresponding Spanish cue phrases for each of these categories.

Selection of Documents for Test Set. We split the 2,000 annotated documents from each hospital into development and test sets. Since a random split would likely result in an imbalance of low-frequency phenotypes, we ensured an adequate balance by requiring that at least 50% of the instances of each phenotype appear in the development set and at least 20% appear in the test set (online suppl. note 4). This resulted in 1,691 (development) and 309 (test) documents at HOMO, and 1,642 and 358 documents at CSJDM (Fig. 1b).
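
A minimal Python sketch of such a constrained split is given below. The procedure actually used is described in online suppl. note 4 and is not reproduced here; this sketch assumes simple rejection sampling (redraw a random split until both constraints hold), and the function names and the default `test_fraction` are our own illustrative choices.

```python
import random
from collections import Counter


def split_is_balanced(dev_docs, test_docs, min_dev=0.5, min_test=0.2):
    """Check the per-phenotype balance constraints of a candidate split.

    Each document is a list of phenotype labels, one per annotated instance.
    """
    dev_counts = Counter(label for doc in dev_docs for label in doc)
    test_counts = Counter(label for doc in test_docs for label in doc)
    for label in set(dev_counts) | set(test_counts):
        total = dev_counts[label] + test_counts[label]
        if dev_counts[label] < min_dev * total:
            return False
        if test_counts[label] < min_test * total:
            return False
    return True


def constrained_split(docs, test_fraction=0.15, max_tries=1000, seed=0):
    """Redraw random splits until the balance constraints hold (assumed strategy)."""
    rng = random.Random(seed)
    n_test = max(1, round(test_fraction * len(docs)))
    for _ in range(max_tries):
        shuffled = docs[:]
        rng.shuffle(shuffled)
        test, dev = shuffled[:n_test], shuffled[n_test:]
        if split_is_balanced(dev, test):
            return dev, test
    raise RuntimeError("no balanced split found; relax the constraints")
```

With many rare phenotypes, plain rejection sampling may need many redraws, so a stratified or greedy assignment could be preferable in practice.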

Selection of Patients for Chart Review (Entries). To evaluate our NLP framework at the entry level (i.e., complete notes), we randomly selected 120 patients from each hospital (240 total), 40 from each of three diagnostic categories: major depressive disorder (MDD), bipolar disorder (BD), and schizophrenia (SCZ; Fig. 1a). Diagnostic labels were obtained from the EHR as the primary diagnosis at recruitment. To capture heightened symptom severity, one entry per patient was chosen in the following order of priority: (1) the most recent hospital intake note, (2) the most recent emergency department note, and (3) the most recent outpatient note.

Independent Annotation of Patient Entries. Two annotators independently identified and annotated phenotypes in each entry. Furthermore, to evaluate the ConText cues, we categorized phenotypes as “non-asserted” if they were negated, uncertain, hypothetical, historical, or referred to someone other than the patient, and “asserted” otherwise.
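
The asserted versus non-asserted distinction can be sketched in a few lines of Python. This is a deliberately simplified stand-in for ConText, not the medSpaCy implementation: it only checks whether a cue phrase precedes the target within the sentence and ignores cue direction and scope termination. The cue phrases and function names are hypothetical.

```python
import re

# Hypothetical Spanish cue phrases, for illustration only; the patterns
# actually used in the study are not reproduced here.
CUES = {
    "negated": ["niega", "sin", "no presenta"],
    "uncertain": ["posible", "probable"],
    "historical": ["antecedente de"],
    "third_party": ["madre con", "padre con"],
}


def context_labels(sentence, target):
    """Return the context categories whose cue precedes the target."""
    text = sentence.lower()
    position = text.find(target.lower())
    if position == -1:
        raise ValueError("target not found in sentence")
    preceding = text[:position]
    labels = set()
    for category, cues in CUES.items():
        for cue in cues:
            # Whole-word match so that "sin" does not fire inside "sintomas".
            if re.search(r"\b" + re.escape(cue) + r"\b", preceding):
                labels.add(category)
    return labels


def is_asserted(sentence, target):
    """A mention counts as asserted only when no context cue modifies it."""
    return not context_labels(sentence, target)
```

For example, `is_asserted("Niega ideas suicidas", "ideas suicidas")` is false because the negation cue “niega” precedes the phenotype mention.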

Consensus Annotation of Test Sets (Documents and Entries). To ensure a consistent and accurate baseline for evaluation, we performed consensus annotations on the document- and entry-level test sets, where annotators jointly resolved any discrepancies.

Support and Inter-Annotator Agreement (IAA). For each phenotype in the development set, we calculated Cohen’s kappa [41] and support (union of instances identified by either annotator). Parent phenotypes included all hierarchical descendants. Pattern development was only done for phenotypes with support ≥10 and kappa ≥0.6.
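
As a worked illustration of this filter, the sketch below computes Cohen’s kappa and support for a single phenotype. It assumes, for simplicity, that each annotator’s output is reduced to one binary label per document (1 = phenotype annotated); the unit of agreement used in the study may differ, and the function names are ours.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' binary labels over the same documents."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    # Agreement expected by chance from each annotator's label frequencies.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def support(labels_a, labels_b):
    """Number of documents flagged by either annotator (union)."""
    return sum(1 for a, b in zip(labels_a, labels_b) if a or b)


def passes_filter(labels_a, labels_b, min_support=10, min_kappa=0.6):
    """Apply the support and kappa thresholds stated in the text."""
    return (support(labels_a, labels_b) >= min_support
            and cohen_kappa(labels_a, labels_b) >= min_kappa)
```

Because kappa corrects for chance agreement, a rare phenotype can show high raw agreement yet a low kappa, which is why both thresholds are applied together.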

Algorithm Development

Our NLP pipeline was built using spaCy (v3.5.4) [42] and its clinical extension, medSpaCy (v1.1.2) [33]. spaCy combines machine-learning and pattern-based methods for constructing high-performance NLP pipelines, while medSpaCy provides components for cNLP operations, including clinical sentence tokenization (using the RuSH algorithm [38]), named entity recognition (NER), and context analysis (using the ConText algorithm [40]). The heart of our algorithm lies in its NER patterns. We developed these manually based on the development sets from each hospital, resulting in two distinct algorithms (“HOMO” and “CSJDM”; Fig. 1c; online suppl. Table 3). Each NER pattern represents a concept’s syntactic or phrasal variant, accounting for lexical and morphological variations, verb conjugations, lemma derivations, and common typos. We also generated a common set of RuSH patterns (sentence tokenizers) based on medSpaCy’s defaults and Spanish ConText patterns derived from the annotated cues.
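
To make the idea of phrasal variants concrete, the following Python sketch matches a phenotype against a short list of surface forms while tolerating accents and small typos. It is a standard-library stand-in, not the spaCy or medSpaCy matcher used in the study: the variant list, the similarity threshold, and the function names are assumptions, and lemma matching is not covered.

```python
import difflib
import re

# Hypothetical surface variants for one phenotype ("apathy"), stored
# without accents; the published patterns are not reproduced here.
APATHY_VARIANTS = ["apatia", "todo me da igual", "perdida del interes"]


def normalize(text):
    """Lowercase the text and strip accents common in Spanish."""
    table = str.maketrans("áéíóúü", "aeiouu")
    return text.lower().translate(table)


def find_phenotype(text, variants, min_ratio=0.85):
    """Return the variants found in text, tolerating small typos."""
    tokens = re.findall(r"\w+", normalize(text))
    found = []
    for variant in variants:
        width = len(variant.split())
        # Slide a window with the variant's token length over the text.
        for start in range(len(tokens) - width + 1):
            window = " ".join(tokens[start:start + width])
            ratio = difflib.SequenceMatcher(None, window, variant).ratio()
            if ratio >= min_ratio:
                found.append(variant)
                break
    return found


print(find_phenotype("Paciente con apatía marcada", APATHY_VARIANTS))
```

A lower `min_ratio` catches more misspellings at the cost of precision, which mirrors the precision and recall trade-off faced when adding each new pattern.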

Algorithm Evaluation

Document-Level Evaluation. We evaluated each algorithm’s phenotype recognition performance on its respective test set (Fig. 1d). Phenotypes with fewer than three instances in the test set were excluded, and each remaining phenotype was tested individually to prevent ambiguity due to overlapping spans. For each phenotype, we measured precision, recall, and F1 score, and analyzed the correlation between these performance metrics and factors like support and IAA. High-performing phenotypes (F1 ≥0.70) were further evaluated at the entry level (Fig. 1e).
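
The per-phenotype metrics can be computed as in the sketch below. It assumes that gold and predicted annotations are reduced to sets of document identifiers in which the phenotype is present; the study may instead score individual spans, and the function name is ours.

```python
def precision_recall_f1(gold, predicted):
    """Score one phenotype from sets of gold and predicted document IDs."""
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold_docs = {1, 2, 3, 4}
predicted_docs = {1, 2, 3}
precision, recall, f1 = precision_recall_f1(gold_docs, predicted_docs)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this toy case one gold document is missed, so recall drops to 0.75 while precision stays at 1.0, and the phenotype would still clear the F1 ≥0.70 threshold.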

Patient Entry-Level Evaluation. We assessed each algorithm’s ability to detect asserted phenotypes within full patient entries, thereby evaluating the NER and ConText components together. As above, each algorithm was tested on its respective test set; phenotypes with fewer than three instances in the test set were excluded, and the remaining phenotypes were tested individually.

Cross-Hospital Evaluation. We assessed our algorithms’ transportability between hospitals through cross-testing and combined testing (Fig. 1d). For cross-testing, we evaluated HOMO’s algorithm on CSJDM’s entry-level test set (HOMO-to-CSJDM) and vice versa (CSJDM-to-HOMO). For combined testing, we merged the NER patterns from both hospitals into a single algorithm (hereafter referred to as “COMBINED”) and evaluated it on each entry-level test set.
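
Conceptually, the COMBINED algorithm is the union of the two hospital-specific pattern sets. A minimal Python sketch, assuming patterns are stored as a dictionary from phenotype name to a list of variants (a hypothetical representation, with toy patterns), could look like this:

```python
def combine_patterns(*pattern_sets):
    """Merge per-phenotype pattern lists, dropping exact duplicates in order."""
    combined = {}
    for patterns in pattern_sets:
        for phenotype, variants in patterns.items():
            merged = combined.setdefault(phenotype, [])
            for variant in variants:
                if variant not in merged:
                    merged.append(variant)
    return combined


# Toy pattern sets; the real ones contain hundreds of patterns per hospital.
homo_patterns = {"apathy": ["apatia"], "anxiety": ["ansiedad"]}
csjdm_patterns = {"apathy": ["apatia", "todo me da igual"]}
print(combine_patterns(homo_patterns, csjdm_patterns))
```

A union of this kind can only add matches, which is consistent with a merged pattern set tending to raise recall while leaving precision to be checked empirically.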

Application to Phenotyping

Application to EHR of SMI Patients. We applied the COMBINED algorithm to intake and progress notes of SMI patients at CSJDM, recorded between 2016 and 2022 (online suppl. note 5). Individuals were labeled as MDD, BD, or SCZ, based on their most recent SMI ICD-10 code. We included only phenotypes achieving F1 ≥0.70 in the evaluation of COMBINED-to-CSJDM, though different thresholds could be used depending on the analysis goals. We analyzed patients with two or more intake or progress notes [34], defining phenotypes as present if mentioned at least once as asserted in any note. Using logistic regression, we tested for associations between phenotypes and diagnoses (online suppl. note 6) after accounting for note count and adjusting p values for multiple comparisons (Bonferroni) [34].
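
The patient-level definition and the multiple-comparison adjustment described above can be sketched as follows. The data layout (tuples of patient, phenotype, and assertion status) and the function names are assumptions made for illustration, and the logistic regression itself is not shown.

```python
def patient_phenotypes(mentions, note_counts, min_notes=2):
    """Map each eligible patient to the phenotypes asserted at least once.

    mentions: iterable of (patient_id, phenotype, asserted) tuples.
    note_counts: dict of patient_id -> number of intake or progress notes.
    """
    # Only patients with enough notes are analyzed.
    present = {pid: set() for pid, n in note_counts.items() if n >= min_notes}
    for pid, phenotype, asserted in mentions:
        if asserted and pid in present:
            present[pid].add(phenotype)
    return present


def bonferroni(p_values):
    """Multiply each p value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


mentions = [
    ("p1", "anxiety", True),
    ("p1", "insomnia", False),  # non-asserted mention, ignored
    ("p2", "anxiety", True),    # patient with a single note, excluded
    ("p3", "delusions", True),
]
note_counts = {"p1": 3, "p2": 1, "p3": 2}
print(patient_phenotypes(mentions, note_counts))
```

Keeping the note count per patient alongside these binary indicators allows it to enter the regression as a covariate, as described in the text.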

Phenotype Analysis of SMI Patients. We used multiple correspondence analysis (MCA) [43] to investigate phenotype co-occurrence within and between SMI diagnoses. This projected the patients’ phenotypic profiles onto a lower dimensional space for visualization. For clarity of presentation, we randomly selected 1,000 patients from each diagnostic category for plotting.

Annotation and IAA

Our annotation of 136 phenotypes across 4,000 documents yielded 11,900 unique mentions at HOMO and 15,862 at CSJDM (the full set of annotations can be found in online suppl. Table 3, and observations regarding the consensus results in online suppl. note 7). After excluding phenotypes with insufficient support (44 at HOMO and 40 at CSJDM) or low reliability (eight in each hospital), we retained 84 phenotypes at HOMO and 88 at CSJDM for NER pattern development. These phenotypes showed high IAA, with average kappa values of 0.88 (HOMO) and 0.87 (CSJDM).

Generation of Patterns for NER, Sentence Tokenizer, and ConText

We developed 601 NER patterns for 84 phenotypes at HOMO and 695 patterns for 88 phenotypes at CSJDM (online suppl. Fig. 2), optimizing precision and recall on the development set (Fig. 2; online suppl. Fig. 3). Recall in the test set improved with increasing size of the development set, typically plateauing after the annotation of 500–1,000 documents (Fig. 3; online suppl. Fig. 4). Additionally, we created 56 RuSH patterns for sentence tokenization and 103 ConText patterns (24 negation, 27 uncertainty, 18 hypothetical, 24 historical, and 9 third-party references). All patterns are available at https://github.com/clarafrydman/Spanish_Psych_Phenotyping for broad use and further development.

Fig. 2.

Process of deriving phenotype-specific patterns from annotations for two phenotypes, referential thinking (a) and apathy (b), in the CSJDM algorithm, alongside saturation curves showing the changes in recall and precision in the development set for every new pattern (e.g., “referential” pattern) added to a concept (i.e., “referential thinking”). Some patterns included both LEMMA matching (base word forms, e.g., running → run) and FUZZY matching (allowing for minor spelling variations). Terms shown in Spanish are translated as follows: ideas delirantes de referencialidad (“delusions of reference”); apatía (“apathy”); todo me da igual (“everything is the same to me,” or “I don’t care about anything”); no me importa la vida (“I don’t care about life”); pérdida del interés (“loss of interest”).

Fig. 3.

Saturation curves showing change in recall for the phenotypes referential thinking (a) and apathy (b) at the document-level evaluation in CSJDM, as a function of number of documents annotated as the development set. Solid line and shaded area correspond to mean and 95% CI, respectively, of five random samples.


Evaluation of the Algorithms

Our NER patterns achieved high document-level performance. Phenotypes at HOMO (test set n = 309) had mean precision/recall/F1 of 0.90/0.82/0.84, while phenotypes at CSJDM (test set n = 358) scored 0.91/0.82/0.85 (online suppl. Table 5). In both settings, phenotype performance (F1) was positively correlated with frequency in the test set (HOMO: r = 0.33, p = 0.002; CSJDM: r = 0.25, p = 0.016), IAA (HOMO: r = 0.40, p = 0.0003; CSJDM: r = 0.54, p = 4.93e−8), and annotation-to-pattern ratio (HOMO: r = 0.31, p = 0.005; CSJDM: r = 0.45, p = 1.02e−5) (online suppl. Fig. 5), indicating better performance for common phenotypes with consistent phrasing. After excluding 14 and 15 phenotypes with F1 <0.70, we retained 70 and 73 phenotypes at HOMO and CSJDM, respectively, for reliable extraction from documents. Of these, 56 were shared between the two hospitals (online suppl. Table 5; Fig. 1e).

At the entry level, both algorithms maintained strong performance when combining NER and ConText patterns. After removing phenotypes with insufficient support across 120 entries, we evaluated 69 phenotypes at HOMO and 70 at CSJDM. HOMO achieved mean precision/recall/F1 of 0.86/0.71/0.75, while CSJDM scored 0.88/0.74/0.78 (online suppl. Table 6). This within-hospital evaluation identified 50 high-performing phenotypes in each hospital (F1 ≥0.70), of which 30 were shared across hospitals (Fig. 1e). Some phenotypes showed marked cross-hospital differences – for example, “low energy” achieved F1 scores of 0.95 at CSJDM but 0.44 at HOMO, while for “thought disorder” F1 was 0.78 at HOMO but 0.49 at CSJDM. Online supplementary Table 9 and online supplementary Figures 6 and 7 detail each phenotype’s status at each evaluation stage.

Cross-Hospital Evaluation

Cross-hospital evaluation demonstrated high portability of all algorithms, with most phenotypes (70–76%) maintaining high performance (Fig. 4; online suppl. Fig. 8; online suppl. Table 6). On the HOMO entries (n = 120), the mean F1 was 0.77 for the CSJDM algorithm (69 phenotypes) and 0.78 for the COMBINED algorithm (69 phenotypes). On the CSJDM entries (n = 120), the mean F1 was 0.75 for the HOMO algorithm (64 phenotypes) and 0.77 for the COMBINED algorithm (70 phenotypes). The COMBINED algorithm outperformed the hospital-specific algorithms on their native test sets in 35/69 phenotypes at HOMO and 17/70 phenotypes at CSJDM. Furthermore, the COMBINED algorithm improved recall by >0.1 for 22 phenotypes at HOMO and 8 at CSJDM, likely due to these phenotypes’ lower support in their development sets. These results demonstrate both successful cross-hospital transferability and the superior performance of the COMBINED algorithm (online suppl. Table 6).

Fig. 4.

Support and F1 scores of each phenotype evaluated for all algorithms on the CSJDM entry-level test set (n = 120). a F1 scores for the CSJDM (dark blue), HOMO (green), and COMBINED (light blue) algorithms, on the CSJDM test set. Only the 53 phenotypes passing upstream quality criteria in CSJDM and HOMO are displayed: a minimum of ten annotations in the HOMO and CSJDM development sets as well as kappa values of at least 0.60, an F1 score of at least 0.70 for the document-level evaluation in both CSJDM and HOMO, and support ≥3 in the entry-level CSJDM test set. Phenotypes are ordered by support levels in the CSJDM test set. b Support for each phenotype in the CSJDM entry-level test set (blue, left axis, n = 120) and development set (orange, right axis, n = 1,642).


Eleven phenotypes consistently achieved very high performance (F1 ≥0.85) across algorithms and test sets, including “delusions,” “anxiety,” “worthlessness,” “hallucinatory behavior,” and “substance use.” In contrast, “poor insight” and “compromised judgment” consistently performed poorly (F1 ≤0.22 and ≤0.16, respectively) despite having high support (66–89; Fig. 4; online suppl. Fig. 8).

Application of the NLP Algorithm to the SMI Population

We used the COMBINED algorithm to analyze clinical notes of 9,737 SMI patients at CSJDM (4,618 MDD, 3,656 BD, 1,463 SCZ) with at least two intake or progress notes between 2016 and 2022, comprising 25,599 intake and 315,190 progress notes. Of 50 phenotypes that achieved F1 ≥0.70 in the COMBINED-to-CSJDM evaluation (Fig. 5), most proved transdiagnostic: 44 were recorded in over 10% of patients of each diagnosis, and 21 in over one-third of them, including “substance use,” “anxiety,” “suicidal ideation,” and “insomnia” (online suppl. Table 7). At the same time, most phenotypes also showed diagnosis-specific associations (online suppl. Table 8). For example, MDD-enriched phenotypes compared to BD included “depressed mood” (OR = 4.96, p value = 7.13e−110), “worthlessness” (OR = 4.4, p value = 1.46e−159), and “suicidal ideation” (OR = 3.45, p value = 9.54e−113). SCZ-enriched phenotypes compared to BD included “altered affective modulation” (OR = 2.75, p value = 7.33e−27), “hallucinations” (OR = 2.72, p value = 1.73e−40), and “persecutory ideation” (OR = 2.68, p value = 4.68e−39).

Fig. 5.

Percentage of SMI patients with each of 50 phenotypes in CSJDM. Presence of the phenotype was based on detecting at least one asserted instance by the COMBINED algorithm in intake or progress notes of N = 9,737 individuals with at least two intake or progress notes.


Phenotype-Based MCA of CSJDM SMI Population

MCA of symptom profiles revealed distinct clusters of MDD and SCZ patients, with BD patients overlapping both groups – reflecting the common combination of psychotic and mood symptoms in BD (Fig. 6a). The factor plot showed variance driven by two main dimensions: the first being severity (presence/absence of symptoms) and the second being the spectrum between mood symptoms (e.g., “anhedonia,” “guilt,” “hopelessness”) and psychotic symptoms (e.g., “hallucinatory behavior,” “incoherence,” and “persecutory ideation”) (Fig. 6b).

Fig. 6.

MCA plot of SMI patients with at least two intake or progress notes at CSJDM (N = 9,737). MCA was based on binary presence/absence of 50 extracted phenotypes. a Plot of 1,000 randomly selected patients from each SMI category. b Factor map of MCA variables colored by symptom domain.


In this study, we developed and validated a Spanish-language cNLP algorithm for extracting detailed psychiatric symptoms from clinical notes, enabling uniform analysis of symptom patterns across diagnostic categories. This work stands out in several ways. First, we targeted an extensive list of phenotypes, designed by combining top-down (literature-based) and bottom-up (data-driven) approaches. Second, we developed a rigorous evaluation workflow, testing both phenotype-level reliability and algorithm performance from the document to the full entry levels. Third, we demonstrated cross-hospital transportability, highlighting the potential for broad adoption across clinical settings. Fourth, our systematic, transdiagnostic phenotyping of SMI at scale revealed extensive symptom overlap across diagnostic categories. Fifth, we addressed gaps in global mental health research by making our tool and annotations publicly available to foster research in Spanish-language mental health [44].

Several key advantages enable the broad adoption and future expansion of our tool. First, our method is lightweight and fast, making it particularly suitable for large-volume EHR datasets even in resource-constrained settings, addressing critical needs in global mental health informatics [29]. Our approach can be readily deployed within existing EHR infrastructure and used locally by clinicians or researchers without the need for specialized hardware. It supports both real-time symptom information summarization and large-scale retrospective analysis. Potential clinical applications include extracting relevant psychiatric symptoms and psychosocial history concepts from a patient’s longitudinal record to assist clinicians with diagnosis, risk assessments, and treatment plans. For researchers, the extracted phenotypes can facilitate cohort assembly and longitudinal analyses of disease trajectories, and can serve as predictors in clinical outcome studies. Second, our method is transparent and modular, allowing users to understand and modify individual rules to ensure compatibility with local documentation practices and terminology. We showed that incorporating data from multiple sources generally improved the algorithms’ performance, suggesting that expansion to additional hospitals could further enhance the tool’s applicability. Our finding that most phenotypes reach peak recall after a few hundred annotated documents could guide others aiming to refine our approach and optimize resource allocation in similar projects. While we used specific thresholds to filter phenotypes in our analyses, we are publishing the NER patterns for all phenotypes, allowing users to determine their own criteria and select relevant phenotypes as needed.

Most phenotypes meeting thresholds for support and IAA were reliably extracted at the document level (70/84 at HOMO and 73/88 at CSJDM). Unsurprisingly, phenotypes with lower IAA and greater phrasing variability proved more challenging for our pattern-based approach. Additionally, while most phenotypes evaluated at the entry level in HOMO and CSJDM (50/69 and 50/70, respectively) showed good performance, some (“poor insight,” “compromised judgment”) performed poorly across all settings, despite strong test-set support. One likely reason is that the note sections dedicated to these phenotypes (i.e., “insight” and “judgment and reasoning”) were not used for document annotation and typically contained brief, single-word summaries (e.g., “adequate,” “poor”), complicating pattern-based extraction and context detection. In practice, these fields resemble structured data more than narrative text; therefore, these concepts may be more reliably extracted using simple heuristics or direct field-level parsing than with full-document NLP. Future implementations will analyze these semi-structured sections using such simpler extraction methods rather than general-purpose NLP techniques.
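
As a rough sketch of what direct field-level parsing could look like for such short fields, the example below maps "Field: value" lines to a phenotype status. The Spanish field labels, the value vocabularies, and the line layout are all assumptions made for illustration; real notes at either hospital may be structured differently.

```python
import re

# Hypothetical section labels and value vocabularies; real notes will differ.
FIELD_TO_PHENOTYPE = {
    "introspección": "poor insight",
    "juicio": "compromised judgment",
}
ABNORMAL_VALUES = {"pobre", "nula", "nulo", "parcial", "comprometido", "alterado"}
NORMAL_VALUES = {"adecuada", "adecuado", "conservada", "conservado"}

# One short "Field: value" pair per line.
FIELD_LINE = re.compile(
    r"^\s*(?P<field>[^:\n]+?)\s*:\s*(?P<value>.+?)\s*$", re.MULTILINE
)


def parse_short_fields(note_text):
    """Map short 'Field: value' lines to phenotype status (True/False/None)."""
    status = {}
    for match in FIELD_LINE.finditer(note_text):
        field = match.group("field").lower()
        value = match.group("value").lower().rstrip(".")
        phenotype = FIELD_TO_PHENOTYPE.get(field)
        if phenotype is None:
            continue  # field not mapped to any phenotype
        if value in ABNORMAL_VALUES:
            status[phenotype] = True
        elif value in NORMAL_VALUES:
            status[phenotype] = False
        else:
            status[phenotype] = None  # unrecognized wording: flag for review
    return status


example = "Introspección: pobre\nJuicio: adecuado\nAfecto: ansioso"
parsed = parse_short_fields(example)
print(parsed)
```

Returning `None` for unrecognized values, rather than guessing, keeps the heuristic conservative and makes unexpected wording easy to audit.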

Our cNLP tool was effective at extracting diagnosis-specific and transdiagnostic phenotypes, providing a comprehensive picture of SMI. For example, we observed high frequencies of “suicidal ideation” and “anxiety” across multiple diagnoses. Similarly, our finding that individuals with BD exhibit a phenotypic profile that bridges MDD and SCZ is consistent with its clinical definition as encompassing both mood episodes and psychosis [45]. These transdiagnostic profiles can support efforts in precision psychiatry by identifying patients who, while spanning different diagnoses, share clinical features and may benefit from similar treatments. They may also help clinicians detect significant shifts in a patient’s overall symptom profile over time. Likewise, this high-throughput phenotyping approach, driven by clinical features rather than diagnostic categories, can serve as a framework for future research into genetic associations, disease trajectories, and treatment response across psychiatric conditions. Furthermore, extracted phenotypes include observational features missing from standard assessment tools but often documented in clinical notes, such as poor personal hygiene [8, 44], highlighting the wealth of phenotypic information embedded within clinical narratives. By enabling comprehensive phenotyping, our approach paves the way for a deeper understanding of the relationship between symptoms and diagnostic categories.
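
The transdiagnostic summary described above can be sketched as a per-diagnosis prevalence table followed by a threshold filter. The record layout, the toy patients, and the reading of "across diagnoses" as "above the threshold within every diagnosis" are assumptions for illustration only; the sketch does not reproduce the study's analysis code or data.

```python
from collections import defaultdict


def prevalence_by_diagnosis(patients):
    """Return {diagnosis: {phenotype: fraction of patients with it}}."""
    totals = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for patient in patients:
        dx = patient["diagnosis"]
        totals[dx] += 1
        # Count each phenotype at most once per patient.
        for phenotype in set(patient["phenotypes"]):
            counts[dx][phenotype] += 1
    return {
        dx: {p: n / totals[dx] for p, n in counts[dx].items()}
        for dx in totals
    }


def transdiagnostic(prevalence, threshold=0.10):
    """Phenotypes whose prevalence exceeds `threshold` in every diagnosis."""
    phenotypes = set().union(*(d.keys() for d in prevalence.values()))
    return {
        p for p in phenotypes
        if all(d.get(p, 0.0) > threshold for d in prevalence.values())
    }


# Toy records (hypothetical; not study data).
patients = [
    {"diagnosis": "MDD", "phenotypes": ["anxiety", "suicidal ideation"]},
    {"diagnosis": "MDD", "phenotypes": ["anxiety"]},
    {"diagnosis": "SCZ", "phenotypes": ["anxiety", "hallucinations"]},
    {"diagnosis": "SCZ", "phenotypes": ["hallucinations"]},
    {"diagnosis": "BD", "phenotypes": ["anxiety", "suicidal ideation"]},
]
prev = prevalence_by_diagnosis(patients)
shared = transdiagnostic(prev)
print(shared)
```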

Limitations and Future Work

We note several limitations to our approach and directions for future work, spanning three categories: coverage, methodological constraints, and temporal considerations. Regarding phenotype coverage, our initial annotation sets failed to capture several phenotypes considered core features of SMI, such as “decreased need for sleep” and “flight of ideas,” while others, such as “grandiosity,” met frequency criteria in only one of the hospitals. This lack of representation likely arises from the inherent imprecision of psychiatric phenotyping and from variations in culture, documentation practices, and training, all of which affect how symptoms are documented. For example, “grandiosity” could be described as a “delusion.” Regional variation in clinical training could also explain differences in performance between hospitals (e.g., for “low energy” and “thought disorder”). Another contributor to low annotation support may be our document selection strategy, which prioritized sections by phenotype density rather than stratifying by diagnosis, potentially excluding diagnosis-specific phenotypes that are rare in the overall SMI population. This could be addressed in future work through stratified sampling of documents across diagnoses.
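
A minimal sketch of the stratified sampling idea follows, assuming each candidate document carries a diagnosis label. The record fields, the per-stratum quota, and the toy corpus are hypothetical and serve only to show how a fixed quota per diagnosis keeps rarer groups represented.

```python
import random
from collections import defaultdict


def stratified_sample(documents, per_stratum, seed=0):
    """Sample up to `per_stratum` documents from each diagnosis group."""
    rng = random.Random(seed)  # fixed seed for a reproducible selection
    by_diagnosis = defaultdict(list)
    for doc in documents:
        by_diagnosis[doc["diagnosis"]].append(doc)
    sample = []
    for diagnosis in sorted(by_diagnosis):
        group = by_diagnosis[diagnosis]
        k = min(per_stratum, len(group))  # small groups contribute everything
        sample.extend(rng.sample(group, k))
    return sample


# Toy corpus with an imbalanced diagnosis mix (hypothetical document IDs).
corpus = (
    [{"doc_id": f"mdd-{i}", "diagnosis": "MDD"} for i in range(50)]
    + [{"doc_id": f"scz-{i}", "diagnosis": "SCZ"} for i in range(20)]
    + [{"doc_id": f"bd-{i}", "diagnosis": "BD"} for i in range(5)]
)
selected = stratified_sample(corpus, per_stratum=10)
print(len(selected))
```

Compared with selecting purely by phenotype density, a per-diagnosis quota trades some annotation efficiency for better coverage of diagnosis-specific phenotypes.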

Methodologically, ultrarare phenotypes remain difficult to extract and validate with this approach, suggesting the need for larger annotation sets or alternative enrichment strategies. The improved recall of the COMBINED algorithm indicates that incorporating additional sites into the development process could enhance performance for such phenotypes. The emergence of LLMs offers promising opportunities to address these limitations while expanding our framework’s scope. Pattern-based approaches are well suited for direct application to full EHR databases, offering advantages over deep-learning-based models in terms of efficiency, hardware requirements, cost, speed [46, 47], interpretability, and overall performance [47‒49]. However, they are limited by their rigidity, and the use of LLMs for psychiatric phenotyping [50‒53] could improve flexibility and generalizability. Moreover, deep learning models may show greater recall [32] and better performance in the detection of rare or complex phenotypes, capturing new and diverse expressions through transfer learning without relying on extensive annotations. Thus, in future work, we aim to use the annotations generated here to fine-tune or prompt LLMs for phenotyping. Another direction for future work is the expansion of our methodology to capture social determinants of health, which are rarely documented in structured EHR data [54].

Finally, like other healthcare machine-learning applications [55, 56], our tool may be affected by EHR data drifts [57], including changes in documentation practices and regulations, and may therefore require periodic updates to maintain effectiveness. Despite numerous factors complicating phenotype extraction, including a dataset spanning multiple years and recorded by dozens of clinicians across different states, our approach robustly extracted many core features of SMI.

Our study introduces a robust, Spanish-language clinical NLP algorithm for comprehensive psychiatric phenotyping. The transdiagnostic phenotypes extracted through these methods can be used across clinical and research applications, such as modeling of disease progression, monitoring treatment response, and enabling large-scale genetic investigations of the cause and course of SMI.

This study was performed in accordance with the Declaration of Helsinki. This human study was approved by the Institutional Review Board (IRB), Medical Institutional Review Board 3, at UCLA, the Comité de Ética del Instituto de Investigaciones Médicas at Universidad de Antioquia (UdeA), the Comité de Ética en Investigación de la E.S.E at HOMO, and the Comité de Bioética de Clínica San Juan de Dios at CSJDM – Approval Nos. IRB#20-000149, IRB#20-001537, and IRB#16-002084. All adult participants provided written informed consent to participate in studies IRB#20-000149, IRB#20-001537, and IRB#16-002084; in addition, a waiver of consent and IRB approval by Comité de Bioética de Clínica San Juan de Dios at CSJDM was granted to access all EHR data. All privacy rights of human subjects have been observed.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

This work was supported by the National Institute of Mental Health (Grant Nos. R01 MH123157 [to L.M.O.L., C.L.-J., and N.B.F.], R01 MH113078 [to C.L.-J. and N.B.F.], R00 MH116115 [to L.M.O.L.], U01MH125042 [to L.M.O.L., C.L.-J., and N.B.F.]); the Fulbright Commission in Colombia under the Fulbright-Minciencias grant (to J.F.D.L.H.); and the UCLA Cota-Robles Fellowship (to C.F.-G.).

Juan F. De La Hoz and Clara Frydman-Gani: conceptualization, methodology, formal analysis, and writing – original draft; Alejandro Arias: data curation, methodology, and formal analysis; Maria Perez Vallejo, John Daniel Londoño Martínez, and Mauricio Castaño: formal analysis and methodology; Laura Mena and Ariel Seroussi: formal analysis; Susan K. Service: methodology; Ana M. Diaz-Zuluaga and Ana M. Ramirez-Diaz: data curation; Johanna Valencia-Echeverry: project administration; Victor I. Reus: conceptualization; Alex A.T. Bui: conceptualization and methodology; Nelson B. Freimer and Loes M. Olde Loohuis: conceptualization, methodology, supervision, and writing – original draft; Carlos Lopez-Jaramillo: data curation and supervision. All authors: writing – review and editing.

Additional Information

Juan F. De La Hoz and Clara Frydman-Gani contributed equally to this work.

The full annotated corpus used to develop these algorithms consists of clinical notes containing protected patient data and cannot be shared publicly. The specific spans annotated for all concepts do not contain private health information and can be found in online supplementary Table 3. The full set of patterns developed as part of this work is publicly available for application and further development, and can be found at https://github.com/clarafrydman/Spanish_Psych_Phenotyping. Further inquiries can be directed to the corresponding author.

1. Ford E, Carroll JA, Smith HE, Scott D, Cassell JA. Extracting information from the text of electronic medical records to improve case detection: a systematic review. J Am Med Inform Assoc. 2016;23(5):1007–15.
2. Assale M, Dui LG, Cina A, Seveso A, Cabitza F. The revival of the notes field: leveraging the unstructured content in electronic health records. Front Med. 2019;6:66.
3. Forbush TB, Gundlapalli AV, Palmer MN, Shen S, South BR, Divita G, et al. “Sitting on pins and needles”: characterization of symptom descriptions in clinical notes. AMIA Jt Summits Transl Sci Proc. 2013;2013:67–71.
4. Bejan CA, Ripperger M, Wilimitis D, Ahmed R, Kang J, Robinson K, et al. Improving ascertainment of suicidal ideation and suicide attempt with natural language processing. Sci Rep. 2022;12(1):15146.
5. Perlis RH, Iosifescu DV, Castro VM, Murphy SN, Gainer VS, Minnier J, et al. Using electronic medical records to enable large-scale studies in psychiatry: treatment resistant depression as a model. Psychol Med. 2012;42(1):41–50.
6. Denny JC, Ritchie MD, Basford MA, Pulley JM, Bastarache L, Brown-Gentry K, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics. 2010;26(9):1205–10.
7. Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet. 2012;13(6):395–405.
8. Smoller JW. The use of electronic health records for psychiatric phenotyping and genomics. Am J Med Genet B Neuropsychiatr Genet. 2018;177(7):601–12.
9. All of Us Research Program Investigators; Denny JC, Rutter JL, Goldstein DB, Philippakis A, Smoller JW, et al. The “all of us” research program. N Engl J Med. 2019;381(7):668–76.
10. Barr PB, Bigdeli TB, Meyers JL. Prevalence, comorbidity, and sociodemographic correlates of psychiatric disorders reported in the all of us research program. JAMA Psychiatry. 2022;79(6):622–8.
11. Fabbri C, Hagenaars SP, John C, Williams AT, Shrine N, Moles L, et al. Genetic and clinical characteristics of treatment-resistant depression using primary care records in two UK cohorts. Mol Psychiatry. 2021;26(7):3363–73.
12. Insel T, Cuthbert B, Garvey M, Heinssen R, Pine DS, Quinn K, et al. Research domain criteria (RDoC): toward a new classification framework for research on mental disorders. Am J Psychiatry. 2010;167(7):748–51.
13. Kotov R, Krueger RF, Watson D, Achenbach TM, Althoff RR, Bagby RM, et al. The Hierarchical Taxonomy of Psychopathology (HiTOP): a dimensional alternative to traditional nosologies. J Abnorm Psychol. 2017;126(4):454–77.
14. McCoy TH, Castro VM, Hart KL, Pellegrini AM, Yu S, Cai T, et al. Genome-wide association study of dimensional psychopathology using electronic health records. Biol Psychiatry. 2018;83(12):1005–11.
15. Jackson RG, Patel R, Jayatilleke N, Kolliakou A, Ball M, Gorrell G, et al. Natural language processing to extract symptoms of severe mental illness from clinical text: the Clinical Record Interactive Search Comprehensive Data Extraction (CRIS-CODE) project. BMJ Open. 2017;7(1):e012012.
16. Irving J, Patel R, Oliver D, Colling C, Pritchard M, Broadbent M, et al. Using natural language processing on electronic health records to enhance detection and prediction of psychosis risk. Schizophr Bull. 2021;47(2):405–14.
17. Tsui FR, Shi L, Ruiz V, Ryan ND, Biernesser C, Iyengar S, et al. Natural language processing and machine learning of electronic health records for prediction of first-time suicide attempts. JAMIA Open. 2021;4(1):ooab011.
18. Bayramli I, Castro V, Barak-Corren Y, Madsen EM, Nock MK, Smoller JW, et al. Predictive structured–unstructured interactions in EHR models: a case study of suicide prediction. NPJ Digit Med. 2022;5:15–1.
19. Garriga R, Buda TS, Guerreiro J, Omaña Iglesias J, Estella Aguerri I, Matić A. Combining clinical notes with structured electronic health records enhances the prediction of mental health crises. CR Med. 2023;4(11):101260.
20. López-Úbeda P, Plaza-del-Arco FM, Díaz-Galiano MC, Martín-Valdivia MT. How successful is transfer learning for detecting anorexia on social media. Appl Sci. 2021;11(4):1838.
21. Lopez-Garcia G, Jerez JM, Ribelles N, Alba E, Veredas FJ. Transformers for clinical coding in Spanish. IEEE Access. 2021;9:72387–97.
22. Solarte-Pabón O, Montenegro O, García-Barragán A, Torrente M, Provencio M, Menasalvas E, et al. Transformers for extracting breast cancer information from Spanish clinical narratives. Artif Intell Med. 2023;143:102625.
23. Cusick M, Velupillai S, Downs J, Campion TR, Sholle ET, Dutta R, et al. Portability of natural language processing methods to detect suicidality from clinical text in US and UK electronic health records. J Affect Disord Rep. 2022;10:100430.
24. Wu S, Miller T, Masanz J, Coarr M, Halgrim S, Carrell D, et al. Negation’s not solved: generalizability versus optimizability in clinical natural language processing. PLoS One. 2014;9(11):e112774.
25. Carrell DS, Schoen RE, Leffler DA, Morris M, Rose S, Baer A, et al. Challenges in adapting existing clinical natural language processing systems to multiple, diverse health care settings. J Am Med Inform Assoc. 2017;24(5):986–91.
26. Fu S, Leung LY, Raulli A-O, Kallmes DF, Kinsman KA, Nelson KB, et al. Assessment of the impact of EHR heterogeneity for clinical research through a case study of silent brain infarction. BMC Med Inform Decis Mak. 2020;20(1):60.
27. Fu S, Chen D, He H, Liu S, Moon S, Peterson KJ, et al. Clinical concept extraction: a methodology review. J Biomed Inform. 2020;109:103526.
28. Liu S, Wen A, Wang L, He H, Fu S, Miller R, et al. An open natural language processing (NLP) framework for EHR-based clinical research: a case demonstration using the National COVID Cohort Collaborative (N3C). J Am Med Inform Assoc. 2023;30(12):2036–40.
29. Névéol A, Dalianis H, Velupillai S, Savova G, Zweigenbaum P. Clinical natural language processing in languages other than English: opportunities and challenges. J Biomed Semantics. 2018;9(1):12.
30. Wu H, Wang M, Wu J, Francis F, Chang Y-H, Shavick A, et al. A survey on clinical natural language processing in the United Kingdom from 2007 to 2022. NPJ Digit Med. 2022;5:186–15.
31. Kulkarni D, Ghosh A, Girdhari A, Liu S, Vance LA, Unruh M, et al. Enhancing pre-trained contextual embeddings with triplet loss as an effective fine-tuning method for extracting clinical features from electronic health record derived mental health clinical notes. Nat Lang Process J. 2024;6:100045.
32. Vance LA, Way L, Kulkarni D, Palmer EOC, Ghosh A, Unruh M, et al. Natural language processing to identify suicidal ideation and anhedonia in major depressive disorder. BMC Med Inform Decis Mak. 2025;25(1):20.
33. Eyre H, Chapman AB, Peterson KS, Shi J, Alba PR, Jones MM, et al. Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python. AMIA Annu Symp Proc. 2021;2021:438–47.
35. Service SK, Vargas Upegui C, Castaño Ramírez M, Port AM, Moore TM, Munoz Umanes M, et al. Distinct and shared contributions of diagnosis and symptom domains to cognitive performance in severe mental illness in the Paisa population: a case-control study. Lancet Psychiatry. 2020;7(5):411–9.
36. Boltz TA, Chu BB, Liao C, Sealock JM, Ye R, Majara L, et al. A blended genome and exome sequencing method captures genetic variation in an unbiased, high-quality, and cost-effective manner. bioRxiv [Preprint]. 2024.
37. Ramirez-Diaz AM, Diaz-Zuluaga AM, Stroud RE, Vreeker A, Bitta M, Ivankovic F, et al. Phenotype harmonization and analysis for the populations underrepresented in mental illness association studies (the PUMAS project). medRxiv [Preprint]. 2024.
38. Neves M, Ševa J. An extensive review of tools for manual annotation of documents. Brief Bioinform. 2021;22(1):146–63.
39. Eckart de Castilho R, Mújdricza-Maydt É, Yimam SM, Hartmann S, Gurevych I, Frank A, et al. A web-based tool for the integrated annotation of semantic and syntactic structures. In: Hinrichs E, Hinrichs M, Trippel T, editors. Proceedings of the workshop on language technology resources and tools for digital humanities (LT4DH). Osaka, Japan: The COLING 2016 organizing committee; 2016. p. 76–84.
40. Harkema H, Dowling JN, Thornblade T, Chapman WW. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform. 2009;42(5):839–51.
41. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20(1):37–46.
42. Montani I, Honnibal M, Boyd A, Van Landeghem S, Peters H. spaCy: industrial-strength natural language processing in Python; 2023.
43. Lê S, Josse J, Husson F. FactoMineR: an R package for multivariate analysis. J Stat Softw. 2008;25:1–18.
44. Velupillai S, Suominen H, Liakata M, Roberts A, Shah AD, Morley K, et al. Using clinical natural language processing for health outcomes research: overview and actionable suggestions for future advances. J Biomed Inform. 2018;88:11–9.
45. American Psychiatric Association. Diagnostic and statistical manual of mental disorders: DSM-5. 5th ed. Washington, DC: American Psychiatric Association; 2013.
46. Samsi S, Zhao D, McDonald J, Li B, Michaleas A, Jones M, et al. From words to watts: benchmarking the energy costs of large language model inference. 2023 IEEE high performance extreme computing conference (HPEC). 2023. p. 1–9.
47. Chitty-Venkata KT, Mittal S, Emani M, Vishwanath V, Somani AK. A survey of techniques for optimizing transformer inference. J Syst Archit. 2023;144:102990.
48. Shen Y, Heacock L, Elias J, Hentel KD, Reig B, Shih G, et al. ChatGPT and other large language models are double-edged swords. Radiology. 2023;307(2):e230163.
49. Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2023;55(12):1–38.
50. Yang X, Chen A, PourNejatian N, Shin HC, Smith KE, Parisien C, et al. A large language model for electronic health records. NPJ Digit Med. 2022;5:194–9.
51. Qiu J, Li L, Sun J, Peng J, Shi P, Zhang R, et al. Large AI models in health informatics: applications, challenges, and the future. IEEE J Biomed Health Inform. 2023;27(12):6074–87.
52. Guevara M, Chen S, Thomas S, Chaunzwa TL, Franco I, Kann BH, et al. Large language models to identify social determinants of health in electronic health records. NPJ Digit Med. 2024;7(1):6.
53. Li L, Zhou J, Gao Z, Hua W, Fan L, Yu H, et al. A scoping review of using large language models (LLMs) to investigate electronic health records (EHRs). 2024.
54. Han S, Zhang RF, Shi L, Richie R, Liu H, Tseng A, et al. Classifying social determinants of health from unstructured electronic health records using deep learning-based natural language processing. J Biomed Inform. 2022;127:103984.
55. Schinkel M, Boerman AW, Paranjape K, Wiersinga WJ, Nanayakkara PWB. Detecting changes in the performance of a clinical machine learning tool over time. EBioMedicine. 2023;97:104823.
56. Rahmani K, Thapa R, Tsou P, Casie Chetty S, Barnes G, Lam C, et al. Assessing the effects of data drift on the performance of machine learning models used in clinical sepsis prediction. Int J Med Inform. 2023;173:104930.
57. Hansen L, Enevoldsen K, Bernstorff M, Perfalk E, Danielsen AA, Nielbo KL, et al. Lexical stability of psychiatric clinical notes from electronic health records over a decade. Acta Neuropsychiatr. 2023;37:e16–11.
58. De la Hoz JF, Arias A, Service SK, Castaño M, Diaz-Zuluaga AM, Song J, et al. Characterization of serious mental illness trajectories through transdiagnostic clinical features. Br J Psychiatry. 2025;1–8.