Colin G Walsh, Beenish Chaudhry, Prerna Dua, Kenneth W Goodman, Bonnie Kaplan, Ramakanth Kavuluru, Anthony Solomonides, Vignesh Subbian, Stigma, biomarkers, and algorithmic bias: recommendations for precision behavioral health with artificial intelligence, JAMIA Open, Volume 3, Issue 1, April 2020, Pages 9–15, https://doi.org/10.1093/jamiaopen/ooz054
Abstract
Effective implementation of artificial intelligence in behavioral healthcare delivery depends on overcoming challenges that are pronounced in this domain. Self and social stigma contribute to under-reported symptoms, and under-coding worsens ascertainment. Health disparities contribute to algorithmic bias. Lack of reliable biological and clinical markers hinders model development, and model explainability challenges impede trust among users. In this perspective, we describe these challenges and discuss design and implementation recommendations to overcome them in intelligent systems for behavioral and mental health.
INTRODUCTION
Artificial intelligence (AI), a rich area of research for decades, has attracted unprecedented attention in healthcare in the past few years. Academics and industry collaborators apply AI to a variety of biomedical issues ranging from clinical prediction to phenotyping complex disease states, or for guiding diagnosis, prognosis, treatment, and lifestyle change.1–11 While public perceptions of AI center on strong or artificial general intelligence (the ability for smart agents to think as humans do), most if not all published efforts in biomedicine focus on weak or applied AI.
Applied AI (subsequent mentions of “AI” in this piece will refer to applied or weak AI), from complex multivariate models to simple clinical prediction rules, has been a mainstay in prediction of hospital readmissions,12 acute kidney injury,13 mortality,14,15 and imaging (eg, retinal imaging16,17 or radiology18) for over a decade. But it has more recently been applied to challenges in mental and behavioral health (eg, predicting suicide,19 treatment resistance in depression,20 dementia,21 and more). Behavioral health includes emotional, mental, and social factors as well as behaviors to prevent illness (eg, avoiding substance abuse) and promote wellness (eg, exercise).22 Since we do not yet live in a world where behavioral healthcare is simply “healthcare” as we hope will one day be the case, informaticians must be attuned to the ways in which mental and behavioral health differ from other areas in medicine. Failure to do so leads to unintended consequences and potential harms or, at best, the most common fate for published predictive models: that they are never used in clinical practice.
To help the informatics community reach the potential for AI to impact behavioral healthcare, we will discuss issues either unique to or exemplified by behavioral health. We will then share recommendations for designing and deploying intelligent systems in this domain.
CHALLENGES
Behavioral health poses uncommon challenges to designing impactful AI. Broadly, these challenges include (1) lack of data because of (i) stigma and silence (ie, under-reporting and under-coding) and (ii) absent or unreliable biomarkers; (2) algorithmic bias; and (3) danger of inappropriate use due to gaps in interpretability or explainability, trust, and privacy concerns.
Lack of data
Under-reporting and under-coding
One in five adult Americans (∼43.8 million) experience a mental health disorder in any given year, regardless of race, religion, gender, or socioeconomic status.23 Behavioral health issues like abuse of tobacco, alcohol, and illicit drugs account for ∼$232 billion in healthcare costs annually.24 However, approximately two-thirds of those with mental illness suffer privately, without treatment.25 Stigma, both self-directed and public, contributes to this dilemma. Self-stigma feeds self-discriminating and stereotyping behavior with negative professional and personal consequences.26 Public stigma leads to restricted opportunities, coercive treatment, and reduced independence for individuals with mental and behavioral health conditions. Social stigma, for example, for opioid use disorders, can have implications for public health and punitive policy-making.27
Silence leads to under-reporting, but under-coding exacerbates this gap. Under-coding is particularly common in primary care and in patients presenting with multiple co-morbidities.28,29 For example, in patients presenting with both mental illness and a chronic condition, clinicians are more likely to code and claim for just the chronic condition.29 Even when documented, behavioral health symptoms might not be recorded in structured forms. For example, suicidal thoughts are only coded 3% of the time in primary care even when documented in notes.30 Qualifying words such as “likely” or “suspected” soften firm diagnoses. Coding suicidal ideation or severe symptoms might raise administrative (eg, expectation of triggering alerts downstream) or liability concerns for providers,30 even if they spend sufficient time assessing and planning an effective management plan with those patients.
Unreliable or absent biomarkers and objective measures
Unlike other illnesses such as congestive heart failure or sepsis,31,32 mental illness and behavioral health concerns are not directly diagnosed via objective measures, laboratory reports, or other quantitative biomarkers. Recent work suggests this might change, such as a study linking heart rate-related metrics to post-traumatic stress disorder (PTSD).33 Instead, diagnoses rest on medical history, general physical examination findings (eg, anxiety or nervousness), a thorough psychiatric examination, and provider impressions. Often, these potentially predictive data are recorded only in unstructured data such as text, or in covert forms, for example, as text about "trouble sleeping" without overt documentation of insomnia related to depression. As a result, algorithms reliant on readily available structured data might fail to incorporate diagnostic or prognostic risk factors.
Attempts to incorporate unstructured text via natural language processing (NLP) in behavioral health have been published.34–40 However, the tradeoff between sensitivity and specificity is particularly challenging here because of the well-known problems of dimensionality and negation. Adding large numbers of NLP predictors (eg, bag of words or word2vec41) to models increases dimensionality and the potential for overfitting. Moreover, the prevalence of clinical screening in practice and the documentation of negative assertions mean that basic NLP or regular expressions might fail. For instance, most documentation of suicidal thoughts in electronic health records (EHRs) describes when risk is not present (eg, screening), not when it is.
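As a minimal sketch of why naive pattern matching misleads here (the patterns, cue list, and scope rule below are invented for illustration; validated tools such as NegEx/ConText handle negation far more robustly):

```python
import re

# Minimal sketch, not a production pipeline: a naive pattern for suicidal
# ideation mentions, plus a crude negation check over the preceding text.
SI_PATTERN = re.compile(r"\b(suicidal ideation|thoughts of suicide)\b", re.IGNORECASE)
NEGATION_CUES = re.compile(r"\b(denies|denied|no|negative for|without)\b[^.]{0,40}$", re.IGNORECASE)

def flag_suicidal_ideation(note: str) -> bool:
    """Return True only for apparently non-negated mentions."""
    for match in SI_PATTERN.finditer(note):
        preceding = note[:match.start()]
        if not NEGATION_CUES.search(preceding):
            return True  # mention with no nearby negation cue
    return False

# Screening documentation is usually negative, so a bare pattern match misleads.
print(flag_suicidal_ideation("Patient denies suicidal ideation at this visit."))  # False
print(flag_suicidal_ideation("Patient endorses suicidal ideation with plan."))    # True
```

Even this crude check flips the label on the screening-style sentence, hinting at how much negation handling matters before NLP-derived features enter a model.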
Potential for algorithmic bias
All AI algorithms involve bias, but not in the sense familiar to the public.42 High bias in an algorithm implies that it misses important relationships between the input features and output variables (underfitting). A related concept, high variance, implies that a model learns nearly every data point in its training data but fails to perform well on new data (overfitting). Reducing bias or variance tends to increase the other quantity; this bias-variance tradeoff defines model performance.43 A rich literature in AI deliberately builds in representational and procedural bias in the interest of generalizing algorithmic learning beyond "strict consistency with individual instances."44 The media and public perceive a broader definition of bias as prejudice against a person, group, race, or culture, which we refer to as "algorithmic bias" here, as others have.45–49
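For squared-error loss, the standard statistical decomposition (not specific to this article) makes the tradeoff explicit; writing y = f(x) + ε for the observed outcome with noise variance σ²:

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Only the first two terms are under the modeler's control, and pushing one down past a point pushes the other up.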
Health disparities contribute to algorithmic bias. Mental illnesses such as schizophrenia and affective disorders are likely to be misdiagnosed, particularly among African Americans and lower socioeconomic groups.50 Women have higher prevalence of major depression, anxiety disorders, and PTSD,51 whereas Native Americans have disproportionate rates of suicide-associated deaths.52 Cultural differences in language and mannerisms, difficulties relating between patients and therapists of sufficiently different backgrounds, and prevailing societal notions about various groups' susceptibility to mental illness add to algorithmic bias.53 While AI might be well suited to diagnosis and prognosis in complex, nuanced phenotypes like these, we risk producing models that incorporate bias in underlying data54 (eg, lack of nonbinary gender categories in EHRs) and algorithmic bias in model specification. Model developers also rarely have "ground truth" to use for validation and model training; the dependent variables and "gold standards" might rely on expert review or chart validation and might themselves be flawed. A final critical issue is a harmful feedback loop: existing disparities may lead to unrepresentative training data, this bias may seep into predictive models, and biased predictions may further exacerbate disparities for minorities and vulnerable segments of patient populations.
Considerable scholarship discusses algorithmic bias: in data, in model specification, in deployment and use, and, if machine-learning was involved, in model training and its trainers.55–57 Robust discussion includes the need for data sharing and re-use for transparency in how algorithms work, their accuracy and reliability,52 the explainability of their conclusions, and accountability for using or not using them. In health care, other incentives might also influence how data are recorded or interpreted. Reimbursement and billing, social or employment consequences, or other financial and stigma avoidance strategies could bias what information is collected and how it is recorded, as well as the output of algorithms processing patient data.
Inappropriate use, interpretability/explainability, and trust
We have learned from ethical analyses of biomedical informatics literature that appropriate users and uses of technology are often identified based on potential to improve care.51,58,59 If an algorithm contributes positively to a patient’s treatment, then that is a good reason to use it. If it harms or does not help, then we should be hard-pressed to justify its use. In some cases, empirical research can help answer these questions.
Investigating disruptions in behavioral health requires linking data from multiple levels, from cells to the individual to the environment. Furthermore, a single biological disturbance may produce two different psychological issues and, conversely, two different neurological disturbances may produce the same or similar psychological symptoms.60 This complexity makes interpretability ("how" a model arrives at an output) and explainability ("why" it arrives at that output) harder to achieve. The literature in explainable AI has expanded in recent years.61–64 We highlight that explainability and interpretability are particularly important in behavioral health because, unlike for readmissions or sepsis, few complex models have been deployed in behavioral healthcare delivery. Thus, we have not reached an inflection point where users "trust" that AI provides accurate recommendations even if the process that led to them cannot be interrogated.
Generating explanations to interpret results from a model is critical for most conditions of interest. Clinicians rightly crave actionable insights at the time of decision-making in line with the “Five Rights” of decision support (the right information, delivered to the right person, in the right intervention format, through the right channel, and at the right time in workflow).65 But models derived from large complex datasets are harder to interpret.65 With complex nonlinear models such as deep neural networks, the task of generating explanations is non-trivial, especially in the context of EHRs that contain a mixture of structured and textual data.
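One model-agnostic way to surface patient-level contributors is sketched below on synthetic data (the feature names are hypothetical, and this is not any of the cited explanation methods): perturb each input toward a typical value and observe how the predicted risk moves.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative sketch: for one patient, swap each feature to the cohort mean
# and record how the predicted risk changes. Features and data are synthetic.
rng = np.random.default_rng(0)
feature_names = ["prior_ed_visits", "phq9_score", "med_count", "age"]
X = rng.normal(size=(500, 4))
y = (X @ np.array([0.8, 1.2, 0.5, -0.1]) + rng.normal(size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)
patient = X[0]
baseline_risk = model.predict_proba(patient.reshape(1, -1))[0, 1]

for j, name in enumerate(feature_names):
    perturbed = patient.copy()
    perturbed[j] = X[:, j].mean()  # replace with a "typical" value
    delta = baseline_risk - model.predict_proba(perturbed.reshape(1, -1))[0, 1]
    print(f"{name}: risk change attributable to this value: {delta:+.3f}")
```

Such perturbation-based attributions are crude compared with purpose-built explanation methods, but they illustrate the kind of patient-level rationale clinicians ask for.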
Because many outcomes in behavioral and mental health might be clinically rare yet have very high stakes, end-users must also be given appropriate context in which to interpret predictions. For example, many published predictive models of suicide risk show high sensitivity at the expense of low precision.66 Preventive efforts might be wasted on large numbers of false positives secondary to imprecise models. At the same time, false negatives might lead to loss of life from suicide and loss of trust in automated recommendations. The clinical harms of such events are further compounded by liability and legal implications.
A corollary challenge relates to relative inattention to calibration performance of predictive models in favor of discrimination performance (eg, c-statistics, sensitivity, and specificity). A recent systematic review showed 79% (56) of included studies did not address calibration in their results. If an outcome occurs 0.2% of the time in one clinic, a 20% predicted risk is quite high, but a clinician not educated in this interpretation might not prioritize this number without proper context.67 Failure to account for and educate end-users such as clinical providers about these issues will compromise trust in algorithmic recommendations as well as uptake.
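A minimal illustration of reading a prediction against the local base rate, using the numbers from the example above (the framing is ours, not a deployed tool):

```python
# Contextualize one model output against the clinic's outcome prevalence.
predicted_risk = 0.20   # model output for one patient
base_rate = 0.002       # outcome prevalence of 0.2% in this clinic

relative_risk = predicted_risk / base_rate
print(f"Predicted risk {predicted_risk:.0%} vs base rate {base_rate:.1%}: "
      f"~{relative_risk:.0f}x the clinic average")
# -> Predicted risk 20% vs base rate 0.2%: ~100x the clinic average
```

Presenting the relative figure alongside the absolute one is one way to keep a well-calibrated but numerically small risk from being dismissed.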
Attempts to hybridize “black box” prediction with more readily interpretable algorithms are underway.58,68 We highlight this challenge as algorithms in behavioral health have high potential to be care-altering, career-altering (eg, employment or military deployment decisions), or life-altering. Providers might feel compelled to respond to a “high risk” designation for suicide risk beyond, for example, readmission risk. Thus, the onus remains on informaticians to forge trust with end-users (ie, clinicians and patients) in demonstrating the reasoning behind the recommendations made by algorithms.
RECOMMENDATIONS
Despite the challenges of implementing AI for behavioral health, appropriate effort to overcome them supports continued innovation in this domain. Our recommendations follow, aimed at integrating intelligent systems alongside humans to augment, not replace, what people do best: taking a broad view, exercising judgment, and accepting moral responsibility.
Foster trust through education and transparency
The issue of trust can be addressed at both the community level and the technology level. Ample literature focuses on model development, validation, and implementation; far less focuses on giving noninformatics clinical providers the tools and knowledge to integrate risk prediction into practice. For providers to better judge algorithmic outputs, designers and informaticians should contextualize and educate the broader community about how to assess, integrate, and evaluate clinically applied AI. An AI-educated practitioner will also be far more likely to notice errors or potentially harmful edge cases before bad outcomes occur.
George E.P. Box is famously paraphrased as saying, "All models are wrong, some are useful."69 In 2019, we might amend that statement to say, "All models are wrong, some are useful, some might be dangerous." We need to make clear to patients, providers, and healthcare leaders that predictive models will sometimes misclassify risk and that unintended consequences will result. Systems that make this fact, and the factors that contribute to each prediction, transparent are critical. At minimum, appropriate uncertainty quantification or calibration methods should deliver predictions that quantify risk in an actionable manner while accounting for changes over time in outcome prevalence, input data, and their relationships (a process known as "drift").59,70
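A minimal sketch of one such drift check, assuming scored data arrive in windows (the data, tolerances, and test choices below are illustrative rather than validated operating points):

```python
import numpy as np
from scipy.stats import ks_2samp

# Illustrative drift check, not a production monitoring pipeline: compare
# outcome prevalence and one input feature between the training window and
# the most recent scoring window. Thresholds are placeholders.
def check_drift(y_train, y_recent, x_train, x_recent,
                prevalence_tol=0.5, ks_alpha=0.01):
    prev_train, prev_recent = y_train.mean(), y_recent.mean()
    prevalence_shift = abs(prev_recent - prev_train) / max(prev_train, 1e-9)
    ks_stat, p_value = ks_2samp(x_train, x_recent)  # feature distribution shift
    return {
        "prevalence_train": prev_train,
        "prevalence_recent": prev_recent,
        "prevalence_drift": prevalence_shift > prevalence_tol,
        "feature_drift": p_value < ks_alpha,
    }

rng = np.random.default_rng(1)
report = check_drift(
    y_train=rng.binomial(1, 0.02, 5000), y_recent=rng.binomial(1, 0.05, 1000),
    x_train=rng.normal(0, 1, 5000), x_recent=rng.normal(0.4, 1, 1000))
print(report)  # flags both prevalence and feature drift for these synthetic windows
```

Flags like these do not fix a drifting model, but they tell maintainers when recalibration or retraining is due.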
At the technology level, systems should be designed to elicit providers' trust. Zuboff found that trust in a new technology depends on trial-and-error experience, followed by understanding of the technology's operation, and finally, faith.71 To foster trust, users must have channels to disagree with recommendations so that algorithms (and their designers) can learn from these disagreements. Providers should not fear negative consequences for trusting their own clinical judgment; a safe, collaborative culture remains a key element to achieve this end. We recommend that decision support systems permit users to share elements of their decision-making not included in algorithmic design, or the reasons why they do not follow decision support recommendations.
Leverage determinants to address algorithmic bias
Behavioral and mental health conditions correlate with social determinants of health (eg, employment status; lesbian, gay, bisexual, transgender, and queer (LGBTQ) identification; and marital status), which may be recorded only in notes.72–75 Unsupervised NLP methods can identify "homelessness" and "adverse childhood events" at scale in clinical text,72 but few centers are able to integrate such methods into care delivery. Scalable NLP that makes unstructured clinical text as readily available as structured diagnostic codes is needed to further catalyze behavioral health informatics research and operations; it would concomitantly increase capture of critical biopsychosocial determinants of health.
An algorithm may pass the general performance thresholds set for use in practice for alerts and risk assessment yet perform poorly for specific demographic segments; if so, care must be exercised when using it. Collecting additional data from these populations for retraining might be an effective means to build fairer models without sacrificing accuracy.71 Care should be taken to ensure that models do not discriminate in risk assessments regardless of demographic segment size or prevalence rates. We recommend providing guidance on interpretability or analytic similarity at the time of clinical decision-making, to make transparent how similar a particular patient's demographics are to the algorithmic training cohort, akin to efforts to display similarity of clinical trial participants to our communities using census health indicator data.76
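One simple way to surface such an analytic-similarity cue (a sketch over standardized, hypothetical features; the distance cutoff is arbitrary) is to measure how far a new patient lies from the training cohort:

```python
import numpy as np

# Illustrative sketch: flag when a patient lies far from the training cohort
# in a few covariates, using Mahalanobis distance. Features and the cutoff
# are hypothetical; a deployed system would need clinically vetted choices.
def mahalanobis_distance(x, cohort):
    mu = cohort.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(cohort, rowvar=False))
    diff = x - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

rng = np.random.default_rng(2)
training_cohort = rng.normal(size=(2000, 3))  # eg, age, symptom score, visit count (standardized)
typical_patient = np.array([0.1, -0.3, 0.2])
atypical_patient = np.array([4.5, 5.0, -4.0])

for label, x in [("typical", typical_patient), ("atypical", atypical_patient)]:
    d = mahalanobis_distance(x, training_cohort)
    print(f"{label}: distance {d:.1f} -> "
          f"{'resembles training cohort' if d < 3.0 else 'poorly represented in training data'}")
```

Displaying such a cue next to a risk score gives clinicians a concrete reason to weigh a prediction more cautiously for patients unlike those the model learned from.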
Encourage interdisciplinary collaborations
AI in healthcare perches near the peak of the hype cycle. To speed its descent through the trough of disillusionment to the plateau of productivity, we must partner across disciplines. Unprecedented willingness to combine expertise in informatics, psychology, psychiatry, healthcare delivery, engineering, and more has stimulated excitement around AI in healthcare and in other aspects of our lives. However, to avoid predictive models that never reach implementation or clinical use, clinical processes should be linked to the nascent stages of model development. Retrospective validation remains an accepted initial step in AI development and often relies on relatively accessible resources. Transitioning to prospective use in clinical practice requires different study designs, for example, pragmatic clinical trials and ongoing evaluation, as well as ongoing commitment from clinical and operational partners.
To improve interpretability, models should also be able to identify the attributes in patient records that contributed to the predictions or recommendations generated.33 A focus on actionable, modifiable risk factors will convert prognostic models into predictive models that not only estimate the risk of a future event but also show how that risk might change under interventions available right now.77 For example, a predictive model might direct provider behavior in measurable and impactful ways if it suggests that reducing polypharmacy could lower the downstream risk of an adverse drug event by 10%, rather than merely flagging the presence of polypharmacy.
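A toy "what-if" framing of that polypharmacy example follows (the scoring function and coefficients are stand-ins for a fitted model, and the difference shown is an association rather than a causal effect estimate):

```python
import math

# Hypothetical "what-if" comparison: estimated risk under the current
# medication count versus the risk if polypharmacy were reduced.
def toy_risk_score(med_count: int, age: int) -> float:
    logit = -5.0 + 0.25 * med_count + 0.02 * age  # illustrative coefficients only
    return 1.0 / (1.0 + math.exp(-logit))

current = toy_risk_score(med_count=12, age=67)
proposed = toy_risk_score(med_count=8, age=67)
print(f"Risk now: {current:.1%}; if medications reduced to 8: {proposed:.1%} "
      f"({proposed - current:+.1%} change)")
```

Framing output this way points the provider toward something to do, not just something to know.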
Augment the human elements
Another significant consideration relates to the roles best played by humans and those best played by machines. Though this consideration applies to any use of technology in healthcare, behavioral health in particular involves delicate interactions during times of crisis.
Since the advent of ELIZA, it has been well established that patients report symptoms more readily to digital or intelligent agents, such as chat-bots, than to humans, especially for behavioral health concerns.25,26 On the other hand, interacting with vulnerable individuals requires skill and sensitivity not generally attributed to computers, however sophisticated. Therefore, the more recent push has been toward integrating intelligent agents with human providers. For instance, the AI algorithm of Crisis Text Line, a national nonprofit, uses two text responses of users contemplating suicide to triage them to a live counselor.26 However, these systems are rarely linked to EHRs and routine clinical care at medical centers, so the onus remains on patients to report, and providers to ask about, these exchanges at the next clinical encounter.
Based on this evidence, we offer the following suggestions to address under-reporting and under-coding. When designed with the behavioral health context in mind, AI models provide an opportunity to address under-coding, for example, by mining relationships between data categories (eg, laboratory results, medications, and diagnoses).78
Access to mental health expertise and allocation of a precious resource—consultation from mental health specialists—remain major challenges in healthcare around the world.79 In the short-term, AI approaches to allocate such resources optimally and to queue the appropriate next patients for consultation by busy providers are key steps to begin proving clinical efficacy of intelligent systems in this area.
We emphasize the need to improve ascertainment of both predictors and phenotypes not well captured in structured or objective measurements, in particular by augmenting NLP at the point of care. Basic sentiment analysis of discharge notes alone, for example, improves prediction of suicidality.36 A behavioral health crisis might not be explicitly coded at the time of billing but might be well described in clinical text. Intelligent agents that analyze clinical text in production, even clinical messages in patient portals, might improve our ability to identify patients in times of need, much as we receive alerts when a creatinine result rises sharply as surveillance for acute renal failure.
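For illustration only, a tiny lexicon-based sentiment score over a note might look like the sketch below (the word lists are invented and far simpler than the validated approach in the cited study):

```python
# Very small lexicon-based sentiment sketch: score a note by counting words
# from tiny, invented positive/negative lists.
NEGATIVE = {"hopeless", "worthless", "alone", "pain", "crisis", "overwhelmed"}
POSITIVE = {"improving", "supported", "hopeful", "stable", "engaged"}

def note_sentiment(note: str) -> float:
    tokens = [t.strip(".,;:").lower() for t in note.split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / max(pos + neg, 1)  # ranges from -1 (negative) to +1 (positive)

note = "Patient feels hopeless and alone, overwhelmed by pain since discharge."
print(note_sentiment(note))  # -1.0; such a score could feed a downstream risk model
```

Even a feature this simple captures signal that structured billing codes miss, which is the broader point about text at the point of care.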
An over-arching question in this domain asks whether we can trust intelligent counselors or whether counselors should always be human. Though diagnostic criteria are debated and findings are subject to interpretation and negotiation, humans remain more likely to provide better, and more humane, outcomes for patients, at least for now.80 The lack of clearly established, objectively measurable markers for some conditions in behavioral health further complicates this matter and reminds us of the importance of human judgment. If human comfort and a sympathetic touch are called for (and would a machine be able to tell?), it would still be up to the healthcare practitioner to provide them.
WHAT’S MISSING
Many key themes were out of scope for this work. Mobile health applications for communication, activity tracking, meditation, and much more transform daily life for millions and are increasingly used in large-scale data collection, including behavioral health conditions.81 Telemedicine has waxed and waned over the past decades but has unprecedented purchase in healthcare today. Telepsychiatry and telepsychology are potentially potent care delivery mechanisms on their own and stand to be enhanced through appropriate use of predictive models. We touched on ethical and privacy concerns, which are developed more fully elsewhere but still need further attention.82–87 Finally, a growing body of press and literature outline concerns around commercial use and public–private partnerships involving clinical data and in particular mental health data via the app ecosystem, data aggregation, and others.88–91 We highlight this important area that remains in need of further inquiry and empirical research.
CONCLUSIONS
The issues above help shape AI’s potential in healthcare. Though consequences may be starker in behavioral and mental health, they deserve attention for all areas of AI in medicine. Inattention to them contributes to the most common fate for published predictive models—they are rarely translated into clinical practice.92 Some models are appropriately evaluated, tested operationally, and not deployed, but many never reach that point.
Because behavioral health poses key informatics challenges, our recommendations are intended to catalyze further discussion. We have achieved our current state of predictive technology in behavioral health through close collaboration across disciplines. Rigorous, prospective evaluation is necessary to ensure outcomes improve with minimum unintended consequences. We should address these challenges to protect and improve quality of life and to improve mental and behavioral health through these same means.
FUNDING
Authors' efforts were partially supported by the following grants: grant #W81XWH-10-2-0181 and R01 MH116269-01 (CGW); the National Institute of General Medical Sciences of the National Institutes of Health under grant #P20 GM103424-17 (PD); the U.S. National Center for Advancing Translational Sciences under grant #UL1TR001998 (RK); and the National Science Foundation under grant #1838745 (VS).
AUTHOR CONTRIBUTIONS
CGW led planning and drafting of this work. VS initiated the work and formed the team. The remaining authors contributed equally to manuscript planning, drafting, and revision, and therefore are listed in alphabetical order. This work is a collaborative effort between the AMIA Ethical, Legal, and Social Issues Working Group and the Mental Health Informatics Working Group.
CONFLICT OF INTEREST
None declared.
REFERENCES
Centers for Medicare and Medicaid Services (CMS), Substance Abuse and Mental Health Services Administration (SAMHSA). A Roadmap to Behavioral Health – A Guide to Using Mental Health and Substance Use Disorder Services. CMS.gov Consumer Resources:25.
Mental Health by the Numbers | NAMI: National Alliance on Mental Illness. https://www.nami.org/learn-more/mental-health-by-the-numbers. Accessed March 6, 2019.
National Institute on Drug Abuse. Trends & Statistics.
Corrigan PW, Kleinlein P. The Impact of Mental Illness Stigma. In P. W. Corrigan ed. On the stigma of mental illness: Practical strategies for research and social change. Washington, DC, US: American Psychological Association; 2005:11–44.
Choi E, Bahadori MT, Sun J, Kulas J, Schuetz A, Stewart W. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, 2016: 3504–3512.
Mitchell TM. The need for biases in learning generalizations. New Jersey: Department of Computer Science, Laboratory for Computer Science Research, Rutgers University; 1980.
Leavitt RA, Ertl A, Sheats K, et al. Suicides Among American Indian/Alaska Natives — National Violent Death Reporting System, 18 States, 2003–2014. MMWR Morb Mortal Wkly Rep 2018; 67: 237–42.
Ferryman K, Pitcan M. Fairness in precision medicine. Data & Society Research Institute, 2018; 54.
Doran D, Schulz S, Besold TR. What does explainable AI really mean? A new conceptualization of perspectives. arXiv preprint arXiv: 1710.00794; 2017.
Gunning D, Aha D. DARPA's Explainable Artificial Intelligence (XAI) Program. AI Magazine, 2019; 40 (2): 44–58.
Holzinger A, Biemann C, Pattichis CS, Kell DB. What do we need to build explainable AI systems for the medical domain?. arXiv preprint arXiv:1712.09923; 2017.
Guidotti R, Monreale A, Ruggieri S, Turini F, Giannotti F, Pedreschi D. A survey of methods for explaining black box models. ACM Comput. Surv. (CSUR) 2019; 51 (5): 93
Osheroff J, Teich J, Levick D, et al. Improving outcomes with clinical decision support: an implementer's guide. Chicago, IL: HIMSS Publishing; 2012.
McKernan LC, Lenert MC, Crofford LJ, Walsh CG. Outpatient Engagement and Predicted Risk of Suicide Attempts in Fibromyalgia. Arthritis Care Res (Hoboken). 2019;71 (9):1255–1263.
WHO. Mental Health: Massive Scale-Up of Resources Needed if Global Targets are to be Met. Geneva, Switzerland: WHO. http://www.who.int/mental_health/evidence/atlas/atlas_2017_web_note/en/. Accessed July 26, 2019.
Goodman KW, Cushman R, Miller RA. Ethics in Biomedical and Health Informatics: Users, Standards, and Outcomes. In: Shortliffe EH and Cimino JJ, eds. Biomedical Informatics—Computer Applications in Health Care and Biomedicine. New York: Springer; 2014:
Kaplan B, Ranchordás S. Alzheimer's and m-Health: Regulatory, Privacy, and Ethical Considerations. In: Hayre CM, Muller DJ, and Scherer MH, eds. Everyday Technologies in Healthcare. 1 ed. Boca Raton, FL: CRC Press,
McKernan LC, Clayton EW, Walsh CG. Protecting life while preserving liberty: ethical recommendations for suicide prevention with artificial intelligence. Front Psychiatry, 2018; 9: 650.
Tucker RP, Tackett MJ, Glickman D, et al. Ethical and Practical Considerations in the Use of a Predictive Model to Trigger Suicide Prevention Interventions in Healthcare Settings. Suicide Life Threat Behav. 2019; 49 (2): 382–92.
The Privacy Project. N. Y. Times. 2019. https://www.nytimes.com/series/new-york-times-privacy-project. Accessed September 25, 2019.
Mental Health Apps are Scooping up Sensitive Data. Will you Benefit? STAT. 2019. https://www.statnews.com/2019/09/20/mental-health-apps-capture-sensitive-data/. Accessed September 25, 2019.
Artificial Intelligence Can Complicate Finding the Right Therapist. STAT. 2019. https://www.statnews.com/2019/09/20/artificial-intelligence-tool-finding-mental-health-therapist/. Accessed September 25, 2019.