Abstract
Background: People living with overweight and obesity often experience weight-based stigmatization. Investigations of the prevalence and correlates of weight bias and evaluation of weight bias reduction interventions depend upon psychometrically sound measurement. Our paper is the first to comprehensively evaluate available self-report measures of weight bias in terms of their psychometric properties, use of people-first language within items, and suitability for use with various populations. Methods: We searched five electronic databases to identify English-language self-report questionnaires of weight bias. We rated each questionnaire's psychometric properties based on initial validation reports and subsequent use, and examined item language. Results: Our systematic review identified 40 original self-report questionnaires. Most questionnaires were brief, demonstrated adequate internal consistency, and tapped key cognitive and affective dimensions of weight bias such as stereotypes and blaming. Current psychometric evidence is incomplete for many questionnaires, particularly with regard to test-retest reliability, sensitivity to change, and discriminant and structural validity. Most questionnaires were developed prior to debate surrounding terminology preferences, and do not employ people-first language in the items administered to participants. Conclusions: We provide information and recommendations for clinicians and researchers in selecting psychometrically sound measures of weight bias for various purposes and populations, and discuss future directions to improve measurement of this construct.
Introduction
Stigmatization of individuals with overweight and obesity occurs in healthcare, educational settings, the workplace and interpersonal relationships, and is associated with adverse physical and psychological health outcomes [1]. A growing body of research has examined the weight-biased attitudes that may underlie stigmatizing behavior. Weight bias research has been characterized by variability in both its conceptualization and measurement [2]. Because awareness of the reliability and validity of measures is critical to gauge their usefulness and appropriateness for various purposes, a description and systematic comparison of the psychometric properties of self-report questionnaires measuring weight bias is needed to help advance knowledge in this area.
Definitions and Concepts: What Is Weight Bias?
The terms ‘weight bias,' ‘weight stigma,' and ‘weight prejudice' have been used interchangeably to refer to negative attitudes and discrimination toward individuals based on their body weight. Weight bias may be understood within the ecological system in terms of its structural, interpersonal, and intrapersonal components, following the framework developed by Cook et al. [3]. Structural weight bias focuses on social forces and institutions, including mass media portrayals of individuals with obesity, and differential access to goods, services, and opportunities based on weight. For example, studies have reported negative consequences of weight bias on quality of healthcare [4]. Interpersonal weight bias refers to prejudice and discrimination occurring within dyadic and small-group interactions, and intrapersonal (i.e., internalized) weight bias refers to the stigmatized group's emotional, cognitive, and behavioral reactions to negative messages about their own abilities and intrinsic worth [5].
The present review focuses on self-report scales that assess attitudes toward people living with obesity. The constructs of interpersonal and internalized weight bias are conceptually similar, but differ with respect to the target of the attitude being measured. Because self-report measures of interpersonal and internalized weight bias have not been comprehensively evaluated, we chose to include both in our review. We excluded scales that assess experiences of weight-based stigmatization, which have been reviewed elsewhere [6].
Measurement of Weight Bias
Weight bias has been assessed through self-report questionnaires, figure and vignette ratings, experimental manipulations, field studies, implicit measures, and neuroimaging methods. Two reviews of these methods have highlighted their strengths and weaknesses, and discussed their suitability for answering various research questions [2,7]. This review provides a summary and critical evaluation of the psychometric properties of self-report weight bias questionnaires.
Psychometric evidence may help clinicians and researchers select instruments that validly measure an attribute. Psychometrics center on the concepts of i) reliability, or the degree to which measures taken by similar or parallel instruments, by different observers, or at different points in time yield the same or similar results; and ii) validity, or the degree to which variation in scores reflects variation in the construct of interest [8]. Reliability and validity are not static traits of instruments, but rather vary as a function of contexts, purposes, and populations [8]. Although researchers not uncommonly declare an instrument unconditionally valid or reliable, such statements are unwarranted [8]. Rather, test users can gather and examine many types of psychometric evidence to infer whether scores on an instrument are likely to be reliable and valid indicators of the construct of interest for the population being assessed, and with the user's purpose in mind.
Although two reviews on weight bias measurement have been published, only a minority of available weight bias questionnaires have been evaluated, and only in terms of a few psychometric properties. Ruggs et al. [2] reviewed eight of the most popular self-report questionnaires that assess weight bias, with a focus on internal consistency reliability. Morrison et al. [7] examined the internal consistency, dimensionality, and convergent validity of five of these questionnaires. Several aspects of reliability and validity (e.g., test-retest reliability, discriminant validity, sensitivity to change) and many additional questionnaires have not yet been evaluated. Two additional limitations exist in this area of research. First, although Ruggs et al. [2] highlighted the importance of using consistent terminology to describe weight status, the potential for questionnaire item wording to perpetuate weight bias via potentially stigmatizing language describing the target group has not been considered. Second, assessments of the suitability of different measures for specific populations are needed. To address these gaps in the literature, the present review provides an inventory of available self-report questionnaires of weight bias; evaluates their psychometric properties, the terminology used within questionnaire instructions and items, and other characteristics; and provides recommendations to assist clinicians and researchers in selecting questionnaires.
Methods
Findings are reported in accordance with the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement [9].
Search Strategy
A search was conducted for studies employing self-report questionnaires to examine attitudes about people with overweight or obesity. Our search strategy erred on the side of inclusivity. Keywords related to weight bias were identified by reviewing relevant studies and search strategies from previous reviews (e.g., [6,10]). After testing a preliminary search strategy, search terms were modified to ensure retrieval of known relevant studies.
The final search strategy employed 35 combinations of keywords synonymous with ‘weight bias' (see Appendix A for full MEDLINE search strategy, available at http://content.karger.com/ProdukteDB/produkte.asp?doi=475716). The search was limited to English-language articles. The electronic databases CINAHL, Embase, MEDLINE, PsycINFO, and PubMed were searched from inception to December 2015.
Article Screening
We used reference management software to compile search results and screen studies for eligibility. In the first phase of screening, the first author examined the titles and, when necessary, abstracts. In the second phase of screening, the first author examined full texts to determine whether articles met inclusion criteria.
The following inclusion criteria were applied: i) published in English, ii) appeared in a peer-reviewed journal, iii) included an original questionnaire or subscale to assess biased attitudes toward people with obesity and/or overweight, and iv) contained three or more items to assess weight bias. Modified versions of questionnaires were retained as separate questionnaires if they had meaningfully different item content, tested a different construct, or had a different purpose than the original questionnaire (e.g., tailored for a specific setting or population group).
Articles were excluded if i) questionnaires assessed general appearance-related bias without specifying weight; ii) questionnaires employed figure rating scales, ratings of specific individuals portrayed through images, or written vignettes; iii) questionnaires assessed beliefs about the etiology or treatment of obesity (unless they contained three or more items that examined attitudes toward people with obesity); iv) the items assessing weight bias did not form a unitary scale with a total score that was analyzed; or v) questionnaires examined experiences of weight-based stigmatization, rather than attitudes toward people living with obesity. We retained questionnaires that assessed self-stigma and internalized weight bias.
Data Extraction and Analysis
We derived criteria for evaluating evidence of psychometric strength from Clark and Watson's [11] guidelines for scale development, from Haynes, Smith, and Hunsley's [8] guide to the psychometric foundations of clinical assessment, and from systematic reviews of questionnaires that assess related constructs (e.g., [12]). Each criterion represents an aspect of reliability or validity that should ideally be considered in scale development and selection. The first author rated each criterion as present (1) or absent (0); the more criteria a questionnaire met, the greater the evidence for its psychometric strength.
Internal consistency: the degree of intercorrelation among items that make up an instrument or subscale, as indicated by Cronbach's α; rated present if α ≥ 0.70 [8]. When multiple alphas were reported (e.g., for different samples or subscales), we based our rating on the highest reported value.
Test-retest reliability: the tendency of an instrument to produce similar scores in the same individuals on different occasions; rated present if r ≥ 0.70 over a period of several days to several weeks [8]. When multiple coefficients were reported, we based our rating on the highest reported coefficient.
Theoretical clarity: careful definition of the construct; rated present if authors clearly articulated their conceptualization of the construct guiding scale development.
Content validity: the extent to which a measure represents all facets of a given construct; rated present if authors described item development and selection procedures in reasonable detail.
Structural validity: the extent to which a scale's internal structure parallels the external structure of the target trait; rated present if factor analysis was performed.
Convergent validity: the degree to which the measure correlates with other measures of the same construct or theoretically related constructs; rated present if scores were significantly correlated with scores on ≥1 measure of a theoretically related construct.
Discriminant validity: the degree to which scores on measures of distinct constructs are uncorrelated or weakly correlated; rated present if correlations were examined between instrument scores and scores on ≥1 theoretically unrelated scale.
Sensitivity to change: evidence of construct validity is provided when an intervention targeting a construct affects scores; rated present if scores significantly changed following an intervention or experimental manipulation.
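The eight criteria above can be read as a simple present/absent checklist. The following is a minimal Python sketch of that logic, not the procedure the authors actually ran: the `Evidence` record, its field names, and the `rate` function are hypothetical, while the 0.70 cut-offs and the use of the highest reported coefficient come from the criteria as stated.

```python
from dataclasses import dataclass, field
from typing import Dict, List

ALPHA_CUTOFF = 0.70   # internal consistency cut-off stated in the criteria
RETEST_CUTOFF = 0.70  # test-retest cut-off stated in the criteria


@dataclass
class Evidence:
    """Hypothetical summary of what a scale's validation reports contain."""
    alphas: List[float] = field(default_factory=list)     # reported Cronbach's alphas
    retest_rs: List[float] = field(default_factory=list)  # reported test-retest coefficients
    theory_articulated: bool = False
    item_development_described: bool = False
    factor_analysis_performed: bool = False
    convergent_correlation_significant: bool = False
    discriminant_correlation_examined: bool = False
    scores_changed_after_intervention: bool = False


def rate(e: Evidence) -> Dict[str, int]:
    """Rate each criterion as present (1) or absent (0) and add a total (0 to 8)."""
    ratings = {
        # Reliability ratings are based on the highest reported coefficient.
        "internal_consistency": int(bool(e.alphas) and max(e.alphas) >= ALPHA_CUTOFF),
        "test_retest": int(bool(e.retest_rs) and max(e.retest_rs) >= RETEST_CUTOFF),
        "theoretical_clarity": int(e.theory_articulated),
        "content_validity": int(e.item_development_described),
        "structural_validity": int(e.factor_analysis_performed),
        "convergent_validity": int(e.convergent_correlation_significant),
        "discriminant_validity": int(e.discriminant_correlation_examined),
        "sensitivity_to_change": int(e.scores_changed_after_intervention),
    }
    ratings["total"] = sum(ratings.values())
    return ratings
```

Note that only the two reliability criteria depend on the size of a coefficient; the remaining six record whether a property was addressed at all, so a high total reflects breadth of evidence rather than its quality.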
For each questionnaire, data from the original article describing scale development were extracted, including authors, year of publication, sample characteristics, target of attitudes being measured (e.g., adults with obesity), item number and type (e.g., Likert scale), sample items, and dimensions of weight bias that the questionnaire was developed to assess. We also examined scale instructions and items to extract weight-related terms used to describe the target group (e.g., fat), and to assess whether scales employed people-first language, operationalized as the use of post-modified nouns (e.g., people with obesity) rather than pre-modified nouns (e.g., obese people) in items.
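The distinction between pre-modified and post-modified nouns can be illustrated with a rough pattern match. The sketch below is only an approximation of manual coding: the word lists, the two regular expressions, and the `classify` function are assumptions introduced for illustration, and real items would need human review.

```python
import re

# Illustrative word lists (assumptions, not the coding scheme used in the review).
WEIGHT_ADJECTIVES = r"(?:obese|overweight|fat)"
PERSON_NOUNS = r"(?:people|persons?|individuals?|adults?|children|child|patients?|men|women)"

# Pre-modified noun: weight term directly before the person noun ("obese people").
PRE_MODIFIED = re.compile(
    rf"\b{WEIGHT_ADJECTIVES}\s+{PERSON_NOUNS}\b", re.IGNORECASE
)
# Post-modified noun: person noun followed by "with" and the condition
# ("people with obesity", "individuals living with overweight").
POST_MODIFIED = re.compile(
    rf"\b{PERSON_NOUNS}\s+(?:living\s+)?with\s+(?:overweight|obesity)\b", re.IGNORECASE
)


def classify(item: str) -> str:
    """Label an item by the kind of noun phrase it contains."""
    # Any pre-modified phrase takes precedence, even if a people-first phrase also appears.
    if PRE_MODIFIED.search(item):
        return "pre-modified"
    if POST_MODIFIED.search(item):
        return "people-first"
    return "unclassified"
```

Phrases that fit neither pattern are deliberately left as "unclassified" so that ambiguous wording is referred to a human rater rather than forced into a category.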
To estimate the popularity of each questionnaire in subsequent research, we noted the number of times each article had been cited using Web of Science (April 22, 2016). In a final step, after identifying original questionnaires, we conducted Web of Science forward citation searches in November 2016 to identify subsequent studies that employed each questionnaire. We retained subsequent studies that provided evidence for psychometric properties beyond those addressed in original scale development articles and incorporated this additional evidence into our ratings accordingly.
Results
A PRISMA flow diagram appears in figure 1. Forty original or modified self-report questionnaires of weight bias from 36 studies met our inclusion criteria (table 1). Questionnaire popularity and length, item type, and sample characteristics are listed in Appendix B (available at http://content.karger.com/ProdukteDB/produkte.asp?doi=475716). Through forward searches, we identified additional evidence of psychometric strength for 10 of the identified questionnaires.
Psychometric Evidence
The number of psychometric criteria that questionnaires fulfilled varied from 0 to 8, and only one questionnaire, the Weight Self-Stigma Questionnaire (#27), met all eight of our criteria. Internal consistency was the most commonly addressed criterion in scale development articles: Cronbach's α ≥ 0.70 was reported for at least one subscale or sample of 36 questionnaires. Content validity was addressed for 24 questionnaires. Item generation and selection procedures often included a combination of literature review, development of an over-inclusive initial item pool, review of items for clarity, examination of item factor loadings, and elimination of items with low factor loadings. Convergent validity was addressed for 26 of 40 scales. Fewer scale development articles addressed the criteria of theoretical clarity (n = 20), structural validity (n = 17), discriminant validity (n = 10), and sensitivity to change (n = 17). Test-retest reliability was rated positively for only three questionnaires (#17, #21, #27), each of which reported correlations > 0.70 for ≥1 subscale. One questionnaire (#18) had a 2-week test-retest reliability of 0.62.
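For readers less familiar with how a test-retest coefficient is obtained, the following Python sketch computes a Pearson correlation between total scores from two administrations and compares it with the 0.70 cut-off. It assumes the coefficient is a Pearson r; the function names are hypothetical, and any scores passed in would be the user's own data rather than values from the reviewed studies.

```python
from math import sqrt
from typing import Sequence

RETEST_CUTOFF = 0.70  # cut-off used in this review


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between paired scores from two occasions."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long score lists with at least 2 respondents")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores on at least one occasion do not vary")
    return sxy / sqrt(sxx * syy)


def retest_adequate(time1: Sequence[float], time2: Sequence[float]) -> bool:
    """True when the test-retest coefficient reaches the cut-off."""
    return pearson_r(time1, time2) >= RETEST_CUTOFF
```

Under this rule a reported coefficient of 0.62, as for questionnaire #18, falls below the cut-off and the criterion is rated absent.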
Dimensions and Psychometric Evidence by Population
General Population
Twenty-three scales were designed for use in the general population (#2, #3, #5, #7, #8, #10-14, #16, #19, #22, #23, #25, #26, #29, #30, #33, #34, #37-39). Most questionnaires included items that assessed the key dimensions of blame and stereotypes. The Anti-Fat Attitudes Test (#26) assesses the dimensions of blame, romantic or physical attractiveness and social/character disparagement (dislike), and fulfilled seven of eight criteria. The Anti-Fat Attitudes Questionnaire (#13) similarly fulfilled seven criteria and assesses the dimensions of dislike, willpower (blame), and fear of fat. Self-report questionnaires explicitly assessing stereotypes of individuals with overweight or obesity include the Fat Phobia Scale (#37) and the Attitudes Toward Obese Persons Scale (#2), which each fulfilled six criteria, as well as the Fat Phobia Scale - Short Form (#5) and the Obese Person Trait Survey (#34), which each fulfilled five criteria. The Universal Measure of Bias - Fat Scale (#25) assesses, among others, the dimension of attractiveness and fulfilled six criteria. These questionnaires have been employed with the general population as well as with clinical samples and adolescents.
Children and Youth
Three scales were designed for use with children and youth (#4, #9, #18). Anesbury and Tiggemann (#4) administered an unnamed measure developed to assess beliefs about the controllability of weight to Australian schoolchildren (mean age = 10 years); this measure met five criteria. The other scales fulfilled three or fewer criteria, suggesting a need for further validation.
Healthcare Professionals
Scales designed for use in the general population have been frequently employed in samples of healthcare professionals to gauge levels of weight bias not specifically directed at patients; for a literature review of weight bias among nurses see Brown [13]. Eight scales were designed specifically to assess weight bias in healthcare settings (#1, #6, #21, #24, #28, #35, #36, #40). These scales commonly examined the dimensions of professional competence in caring for individuals with overweight or obesity and beliefs about compliance with weight loss treatment. For medical students, the Nutrition, Exercise, and Weight Management Attitudes Scale (#21) fulfilled six criteria. For nurses, the Nurses' Attitudes Toward Obesity and Obese Patients Scale (#40) fulfilled five criteria.
Physical Educators
Two scales were designed for use with physical educators (#17, #32) in order to assess attitudes toward youth of larger sizes. The Expectations Questionnaire (#17) met four criteria, while Peters and Jones' measure (#32) met two criteria.
Self-Stigma
Three questionnaires were designed to assess internalized weight bias (#15, #27, #31). The Weight Self-Stigma Questionnaire (#27) examined the dimensions of fear of enacted stigma and self-devaluation/shame and fulfilled eight criteria. The Weight Bias Internalization Scale (WBIS) (#15) examined endorsement of negative stereotypes and self-statements, fulfilled seven criteria, and has demonstrated strong psychometric properties in samples of adolescents [14]. The Modified Weight Bias Internalization Scale (#31) is a version of the WBIS adapted for use with individuals of diverse weight categories and fulfilled five criteria.
Parents
One scale was designed to assess parents' attitudes toward children with overweight (#20). This scale met four criteria.
Use of Weight Terminology and People-First Language
Terminology used to describe weight in questionnaire instructions and items appears in Appendix B (available at http://content.karger.com/ProdukteDB/produkte.asp?doi=475716). Nineteen questionnaires used the terms ‘obese' or ‘obesity', 10 used the term ‘fat', and 11 used the term ‘overweight'. Questionnaires assessing internalized weight bias referred to ‘my weight,' ‘my weight problems,' or ‘being overweight'. People-first language was exceedingly rare in the items administered to participants: 37 of 40 questionnaires used premodified nouns, with the term ‘obese,' ‘overweight,' or ‘fat' as an adjective appearing first in order (e.g., ‘fat people'). One questionnaire used the phrase ‘men and women who are substantially overweight' (#30); whether this constitutes people-first language is debatable as ‘overweight' appears later in the phrase but is still used as an adjective.
Discussion
Psychometric Characteristics
Almost all scales had evidence of adequate internal consistency, and many had evidence of content validity and convergent validity; however, most scales were missing crucial elements of scale development. For example, the theory which guided scale construction was clearly articulated for only 20 of the 40 questionnaires, and factor structure was investigated for only 17 questionnaires. This finding suggests a need for researchers to clearly articulate conceptualizations of weight bias and to continue to investigate this construct empirically.
The finding that Cronbach's α was the most commonly reported psychometric statistic represents a widespread trend [15], yet may be problematic. Although Cronbach's α indicates internal consistency or unidimensionality, researchers often mistakenly equate it with scale reliability or validity [8], and fail to seek other important psychometric evidence. To promote optimal measurement of weight bias, emphasis needs to extend to other psychometric properties, such as test-retest reliability and content, structural, and construct validity.
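To make concrete what Cronbach's α does and does not capture, the sketch below computes it with the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores), where k is the number of items. The function name and the respondent-by-item data layout are assumptions made for illustration.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items table.

    scores[r][i] is respondent r's score on item i (hypothetical layout).
    """
    if len(scores) < 2:
        raise ValueError("need at least 2 respondents")
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least 2 items")
    if any(len(row) != k for row in scores):
        raise ValueError("every respondent must answer every item")

    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary")

    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
```

The calculation uses only item scores from a single administration. It draws on no retest data, external criterion, or factor model, which is why the other forms of evidence listed above have to be gathered separately.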
Terminology and People-First Language
Terminology preferences have been the subject of much debate in recent years [16,17]: whereas the Obesity Society in 2014 called for ‘people-first language' [18], some groups have advocated for the use of neutral terminology (e.g., ‘higher weight'), and individuals in the fat acceptance community have sought to reclaim the term ‘fat' as a neutral descriptor [19]. Because many scales included in our review were developed before terminology preferences became a topic of considerable debate, questionnaire items have employed terminology that may now be considered pejorative by some. Research has demonstrated powerful effects of terminology: in one study, individuals who completed a measure of tolerance toward ‘the mentally ill' had lower tolerance scores than individuals given an otherwise identical questionnaire that employed the people-first term ‘people with mental illnesses' [20,21]. Though colloquial or pejorative terminology may be intentionally used in order to encourage respondents to freely express negative attitudes, it may have other unintended consequences, such as normalizing weight bias. Conversely, participants may be more likely to start using people-first or neutral terminology themselves if they are exposed to these terms in questionnaire items. We recommend that future research empirically examine the impact of weight-related terminology, for example by manipulating questionnaire terminology and assessing effects on implicit associations.
Limitations and Future Directions
We note several limitations of the present review. First, a lack of psychometric evidence for a questionnaire may indicate a need for further evaluation in future research rather than a flawed scale. Many scales have been used with additional populations (e.g., adults vs. adolescents), but it was not feasible to document scales' psychometric characteristics in each sample for the present review. One possible way to facilitate ongoing evaluation of questionnaires would be through the creation of a repository that could be continually updated. Second, although we aimed to err on the side of over-inclusivity, we may nevertheless have overlooked relevant studies. This possibility was minimized by ensuring that the search strategy retrieved known relevant studies, and by checking reference lists of important articles. Third, although we examined the questionnaires' use of terminology and people-first language, items may cause offense in other ways, e.g., by reinforcing weight-based stereotypes. Fourth, many of our criteria pertained to whether psychometric properties were addressed, not to the quality of resulting evidence. While the criteria of internal consistency and test-retest reliability were rated according to whether values exceeded accepted cut-offs indicating adequacy, no such cut-offs exist for other psychometric criteria. Appendix B (available at http://content.karger.com/ProdukteDB/produkte.asp?doi=475716) provides additional details about each questionnaire's psychometric properties and other characteristics.
Findings from this review describe the extent of current evidence for the reliability and validity of weight bias questionnaires. Psychometrically strong scales are available for the assessment of internalized weight bias and of weight bias both in the general population and among healthcare professionals. However, investigation of many important types of psychometric evidence has frequently been missing from scale development articles. We encourage the ongoing investigation of a variety of psychometric properties across settings and contexts. Test developers and users alike should strive to maximize measurement quality by investigating psychometric properties such as test-retest reliability, structural validity, discriminant validity, and sensitivity to change.
In addition to gaps identified in breadth of psychometric evidence, the present review highlighted the complexity of the construct of weight bias. Many possible dimensions of weight bias were identified, but fewer than half of the articles reviewed had clearly articulated the theories that had guided scale development. Our understanding of weight bias would benefit from additional research to clarify the underlying structure of the construct, prioritize theoretical clarity by providing clear definitions of the dimensions of interest, and investigate differences in questionnaire design and item wording that may influence the reliability and validity of weight bias measurement.
Acknowledgements
Emilie Lacroix is supported by a Fonds de recherche du Québec - Santé (FRQS) Master's Training Award. Angela S. Alberga is supported by a Banting Canadian Institutes of Health Research (CIHR) Postdoctoral Fellowship. Lindsay McLaren is supported by an Applied Public Health Chair award funded by CIHR (Institute of Population and Public Health and Institute of Musculoskeletal Health and Arthritis), the Public Health Agency of Canada, and Alberta Innovates - Health Solutions. This research was also supported by a Meeting Grant from Campus Alberta Health Outcomes & Public Health. We gratefully acknowledge Dr. Alix Hayden, the University of Calgary Health Sciences librarian whose consultation was indispensable in the development of our search strategy.
Disclosure Statement
The authors declared no conflict of interest.


