Validity In Psychology Research: Types & Examples

Saul Mcleod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul Mcleod, PhD., is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.


Olivia Guy-Evans, MSc

Associate Editor for Simply Psychology

BSc (Hons) Psychology, MSc Psychology of Education

Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.

In psychology research, validity refers to the extent to which a test or measurement tool accurately measures what it’s intended to measure. It ensures that the research findings are genuine and not due to extraneous factors.

Validity can be categorized into different types, broadly grouped under internal and external validity.

The concept of validity was formulated by Kelley (1927, p. 14), who stated that a test is valid if it measures what it claims to measure. For example, a test of intelligence should measure intelligence and not something else (such as memory).

Internal and External Validity In Research

Internal validity refers to whether the effects observed in a study are due to the manipulation of the independent variable and not some other confounding factor.

In other words, internal validity is the degree to which we can be confident that the manipulation of the independent variable caused the observed change in the dependent variable.

Internal validity can be improved by controlling extraneous variables, using standardized instructions, counterbalancing, and eliminating demand characteristics and investigator effects.

External validity refers to the extent to which the results of a study can be generalized to other settings (ecological validity), other people (population validity), and over time (historical validity).

External validity can be improved by setting experiments in more natural settings and using random sampling to select participants.

Types of Validity In Psychology

Two main categories of validity are used to assess the validity of a test (i.e., questionnaire, interview, IQ test, etc.): content and criterion.

  • Content validity refers to the extent to which a test or measurement represents all aspects of the intended content domain. It assesses whether the test items adequately cover the topic or concept.
  • Criterion validity assesses the performance of a test based on its correlation with a known external criterion or outcome. It can be further divided into concurrent (measured at the same time) and predictive (measuring future performance) validity.

[Table showing the different types of validity]

Face Validity

Face validity is simply whether the test appears (at face value) to measure what it claims to. This is the least sophisticated measure of content-related validity, and is a superficial and subjective assessment based on appearance.

Tests wherein the purpose is clear, even to naïve respondents, are said to have high face validity. Accordingly, tests wherein the purpose is unclear have low face validity (Nevo, 1985).

A direct measurement of face validity is obtained by asking people to rate the validity of a test as it appears to them. Raters could use a Likert scale to assess face validity.

For example:

  • The test is extremely suitable for a given purpose
  • The test is very suitable for that purpose
  • The test is adequate
  • The test is inadequate
  • The test is irrelevant and, therefore, unsuitable

It is important to select suitable people to rate a test (e.g., questionnaire, interview, IQ test, etc.). For example, individuals who actually take the test would be well placed to judge its face validity.

Also, people who work with the test could offer their opinion (e.g., employers, university administrators). Finally, the researcher could use members of the general public with an interest in the test (e.g., parents of testees, politicians, teachers, etc.).

The face validity of a test can be considered a robust construct only if a reasonable level of agreement exists among raters.
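To make the idea of rater agreement concrete, here is a minimal sketch in Python. The ten ratings, coded on the 5-point scale above, are hypothetical and invented purely for illustration.

```python
# Minimal sketch: summarizing face-validity ratings from a panel of raters.
# The ratings are hypothetical, coded 1-5 from the scale above
# (5 = extremely suitable ... 1 = irrelevant and therefore unsuitable).
from statistics import mean, stdev

ratings = [5, 4, 4, 5, 3, 4, 4, 5, 4, 4]  # one rating per rater

print(f"Mean face-validity rating: {mean(ratings):.2f} / 5")
print(f"Spread across raters (SD): {stdev(ratings):.2f}")

# A high mean with a small spread suggests reasonable agreement among raters;
# a large spread would undermine any claim that the test has robust face validity.
```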

It should be noted that the term face validity should be avoided when the rating is done by an “expert,” as content validity is more appropriate.

Having face validity does not mean that a test really measures what the researcher intends to measure, only that, in the judgment of raters, it appears to do so. Consequently, it is a crude and basic measure of validity.

A test item such as “I have recently thought of killing myself” has obvious face validity as an item measuring suicidal cognitions and may be useful when measuring symptoms of depression.

However, items with clear face validity are more vulnerable to social desirability bias. Individuals may manipulate their responses to deny or hide problems, or exaggerate behaviors, to present a positive image of themselves.

It is possible for a test item to lack face validity but still have general validity and measure what it claims to measure. This can be an advantage because it reduces demand characteristics and makes it harder for respondents to manipulate their answers.

For example, the test item “I believe in the second coming of Christ” would lack face validity as a measure of depression (as the purpose of the item is unclear).

This item appeared on the first version of The Minnesota Multiphasic Personality Inventory (MMPI) and loaded on the depression scale.

Because most of the MMPI’s original normative sample were practicing Christians, denying the second coming was an unusual response associated with depression. Thus, for this particular religious sample, the item has general validity but not face validity.

Construct Validity

Construct validity assesses how well a test or measure represents and captures an abstract theoretical concept, known as a construct. It indicates the degree to which the test accurately reflects the construct it intends to measure, often evaluated through relationships with other variables and measures theoretically connected to the construct.

Construct validity was introduced by Cronbach and Meehl (1955). This type of content-related validity refers to the extent to which a test captures a specific theoretical construct or trait, and it overlaps with some of the other aspects of validity.

Construct validity does not concern the simple, factual question of whether a test measures an attribute.

Instead, it is about the complex question of whether test score interpretations are consistent with a nomological network involving theoretical and observational terms (Cronbach & Meehl, 1955).

To test for construct validity, it must be demonstrated that the phenomenon being measured actually exists. So, the construct validity of a test of intelligence, for example, depends on a model or theory of intelligence.

Construct validity entails demonstrating the power of such a construct to explain a network of research findings and to predict further relationships.

The more evidence a researcher can demonstrate for a test’s construct validity, the better. However, there is no single method of determining the construct validity of a test.

Instead, different methods and approaches are combined to present the overall construct validity of a test. For example, factor analysis and correlational methods can be used.

Convergent validity

Convergent validity is a subtype of construct validity. It assesses the degree to which two measures that theoretically should be related are related.

It demonstrates that measures of similar constructs are highly correlated. It helps confirm that a test accurately measures the intended construct by showing its alignment with other tests designed to measure the same or similar constructs.

For example, suppose there are two different scales used to measure self-esteem: Scale A and Scale B. If both scales effectively measure self-esteem, then individuals who score high on Scale A should also score high on Scale B, and those who score low on Scale A should score similarly low on Scale B.

If the scores from these two scales show a strong positive correlation, then this provides evidence for convergent validity because it indicates that both scales seem to measure the same underlying construct of self-esteem.
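As a minimal sketch of how such a convergent-validity check might be computed, the Python example below correlates scores on the two scales. The scores for Scale A and Scale B are hypothetical, invented purely for illustration.

```python
# Minimal sketch: convergent validity as the correlation between two
# self-esteem scales administered to the same (hypothetical) respondents.
import numpy as np
from scipy.stats import pearsonr

scale_a = np.array([32, 28, 41, 25, 37, 30, 44, 22, 35, 39])  # hypothetical scores
scale_b = np.array([30, 26, 43, 24, 35, 31, 45, 20, 33, 41])

r, p = pearsonr(scale_a, scale_b)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")

# A strong positive correlation is evidence that both scales tap the same
# underlying construct; a weak or negative one would argue against it.
```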

Concurrent Validity (i.e., occurring at the same time)

Concurrent validity evaluates how well a test’s results correlate with the results of a previously established and accepted measure, when both are administered at the same time.

It helps in determining whether a new measure is a good reflection of an established one without waiting to observe outcomes in the future.

If the new test is validated by comparison with a currently existing criterion, we have concurrent validity.

Very often, a new IQ or personality test might be compared with an older but similar test known to have good validity already.

Predictive Validity

Predictive validity assesses how well a test predicts a criterion that will occur in the future. It measures the test’s ability to foresee the performance of an individual on a related criterion measured at a later point in time. It gauges the test’s effectiveness in predicting subsequent real-world outcomes or results.

For example, a prediction may be made on the basis of a new intelligence test that high scorers at age 12 will be more likely to obtain university degrees several years later. If the prediction is borne out, then the test has predictive validity.
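As a rough sketch of how such a prediction could be checked statistically, the Python example below relates hypothetical test scores at age 12 to a binary later outcome (degree obtained or not) using a point-biserial correlation; all data and variable names are invented for illustration.

```python
# Minimal sketch: predictive validity as the relationship between test scores
# taken now and a (hypothetical) criterion observed years later.
import numpy as np
from scipy.stats import pointbiserialr

score_at_12 = np.array([95, 110, 124, 102, 130, 88, 118, 107, 99, 126])
degree_later = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 1])  # 1 = obtained a degree

r, p = pointbiserialr(degree_later, score_at_12)
print(f"Point-biserial r = {r:.2f} (p = {p:.3f})")

# A substantial positive correlation would support the test's predictive validity.
```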

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.

Hathaway, S. R., & McKinley, J. C. (1943). Manual for the Minnesota Multiphasic Personality Inventory. New York: Psychological Corporation.

Kelley, T. L. (1927). Interpretation of educational measurements. New York: Macmillan.

Nevo, B. (1985). Face validity revisited. Journal of Educational Measurement, 22(4), 287-293.


What is the Significance of Validity in Research?


  • Introduction: What is validity in simple terms?
  • Internal validity vs. external validity in research
  • Uncovering different types of research validity
  • Factors that improve research validity

In qualitative research, validity refers to the trustworthiness of study findings. Within the expansive landscape of research methodologies, the qualitative approach, with its rich, narrative-driven investigations, demands unique criteria for ensuring validity.

Unlike its quantitative counterpart, which often leans on numerical robustness and statistical rigor, validity in qualitative research delves into the realms of credibility, dependability, and the richness of the data.

The importance of validity in qualitative research cannot be overstated. Establishing validity means ensuring that the research findings genuinely reflect the phenomena they are intended to represent. It reinforces the researcher's responsibility to present an authentic representation of study participants' experiences and insights.

This article will examine validity in qualitative research, exploring its characteristics, techniques to bolster it, and the challenges that researchers might face in establishing validity.


At its core, validity in research speaks to the degree to which a study accurately reflects or assesses the specific concept that the researcher is attempting to measure or understand. It's about ensuring that the study investigates what it purports to investigate. While this seems like a straightforward idea, the way validity is approached can vary greatly between qualitative and quantitative research.

Quantitative research often hinges on numerical, measurable data. In this paradigm, validity might refer to whether a specific tool or method measures the correct variable, without interference from other variables. It's about numbers, scales, and objective measurements. For instance, if one is studying personalities by administering surveys, a valid instrument could be a survey that has been rigorously developed and tested to verify that the survey questions are referring to personality characteristics and not other similar concepts, such as moods, opinions, or social norms.

Conversely, qualitative research is more concerned with understanding human behavior and the reasons that govern such behavior. It's less about measuring in the strictest sense and more about interpreting the phenomenon that is being studied. The questions become: "Are these interpretations true representations of the human experience being studied?" and "Do they authentically convey participants' perspectives and contexts?"


Differentiating between qualitative and quantitative validity is crucial because the methods used to ensure validity differ between these research paradigms. In quantitative realms, establishing validity might involve checking a measure against established criteria or examining how its items relate to the target construct, alongside reliability checks such as test-retest consistency or the internal consistency of a test.

In the qualitative sphere, however, the focus shifts to ensuring that the researcher's interpretations align with the actual experiences and perspectives of their subjects.

This distinction is fundamental because it impacts how researchers engage in research design, gather data, and draw conclusions. Ensuring validity in qualitative research is like weaving a tapestry: every strand of data must be carefully interwoven with the interpretive threads of the researcher, creating a cohesive and faithful representation of the studied experience.

While these terms are associated more closely with quantitative research, internal and external validity can still be relevant concepts within the context of qualitative inquiries. Grasping these notions can help qualitative researchers better navigate the challenges of ensuring their findings are both credible and applicable in wider contexts.

Internal validity

Internal validity refers to the authenticity and truthfulness of the findings within the study itself. In qualitative research, this might involve asking: do the conclusions drawn genuinely reflect the perspectives and experiences of the study's participants?

Internal validity revolves around depth of understanding, ensuring that the researcher's interpretations are grounded in participants' realities. Techniques like member checking, where participants review and verify the researcher's interpretations, can bolster internal validity.

External validity

External validity refers to the extent to which the findings of a study can be generalized or applied to other settings or groups. For qualitative researchers, the emphasis isn't on statistical generalizability, as often seen in quantitative studies. Instead, it's about transferability.

It becomes a matter of determining how and where the insights gathered might be relevant in other contexts. This doesn't mean that every qualitative study's findings will apply universally, but qualitative researchers should provide enough detail (through rich, thick descriptions) to allow readers or other researchers to determine the potential for transfer to other contexts.


Looking deeper into the realm of validity, it's crucial to recognize and understand its various types. Each type offers distinct criteria and methods of evaluation, ensuring that research remains robust and genuine. Here's an exploration of some of these types.

Construct validity

Construct validity is a cornerstone in research methodology. It pertains to ensuring that the tools or methods used in a research study genuinely capture the intended theoretical constructs.

In qualitative research, the challenge lies in the abstract nature of many constructs. For example, if one were to investigate "emotional intelligence" or "social cohesion," the definitions might vary, making them hard to pin down.


To bolster construct validity, it is important to clearly and transparently define the concepts being studied. In addition, researchers may triangulate data from multiple sources, ensuring that different viewpoints converge towards a shared understanding of the construct. Furthermore, they might delve into iterative rounds of data collection, refining their methods with each cycle to better align with the conceptual essence of their focus.

Content validity

Content validity's emphasis is on the breadth and depth of the content being assessed. In other words, content validity refers to capturing all relevant facets of the phenomenon being studied. Within qualitative paradigms, ensuring comprehensive representation is paramount. If, for instance, a researcher is using interview protocols to understand community perceptions of a local policy, it's crucial that the questions encompass all relevant aspects of that policy. This could range from its implementation and impact to public awareness and opinion variations across demographic groups.

Enhancing content validity can involve expert reviews, where subject matter experts evaluate tools or methods for comprehensiveness. Another strategy might involve pilot studies, where preliminary data collection reveals gaps or overlooked aspects that can be addressed in the main study.

Ecological validity

Ecological validity refers to the genuine reflection of real-world situations in research findings. For qualitative researchers, this means their observations, interpretations, and conclusions should resonate with the participants and context being studied.

If a study explores classroom dynamics, for example, studying students and teachers in a controlled research setting would have lower ecological validity than studying real classroom settings. Ecological validity is important to consider because it helps ensure the research is relevant to the people being studied. Individuals might behave entirely differently in a controlled environment as opposed to their everyday natural settings.

Ecological validity tends to be stronger in qualitative research compared to quantitative research, because qualitative researchers are typically immersed in their study context and explore participants' subjective perceptions and experiences. Quantitative research, in contrast, can sometimes be more artificial if behavior is being observed in a lab or participants have to choose from predetermined options to answer survey questions.

Qualitative researchers can further bolster ecological validity through immersive fieldwork, where researchers spend extended periods in the studied environment. This immersion helps them capture the nuances and intricacies that might be missed in brief or superficial engagements.

Face validity

Face validity, while seemingly straightforward, holds significant weight in the preliminary stages of research. It serves as a litmus test, gauging the apparent appropriateness and relevance of a tool or method. If a researcher is developing a new interview guide to gauge employee satisfaction, for instance, a quick assessment from colleagues or a focus group can reveal if the questions intuitively seem fit for the purpose.

While face validity is more subjective and lacks the depth of other validity types, it's a crucial initial step, ensuring that the research starts on the right foot.

Criterion validity

Criterion validity evaluates how well the results obtained from one method correlate with those from another, more established method. In many research scenarios, establishing high criterion validity involves using statistical methods to measure validity. For instance, a researcher might utilize the appropriate statistical tests to determine the strength and direction of the linear relationship between two sets of data.

If a new measurement tool or method is being introduced, its validity might be established by statistically correlating its outcomes with those of a gold standard or previously validated tool. Correlational statistics can estimate the strength of the relationship between the new instrument and the previously established instrument, and regression analyses can also be useful to predict outcomes based on established criteria.

While these methods are traditionally aligned with quantitative research, qualitative researchers, particularly those using mixed methods, may also find value in these statistical approaches, especially when wanting to quantify certain aspects of their data for comparative purposes. More broadly, qualitative researchers could compare their operationalizations and findings to other similar qualitative studies to assess that they are indeed examining what they intend to study.
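As a minimal sketch of the statistical side of criterion validity, the Python example below correlates a hypothetical new instrument with a hypothetical gold-standard measure and fits a simple regression; the data and variable names are invented for illustration.

```python
# Minimal sketch: criterion validity as the correlation (and regression)
# between a new instrument and an established "gold standard" measure.
import numpy as np
from scipy.stats import pearsonr, linregress

new_tool = np.array([12, 18, 25, 9, 21, 15, 28, 11, 19, 23])        # new instrument
gold_standard = np.array([14, 20, 27, 10, 22, 17, 30, 12, 18, 25])  # validated measure

r, p = pearsonr(new_tool, gold_standard)
fit = linregress(new_tool, gold_standard)

print(f"Criterion correlation: r = {r:.2f} (p = {p:.3f})")
print(f"Regression: criterion ~ {fit.slope:.2f} * new_tool + {fit.intercept:.2f}")

# A high correlation with the gold standard is evidence of criterion validity;
# the regression line shows how new-tool scores map onto the criterion.
```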

In the realm of qualitative research, the role of the researcher is not just that of an observer but often that of an active participant in the meaning-making process. This unique positioning means the researcher's perspectives and interactions can significantly influence the data collected and its interpretation. Here's a deep dive into the researcher's pivotal role in upholding validity.

Reflexivity

A key concept in qualitative research, reflexivity requires researchers to continually reflect on their worldviews, beliefs, and potential influence on the data. By maintaining a reflexive journal or engaging in regular introspection, researchers can identify and address their own biases, ensuring a more genuine interpretation of participant narratives.

Building rapport

The depth and authenticity of information shared by participants often hinge on the rapport and trust established with the researcher. By cultivating genuine, non-judgmental, and empathetic relationships with participants, researchers can enhance the validity of the data collected.

Positionality

Every researcher brings to the study their own background, including their culture, education, socioeconomic status, and more. Recognizing how this positionality might influence interpretations and interactions is crucial. By acknowledging and transparently sharing their positionality, researchers can offer context to their findings and interpretations.

Active listening

The ability to listen without imposing one's own judgments or interpretations is vital. Active listening ensures that researchers capture the participants' experiences and emotions without distortion, enhancing the validity of the findings.

Transparency in methods

To ensure validity, researchers should be transparent about every step of their process. From how participants were selected to how data was analyzed, clear documentation offers others a chance to understand and evaluate the research's authenticity and rigor.

Member checking

Once data is collected and interpreted, revisiting participants to confirm the researcher's interpretations can be invaluable. This process, known as member checking, ensures that the researcher's understanding aligns with the participants' intended meanings, bolstering validity.

Embracing ambiguity

Qualitative data can be complex and sometimes contradictory. Instead of trying to fit data into preconceived notions or frameworks, researchers must embrace ambiguity, acknowledging areas of uncertainty or multiple interpretations.


Validity in research: a guide to measuring the right things

Last updated 27 February 2023. Reviewed by Cathy Heath.

Validity is necessary for all types of studies ranging from market validation of a business or product idea to the effectiveness of medical trials and procedures. So, how can you determine whether your research is valid? This guide can help you understand what validity is, the types of validity in research, and the factors that affect research validity.


  • What is validity?

In the most basic sense, validity is the quality of being based on truth or reason. Valid research strives to eliminate the effects of unrelated information and the circumstances under which evidence is collected. 

Validity in research is the ability to conduct an accurate study with the right tools and conditions to yield acceptable and reliable data that can be reproduced. Researchers rely on carefully calibrated tools for precise measurements. However, collecting accurate information can be more of a challenge.

To achieve and maintain validity, studies must be conducted in environments that don't sway the results. Validity can be compromised by asking the wrong questions or relying on limited data.

Why is validity important in research?

Research is used to improve human life. Every discovery and product, from innovative medical breakthroughs to advanced consumer goods, depends on accurate research to be dependable. Without it, the results couldn't be trusted, products would likely fail, businesses would lose money, and patients couldn't rely on medical treatments.

While wasting money on a lousy product is a concern, a lack of validity paints a much grimmer picture in fields such as medicine or the manufacture of automobiles and airplanes. Whether you're launching an exciting new product or conducting scientific research, validity can determine success or failure.

  • What is reliability?

Reliability is the ability of a method to yield consistency. If the same result can be consistently achieved by using the same method to measure something, the measurement method is said to be reliable. For example, a thermometer that shows the same temperatures each time in a controlled environment is reliable.

While high reliability is a part of measuring validity, it's only part of the puzzle. If the reliable thermometer hasn't been properly calibrated and reliably measures temperatures two degrees too high, it doesn't provide a valid (accurate) measure of temperature. 

Similarly, if a researcher uses a thermometer to measure weight, the results won't be accurate because it's the wrong tool for the job. 

  • How are reliability and validity assessed?

While measuring reliability is a part of measuring validity, there are distinct ways to assess both measurements for accuracy. 

How is reliability measured?

Reliability is assessed through measures of consistency and stability, including:

  • Consistency and stability of the same measure when repeated multiple times under the same conditions
  • Consistency and stability of the measure across different test subjects
  • Consistency and stability of results from different parts of a test designed to measure the same thing

How is validity measured?

Since validity refers to how accurately a method measures what it is intended to measure, it can be difficult to assess directly. Validity can be estimated by comparing research results to other relevant data or theories, considering:

  • The adherence of a measure to existing knowledge of how the concept is measured
  • The ability to cover all aspects of the concept being measured
  • The relation of the result to other valid measures of the same concept

  • What are the types of validity in a research design?

Research validity is broadly grouped into two categories: internal and external. Yet this grouping doesn't fully capture the different types of validity, which can be divided into seven distinct types.

Face validity: A test that appears valid simply because of the appropriateness or relevance of the testing method, included information, or tools used.

Content validity: The determination that the measure used in research covers the full domain of the content.

Construct validity: The assessment of the suitability of the measurement tool to measure the construct being studied.

Internal validity: The assessment of how your research environment affects measurement results; the extent to which an observed cause-and-effect response can't be explained by other factors.

External validity: The extent to which the study will be accurate beyond the sample and the level to which it can be generalized to other settings, populations, and measures.

Statistical conclusion validity: The determination of whether a relationship exists between procedures and outcomes (appropriate sampling and measuring procedures along with appropriate statistical tests).

Criterion-related validity: A measurement of the quality of your testing methods against a criterion measure (like a "gold standard" test) that is measured at the same time.

  • Examples of validity

Like different types of research and the various ways to measure validity, examples of validity can vary widely. These include:

  • A questionnaire may be considered valid because each question addresses specific and relevant aspects of the study subject.
  • In a brand assessment study, researchers can use comparison testing to verify the results of an initial study. For example, the results from a focus group response about brand perception are considered more valid when the results match those of a questionnaire answered by current and potential customers.
  • A test to measure a class of students' understanding of the English language contains reading, writing, listening, and speaking components to cover the full scope of how language is used.

  • Factors that affect research validity

Certain factors can affect research validity in both positive and negative ways. By understanding the factors that improve validity and those that threaten it, you can enhance the validity of your study. These include:

  • Random selection of participants vs. the selection of participants that are representative of your study criteria
  • Blinding, with interventions the participants are unaware of (like the use of placebos)
  • Manipulating the experiment by inserting a variable that will change the results
  • Randomly assigning participants to treatment and control groups to avoid bias
  • Following specific procedures during the study to avoid unintended effects
  • Conducting a study in the field instead of a laboratory for more accurate results
  • Replicating the study with different factors or settings to compare results
  • Using statistical methods to adjust for inconclusive data

What are the common validity threats in research, and how can their effects be minimized or nullified?

Research validity can be difficult to achieve because of internal and external threats that produce inaccurate results. The following factors can jeopardize validity:

  • History: Events that occur between an early and a later measurement
  • Maturation: Natural changes in subjects over the course of the study (e.g., growing older, more experienced, or fatigued) that can be mistaken for effects of the study
  • Repeated testing: The outcome of earlier tests can change the outcome of subsequent tests
  • Selection of subjects: Unconscious bias that can result in comparison groups that are not equivalent
  • Statistical regression: Choosing subjects based on extreme scores doesn't yield an accurate outcome for the majority of individuals, because extreme scores tend to move toward the average on retesting
  • Attrition: When the sample group is diminished significantly during the course of the study

While some validity threats can be minimized or wholly nullified, removing all threats from a study is impossible. For example, random selection can reduce unconscious bias and statistical regression.

Researchers can even hope to avoid attrition by using smaller study groups. Yet, smaller study groups could potentially affect the research in other ways. The best practice for researchers to prevent validity threats is careful environmental planning and reliable data-gathering methods.

  • How to ensure validity in your research

Researchers should be mindful of the importance of validity in the early planning stages of any study to avoid inaccurate results. They must take the time to consider tools and methods as well as how closely the testing environment matches the natural environment in which results will be used.

The following steps can be used to ensure validity in research:

  • Choose appropriate methods of measurement
  • Use appropriate sampling to choose test subjects
  • Create an accurate testing environment

How do you maintain validity in research?

Accurate research is usually conducted over a period of time with different test subjects. To maintain validity across an entire study, you must take specific steps to ensure that gathered data has the same levels of accuracy. 

Consistency is crucial for maintaining validity in research. When researchers apply methods consistently and standardize the circumstances under which data is collected, validity can be maintained across the entire study.

Is there a need for validation of the research instrument before its implementation?

An essential part of validity is choosing the right research instrument or method for accurate results. Consider the thermometer that is reliable but still produces inaccurate results: you're unlikely to achieve research validity without steps such as calibration and checks of content and construct validity.

  • Understanding research validity for more accurate results

Without validity, research can't provide the accuracy necessary to deliver a useful study. By getting a clear understanding of validity in research, you can take steps to improve your research skills and achieve more accurate results.



The 4 Types of Validity | Types, Definitions & Examples

Published on 3 May 2022 by Fiona Middleton. Revised on 10 October 2022.

In quantitative research, you have to consider the reliability and validity of your methods and measurements.

Validity tells you how accurately a method measures something. If a method measures what it claims to measure, and the results closely correspond to real-world values, then it can be considered valid. There are four main types of validity:

  • Construct validity: Does the test measure the concept that it’s intended to measure?
  • Content validity: Is the test fully representative of what it aims to measure?
  • Face validity: Does the content of the test appear to be suitable to its aims?
  • Criterion validity: Do the results accurately measure the concrete outcome they are designed to measure?

Note that this article deals with types of test validity, which determine the accuracy of the actual components of a measure. If you are doing experimental research, you also need to consider internal and external validity, which deal with the experimental design and the generalisability of results.

Table of contents

  • Construct validity
  • Content validity
  • Face validity
  • Criterion validity

Construct validity evaluates whether a measurement tool really represents the thing we are interested in measuring. It’s central to establishing the overall validity of a method.

What is a construct?

A construct refers to a concept or characteristic that can’t be directly observed but can be measured by observing other indicators that are associated with it.

Constructs can be characteristics of individuals, such as intelligence, obesity, job satisfaction, or depression; they can also be broader concepts applied to organisations or social groups, such as gender equality, corporate social responsibility, or freedom of speech.

What is construct validity?

Construct validity is about ensuring that the method of measurement matches the construct you want to measure. If you develop a questionnaire to diagnose depression, you need to know: does the questionnaire really measure the construct of depression? Or is it actually measuring the respondent’s mood, self-esteem, or some other construct?

To achieve construct validity, you have to ensure that your indicators and measurements are carefully developed based on relevant existing knowledge. The questionnaire must include only relevant questions that measure known indicators of depression.

The other types of validity described below can all be considered as forms of evidence for construct validity.


Content validity assesses whether a test is representative of all aspects of the construct.

To produce valid results, the content of a test, survey, or measurement method must cover all relevant parts of the subject it aims to measure. If some aspects are missing from the measurement (or if irrelevant aspects are included), the validity is threatened.

Face validity considers how suitable the content of a test seems to be on the surface. It’s similar to content validity, but face validity is a more informal and subjective assessment.

As face validity is a subjective measure, it’s often considered the weakest form of validity. However, it can be useful in the initial stages of developing a method.

Criterion validity evaluates how well a test can predict a concrete outcome, or how well the results of your test approximate the results of another test.

What is a criterion variable?

A criterion variable is an established and effective measurement that is widely considered valid, sometimes referred to as a ‘gold standard’ measurement. Criterion variables can be very difficult to find.

What is criterion validity?

To evaluate criterion validity, you calculate the correlation between the results of your measurement and the results of the criterion measurement. If there is a high correlation, this gives a good indication that your test is measuring what it intends to measure.


Grad Coach

Validity & Reliability In Research

A Plain-Language Explanation (With Examples)

By: Derek Jansen (MBA) | Expert Reviewer: Kerryn Warren (PhD) | September 2023

Validity and reliability are two related but distinctly different concepts within research. Understanding what they are and how to achieve them is critically important to any research project. In this post, we’ll unpack these two concepts as simply as possible.

This post is based on our popular online course, Research Methodology Bootcamp, in which we unpack the basics of methodology using straightforward language and loads of examples.

Overview: Validity & Reliability

  • The big picture
  • Validity 101
  • Reliability 101 
  • Key takeaways

First, The Basics…

First, let’s start with a big-picture view and then we can zoom in to the finer details.

Validity and reliability are two incredibly important concepts in research, especially within the social sciences. Both have to do with the measurement of variables and/or constructs – for example, job satisfaction, intelligence, productivity, etc. When undertaking research, you’ll often want to measure these types of constructs and variables and, at the simplest level, validity and reliability are about ensuring the quality and accuracy of those measurements.

As you can probably imagine, if your measurements aren’t accurate or there are quality issues at play when you’re collecting your data, your entire study will be at risk. Therefore, validity and reliability are very important concepts to understand (and to get right). So, let’s unpack each of them.


What Is Validity?

In simple terms, validity (often discussed in terms of “construct validity”) is all about whether a research instrument accurately measures what it’s supposed to measure.

For example, let’s say you have a set of Likert scales that are supposed to quantify someone’s level of overall job satisfaction. If this set of scales focused on only one dimension of job satisfaction, say pay satisfaction, it would not be a valid measurement, as it only captures one aspect of the multidimensional construct. In other words, pay satisfaction alone is only one contributing factor toward overall job satisfaction, and therefore it’s not a valid way to measure someone’s job satisfaction.


Oftentimes in quantitative studies, the way in which the researcher or survey designer interprets a question or statement can differ from how the study participants interpret it. Given that respondents don’t have the opportunity to ask clarifying questions when taking a survey, it’s easy for these sorts of misunderstandings to crop up. Naturally, if the respondents are interpreting the question in the wrong way, the data they provide will be pretty useless. Therefore, ensuring that a study’s measurement instruments are valid – in other words, that they are measuring what they intend to measure – is incredibly important.

There are various types of validity and we’re not going to go down that rabbit hole in this post, but it’s worth quickly highlighting the importance of making sure that your research instrument is tightly aligned with the theoretical construct you’re trying to measure. In other words, you need to pay careful attention to how the key theories within your study define the thing you’re trying to measure – and then make sure that your survey presents it in the same way.

For example, sticking with the “job satisfaction” construct we looked at earlier, you’d need to clearly define what you mean by job satisfaction within your study (and this definition would of course need to be underpinned by the relevant theory). You’d then need to make sure that your chosen definition is reflected in the types of questions or scales you’re using in your survey. Simply put, you need to make sure that your survey respondents are perceiving your key constructs in the same way you are. Or, even if they’re not, that your measurement instrument is capturing the necessary information that reflects your definition of the construct at hand.


What Is Reliability?

As with validity, reliability is an attribute of a measurement instrument – for example, a survey, a weight scale or even a blood pressure monitor. But while validity is concerned with whether the instrument is measuring the “thing” it’s supposed to be measuring, reliability is concerned with consistency and stability. In other words, reliability reflects the degree to which a measurement instrument produces consistent results when applied repeatedly to the same phenomenon, under the same conditions.

As you can probably imagine, a measurement instrument that achieves a high level of consistency is naturally more dependable (or reliable) than one that doesn’t – in other words, it can be trusted to provide consistent measurements. And that, of course, is what you want when undertaking empirical research. If you think about it within a more domestic context, just imagine if you found that your bathroom scale gave you a different number every time you hopped on and off of it – you wouldn’t feel too confident in its ability to measure the variable that is your body weight 🙂

It’s worth mentioning that reliability also extends to the person using the measurement instrument. For example, if two researchers use the same instrument (let’s say a measuring tape) and they get different measurements, there’s likely an issue in terms of how one (or both) of them are using the measuring tape. So, when you think about reliability, consider both the instrument and the researcher as part of the equation.

As with validity, there are various types of reliability and various tests that can be used to assess the reliability of an instrument. A popular one that you’ll likely come across for survey instruments is Cronbach’s alpha, which is a statistical measure that quantifies the degree to which items within an instrument (for example, a set of Likert scales) measure the same underlying construct. In other words, Cronbach’s alpha indicates how closely related the items are and whether they consistently capture the same concept.
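To show what that calculation involves, here is a minimal sketch of Cronbach’s alpha computed directly from its standard formula; the item responses below are hypothetical, invented purely for illustration.

```python
# Minimal sketch: Cronbach's alpha from its standard formula,
# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).
import numpy as np

# Hypothetical responses: rows = respondents, columns = Likert items (1-5).
items = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 4, 5],
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [3, 2, 3, 3],
])

k = items.shape[1]                            # number of items
item_vars = items.var(axis=0, ddof=1).sum()   # sum of per-item variances
total_var = items.sum(axis=1).var(ddof=1)     # variance of respondents' totals

alpha = (k / (k - 1)) * (1 - item_vars / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")  # ~0.70+ is often treated as acceptable
```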

Reliability reflects whether an instrument produces consistent results when applied to the same phenomenon, under the same conditions.

Recap: Key Takeaways

Alright, let’s quickly recap to cement your understanding of validity and reliability:

  • Validity is concerned with whether an instrument (e.g., a set of Likert scales) is measuring what it’s supposed to measure
  • Reliability is concerned with whether that measurement is consistent and stable when measuring the same phenomenon under the same conditions.

In short, validity and reliability are both essential to ensuring that your data collection efforts deliver high-quality, accurate data that help you answer your research questions. So, be sure to always pay careful attention to the validity and reliability of your measurement instruments when collecting and analysing data. As the adage goes, “rubbish in, rubbish out” – make sure that your data inputs are rock-solid.



Reliability and Validity – Definitions, Types & Examples

Published by Alvin Nicolas on August 16th, 2021. Revised on October 26, 2023.

A researcher must test the collected data before drawing any conclusions. Every research design needs to be concerned with reliability and validity to measure the quality of the research.

What is Reliability?

Reliability refers to the consistency of the measurement. Reliability shows how trustworthy the score of the test is. If the collected data shows the same results after being tested using various methods and sample groups, the information is reliable. Note, however, that a reliable method is not automatically valid.

Example: If you weigh yourself on a weighing scale throughout the day, you’ll get the same results. These are considered reliable results obtained through repeated measures.

Example: If a teacher gives students the same math test and repeats it the next week with the same questions, and the students obtain the same scores, the reliability of the test is high.

What is Validity?

Validity refers to the accuracy of the measurement. Validity shows how suitable a specific test is for a particular situation. If the results are accurate according to the researcher’s situation, explanation, and prediction, then the research is valid.

If the method of measuring is accurate, then it’ll produce accurate results. A method must be reliable to be valid, but reliability on its own is not enough: a method can be reliable without being valid, while an unreliable method cannot be valid.

Example: Your weighing scale shows different results each time you weigh yourself within a day, even after handling it carefully and weighing before and after meals. Your weighing scale might be malfunctioning. It means your method has low reliability, so you are getting inaccurate, inconsistent results that are not valid.

Example: Suppose a questionnaire is distributed among a group of people to check the quality of a skincare product, and the same questionnaire is repeated with many groups. If you get the same responses from the various participants, the questionnaire has high reliability, which supports its validity.

Most of the time, validity is difficult to measure even when the process of measurement is reliable, because it isn’t easy to confirm that the results reflect the real situation.

Example: If the weighing scale shows the same result, let’s say 70 kg each time, even though your actual weight is 55 kg, then the weighing scale is malfunctioning. It shows consistent results, so it is reliable, but it is not valid. Consistency alone cannot make a method valid.

Internal Vs. External Validity

One of the key features of randomised designs is that they have significantly high internal and external validity.

Internal validity is the ability to draw a causal link between your treatment and the dependent variable of interest. It means the observed changes should be due to the experiment conducted, and no external factor should influence the variables.

Example: external factors such as participants’ age, education level, height, or grade should not account for the observed changes.

External validity  is the ability to identify and generalise your study outcomes to the population at large. The relationship between the study’s situation and the situations outside the study is considered external validity.



Threats to Internal Validity

  • Confounding factors: Unexpected events during the experiment that are not part of the treatment. Example: You attribute the increased weight of your participants to a lack of physical activity when it was actually due to their consumption of coffee with sugar.
  • Maturation: The influence of the passage of time on the dependent variable. Example: During a long-term experiment, subjects may become tired, bored, or hungry.
  • Testing: The results of one test affect the results of another test. Example: Participants of the first experiment may react differently during the second experiment.
  • Instrumentation: Changes in the instrument’s calibration during the study. Example: A change in the measuring instrument may give different results than expected.
  • Statistical regression: Groups selected for their extreme scores are not as extreme on subsequent testing. Example: Students who failed the pre-final exam are likely to score better in the final exam, because extremely low scores tend to move toward the average on retesting.
  • Selection bias: Choosing comparison groups without randomisation. Example: A group of trained and efficient teachers is selected to teach children communication skills instead of being chosen randomly.
  • Experimental mortality: Participants may leave the experiment if it runs longer than expected. Example: Participants may abandon the study because they are dissatisfied with a time extension, even if they were doing well.

Threats to External Validity

  • Reactive/interactive effects of testing: Participants who take a pre-test become aware of the coming experiment, so the treatment may not be effective without the pre-test. Example: Students who take a pre-test may respond differently to the treatment because the pre-test has alerted them to what is being studied.
  • Selection of participants: When participants are selected for specific characteristics, the treatment may work only on participants possessing those characteristics. Example: If an experiment is conducted specifically on the health issues of pregnant women, the same treatment cannot be generalised to male participants.

How to Assess Reliability and Validity?

Reliability can be measured by comparing the consistency of a procedure and its results. Various statistical methods can be used, depending on the type of reliability, as explained below; a short code sketch after the table illustrates two of these checks.

Types of Reliability

  • Test-retest: Measures the consistency of results at different points in time, identifying whether results are the same after repeated measurement. Example: A questionnaire about the quality of a skincare product is distributed to the same groups of people on two occasions. If you get the same responses both times, the questionnaire has high test-retest reliability.
  • Inter-rater: Measures the consistency of results obtained at the same time by different raters (researchers). Example: Five researchers measure the academic performance of the same student, using questions from all the academic subjects, and arrive at very different results. This shows that the assessment has low inter-rater reliability.
  • Parallel forms: Measures equivalence between different forms of the same test performed on the same participants. Example: The same researcher administers two different forms of a test on the same topic to the same students, say a written and an oral test. If the results agree, the parallel-forms reliability of the test is high; if they differ, it is low.
  • Split-half (internal consistency): Measures the internal consistency of the measurement. The results of the same test are split into two halves and compared with each other. If there is a lot of difference between the halves, the split-half reliability of the test is low.
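As a rough illustration of two of the checks in the table, the Python sketch below computes test-retest reliability as a simple correlation and split-half reliability with the Spearman-Brown correction; all scores are hypothetical, invented purely for illustration.

```python
# Minimal sketch: test-retest reliability and split-half reliability
# (with the Spearman-Brown correction) on hypothetical scores.
import numpy as np
from scipy.stats import pearsonr

# Test-retest: same test, same people, two administrations.
time1 = np.array([55, 62, 48, 70, 66, 59, 73, 51])
time2 = np.array([57, 60, 50, 71, 64, 61, 72, 49])
r_retest, _ = pearsonr(time1, time2)
print(f"Test-retest reliability: r = {r_retest:.2f}")

# Split-half: correlate the two halves of one test, then correct upward,
# since each half is shorter (and therefore less reliable) than the full test.
half_a = np.array([24, 30, 21, 35, 31, 27, 36, 23])
half_b = np.array([26, 29, 23, 34, 33, 28, 35, 25])
r_halves, _ = pearsonr(half_a, half_b)
print(f"Split-half reliability (Spearman-Brown): {(2 * r_halves) / (1 + r_halves):.2f}")
```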

Types of Validity

As discussed above, the reliability of a measurement alone cannot determine its validity. Validity is difficult to measure even when the method is reliable. The following types of test are used to measure validity.

Type of validity | What does it measure? | Example
Content validity | Whether all aspects of the concept are covered by the test/measurement. | A language test designed to measure reading, writing, listening, and speaking skills covers the whole domain of language ability, indicating high content validity.
Face validity | Whether the test appears, on the surface, to measure what it claims to; the appearance of the test or its procedure. | The type of questions included in the question paper, the time and marks allotted, and the number and categories of questions: does it look like a good question paper for measuring the academic performance of students?
Construct validity | Whether the test measures the intended construct (ability, attribute, trait, or skill). | Is a test conducted to measure communication skills actually measuring communication skills?
Criterion validity | Whether the test scores agree with other measures of the same concept. | The results obtained from a pre-final exam of graduates accurately predict the results of the later final exam, showing that the test has high criterion validity.
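
Continuing the exam example in the criterion validity row, predictive criterion validity can be estimated by correlating test scores with the later criterion. A minimal sketch, assuming hypothetical pre-final and final exam scores:

```python
import numpy as np

# Hypothetical scores for eight students
prefinal = np.array([55, 68, 72, 80, 45, 90, 63, 77])
final = np.array([58, 70, 69, 84, 50, 88, 60, 79])

# Pearson correlation between the test and the external criterion
r = np.corrcoef(prefinal, final)[0, 1]
print(f"Criterion (predictive) validity coefficient: {r:.2f}")
```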

How to Increase Reliability?

  • Use an appropriate questionnaire to measure the competency level.
  • Ensure a consistent environment for participants.
  • Make the participants familiar with the assessment criteria.
  • Train the participants appropriately.
  • Analyse the research items regularly to identify poorly performing items.

How to Increase Validity?

Ensuring validity is not easy either. The following practices help to ensure validity:

  • Minimise reactivity as the first concern.
  • Reduce the Hawthorne effect.
  • Keep respondents motivated.
  • Keep the interval between the pre-test and post-test short.
  • Minimise dropout rates.
  • Ensure inter-rater reliability.
  • Match the control and experimental groups with each other.

How to Implement Reliability and Validity in your Thesis?

According to experts, it is helpful to implement the concepts of reliability and validity explicitly, especially in a thesis or dissertation, where these concepts are relied on heavily. A method for implementing them is given below:

Segment | Explanation
Methodology | Discuss all the planning around reliability and validity here, including the chosen sample and its size and the techniques used to measure reliability and validity.
Results | Discuss the level of reliability and validity of your results and their influence on the values obtained.
Discussion | Discuss the contribution of other researchers to improving reliability and validity.

Frequently Asked Questions

What are reliability and validity in research?

Reliability in research refers to the consistency and stability of measurements or findings. Validity relates to the accuracy and truthfulness of results, measuring what the study intends to. Both are crucial for trustworthy and credible research outcomes.

What is validity?

Validity in research refers to the extent to which a study accurately measures what it intends to measure. It ensures that the results are truly representative of the phenomena under investigation. Without validity, research findings may be irrelevant, misleading, or incorrect, limiting their applicability and credibility.

What is reliability?

Reliability in research refers to the consistency and stability of measurements over time. If a study is reliable, repeating the experiment or test under the same conditions should produce similar results. Without reliability, findings become unpredictable and lack dependability, potentially undermining the study’s credibility and generalisability.

What is reliability in psychology?

In psychology, reliability refers to the consistency of a measurement tool or test. A reliable psychological assessment produces stable and consistent results across different times, situations, or raters. It ensures that an instrument’s scores are not due to random error, making the findings dependable and reproducible in similar conditions.

What is test retest reliability?

Test-retest reliability assesses the consistency of measurements taken by a test over time. It involves administering the same test to the same participants at two different points in time and comparing the results. A high correlation between the scores indicates that the test produces stable and consistent results over time.

How to improve reliability of an experiment?

  • Standardise procedures and instructions.
  • Use consistent and precise measurement tools.
  • Train observers or raters to reduce subjective judgments.
  • Increase sample size to reduce random errors.
  • Conduct pilot studies to refine methods.
  • Repeat measurements or use multiple methods.
  • Address potential sources of variability.

What is the difference between reliability and validity?

Reliability refers to the consistency and repeatability of measurements, ensuring results are stable over time. Validity indicates how well an instrument measures what it’s intended to measure, ensuring accuracy and relevance. While a test can be reliable without being valid, a valid test must inherently be reliable. Both are essential for credible research.

Are interviews reliable and valid?

Interviews can be both reliable and valid, but they are susceptible to biases. The reliability and validity depend on the design, structure, and execution of the interview. Structured interviews with standardised questions improve reliability. Validity is enhanced when questions accurately capture the intended construct and when interviewer biases are minimised.

Are IQ tests valid and reliable?

IQ tests are generally considered reliable, producing consistent scores over time. Their validity, however, is a subject of debate. While they effectively measure certain cognitive skills, whether they capture the entirety of “intelligence” or predict success in all life areas is contested. Cultural bias and over-reliance on tests are also concerns.

Are questionnaires reliable and valid?

Questionnaires can be both reliable and valid if well-designed. Reliability is achieved when they produce consistent results over time or across similar populations. Validity is ensured when questions accurately measure the intended construct. However, factors like poorly phrased questions, respondent bias, and lack of standardisation can compromise their reliability and validity.

Reliability vs. Validity in Research: Types & Examples

When it comes to research, getting things right is crucial. That’s where the concepts of “Reliability vs Validity in Research” come in. 

Imagine it like a balancing act – making sure your measurements are consistent and accurate at the same time. This is where test-retest reliability, having different researchers check things, and keeping things consistent within your research play a big role.

As we dive into this topic, we’ll uncover the differences between reliability and validity, see how they work together, and learn how to use them effectively.

Understanding Reliability vs. Validity in Research

When it comes to collecting data and conducting research, two crucial concepts stand out: reliability and validity. 

These pillars uphold the integrity of research findings, ensuring that the data collected and the conclusions drawn are both meaningful and trustworthy. Let’s dive into the heart of the concepts, reliability, and validity, to comprehend their significance in the realm of research truly.

What is reliability?

Reliability refers to the consistency and dependability of the data collection process. It’s like having a steady hand that produces the same result each time it reaches for a task. 

In the research context, reliability is all about ensuring that if you were to repeat the same study using the same reliable measurement technique, you’d end up with the same results. It’s like having multiple researchers independently conduct the same experiment and getting outcomes that align perfectly.

Imagine you’re using a thermometer to measure the temperature of the water. You have a reliable measurement if you dip the thermometer into the water multiple times and get the same reading each time. This tells you that your method and measurement technique consistently produce the same results, whether it’s you or another researcher performing the measurement.

What is validity?

On the other hand, validity refers to the accuracy and meaningfulness of your data. It’s like ensuring that the puzzle pieces you’re putting together actually form the intended picture. When you have validity, you know that your method and measurement technique are consistent and capable of producing results aligned with reality.

Think of it this way: imagine you’re conducting a test that claims to measure a specific trait, like problem-solving ability. If the test consistently produces results that accurately reflect participants’ problem-solving skills, then the test has high validity. In this case, the test produces accurate results that truly correspond to the trait it aims to measure.

In essence, while reliability assures you that your data collection process is like a well-oiled machine producing the same results, validity steps in to ensure that these results are not only consistent but also relevantly accurate. 

Together, these concepts provide researchers with the tools to conduct research that stands on a solid foundation of dependable methods and meaningful insights.

Types of Reliability

Let’s explore the various types of reliability that researchers consider to ensure their work stands on solid ground.

High test-retest reliability

Test-retest reliability involves assessing the consistency of measurements over time. It’s like taking the same measurement or test twice – once and then again after a certain period. If the results align closely, it indicates that the measurement is reliable over time. Think of it as capturing the essence of stability. 

Inter-rater reliability

When multiple researchers or observers are part of the equation, inter-rater reliability comes into play. This type of reliability assesses the level of agreement between different observers when evaluating the same phenomenon. It’s like ensuring that different pairs of eyes perceive things in a similar way.
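
Agreement between two raters is commonly quantified with Cohen’s kappa, which corrects for agreement expected by chance. A minimal sketch using scikit-learn, with hypothetical pass/fail judgments (the data and labels are invented for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical judgments from two raters on the same ten essays
rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]

kappa = cohen_kappa_score(rater_a, rater_b)  # 1.0 = perfect agreement, 0 = chance level
print(f"Cohen's kappa: {kappa:.2f}")
```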

Internal reliability

Internal consistency dives into the harmony among different items within a measurement tool aiming to assess the same concept. This often comes into play in surveys or questionnaires, where participants respond to various items related to a single construct. If the responses to these items consistently reflect the same underlying concept, the measurement is said to have high internal consistency. 
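
Internal consistency is most often summarized with Cronbach’s alpha. The sketch below computes alpha from its standard formula on a hypothetical respondents-by-items matrix; the data and the 0.70 benchmark in the comment are illustrative assumptions:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents x items score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()  # sum of per-item variances
    total_variance = items.sum(axis=1).var(ddof=1)    # variance of respondents' total scores
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical 5-point responses: 6 respondents x 4 items on one construct
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [4, 4, 5, 4],
    [1, 2, 1, 2],
])
print(f"Cronbach's alpha: {cronbach_alpha(responses):.2f}")  # >= 0.70 is a common benchmark
```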

Types of Validity

Let’s explore the various types of validity that researchers consider to ensure their work stands on solid ground.

Content validity

It delves into whether a measurement truly captures all dimensions of the concept it intends to measure. It’s about making sure your measurement tool covers all relevant aspects comprehensively. 

Imagine designing a test to assess students’ understanding of a history chapter. It exhibits high content validity if the test includes questions about key events, dates, and causes. However, if it focuses solely on dates and omits causation, its content validity might be questionable.

Construct validity

It assesses how well a measurement aligns with established theories and concepts. It’s like ensuring that your measurement is a true representation of the abstract construct you’re trying to capture. 

Criterion validity

Criterion validity examines how well your measurement corresponds to other established measurements of the same concept. It’s about making sure your measurement accurately predicts or correlates with external criteria.

Differences between reliability and validity in research

Let’s delve into the differences between reliability and validity in research.

No | Category | Reliability | Validity
01 | Meaning | Focuses on the consistency of measurements over time and conditions. | Concerns the accuracy and relevance of measurements in capturing the intended concept.
02 | What it assesses | Whether the same results can be obtained consistently from repeated measurements. | Whether measurements truly measure what they are intended to measure.
03 | Assessment methods | Evaluated through test-retest consistency, inter-rater agreement, and internal consistency. | Assessed through content coverage, construct alignment, and criterion correlation.
04 | Interrelation | A measurement can be reliable (consistent) without being valid (accurate). | A valid measurement is typically reliable, but high reliability doesn’t guarantee validity.
05 | Importance | Ensures data consistency and replicability. | Guarantees meaningful and credible results.
06 | Focus | The stability and consistency of measurement outcomes. | The meaningfulness and accuracy of measurement outcomes.
07 | Outcome | Reproducibility of measurements is the key outcome. | Meaningful and accurate measurement outcomes are the primary goal.

While both reliability and validity contribute to trustworthy research, they address distinct aspects. Reliability ensures consistent results, while validity ensures accurate and relevant results that reflect the true nature of the measured concept.

Example of Reliability and Validity in Research

In this section, we’ll explore instances that highlight the differences between reliability and validity and how they play a crucial role in ensuring the credibility of research findings.

Example of reliability

Imagine you are studying the reliability of a smartphone’s battery life measurement. To collect data, you fully charge the phone and measure the battery life three times in the same controlled environment—same apps running, same brightness level, and same usage patterns. 

If the measurements consistently show a similar battery life duration each time you repeat the test, it indicates that your measurement method is reliable. The consistent results under the same conditions assure you that the battery life measurement can be trusted to provide dependable information about the phone’s performance.
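
In code, this battery-life check reduces to looking at the spread of repeated measurements. A minimal sketch with hypothetical readings in hours:

```python
import statistics

# Three repeated battery-life tests under identical conditions (hypothetical values)
readings_hours = [11.8, 12.1, 11.9]

mean = statistics.mean(readings_hours)
spread = statistics.stdev(readings_hours)  # a small spread suggests a reliable procedure
print(f"Mean battery life: {mean:.1f} h, standard deviation: {spread:.2f} h")
```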

Example of validity

Researchers collect data from a group of participants in a study aiming to assess the validity of a newly developed stress questionnaire. To ensure validity, they compare the scores obtained from the stress questionnaire with the participants’ actual stress levels measured using physiological indicators such as heart rate variability and cortisol levels. 

If participants’ scores correlate strongly with their physiological stress levels, the questionnaire is valid. This means the questionnaire accurately measures participants’ stress levels, and its results correspond to real variations in their physiological responses to stress. 

Validity, assessed here through the correlation between questionnaire scores and physiological measures, ensures that the questionnaire is effectively measuring what it claims to measure: participants’ stress levels.
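
That convergent check is, computationally, just a correlation between the new measure and the established criterion. A minimal sketch, assuming hypothetical questionnaire totals and cortisol readings:

```python
from scipy.stats import pearsonr

stress_scores = [22, 35, 18, 41, 30, 27, 38, 15]                 # hypothetical questionnaire totals
cortisol_levels = [9.1, 14.2, 8.0, 16.5, 12.3, 11.0, 15.1, 7.4]  # hypothetical readings

r, p = pearsonr(stress_scores, cortisol_levels)
print(f"Questionnaire vs. physiological criterion: r = {r:.2f} (p = {p:.3f})")
# A strong positive correlation supports the questionnaire's validity as a stress measure.
```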

In the world of research, differentiating between reliability and validity is crucial. Reliability ensures consistent results, while validity confirms accurate measurements. For instance, measuring self-esteem consistently over time showcases reliability, while aligning questions with established theories demonstrates validity.

J Family Med Prim Care. 2015 Jul-Sep; 4(3).

Validity, reliability, and generalizability in qualitative research

Lawrence Leung

1 Department of Family Medicine, Queen's University, Kingston, Ontario, Canada

2 Centre of Studies in Primary Care, Queen's University, Kingston, Ontario, Canada

In general practice, qualitative research contributes as significantly as quantitative research, in particular regarding psycho-social aspects of patient care, health services provision, policy setting, and health administration. In contrast to quantitative research, qualitative research as a whole has been constantly critiqued, if not disparaged, for the lack of consensus on assessing its quality and robustness. This article illustrates with five published studies how qualitative research can impact and reshape the discipline of primary care, spiraling out from clinic-based health screening to community-based disease monitoring, evaluation of out-of-hours triage services, a provincial psychiatric care-pathways model and, finally, national legislation of core measures for children’s healthcare insurance. Fundamental concepts of validity, reliability, and generalizability as applicable to qualitative research are then addressed, with an update on current views and controversies.

Nature of Qualitative Research versus Quantitative Research

The essence of qualitative research is to make sense of and recognize patterns among words in order to build up a meaningful picture without compromising its richness and dimensionality. Like quantitative research, qualitative research aims to seek answers to questions of “how, where, when, who and why” with a perspective to build a theory or refute an existing theory. Unlike quantitative research, which deals primarily with numerical data and their statistical interpretations under a reductionist, logical and strictly objective paradigm, qualitative research handles nonnumerical information and its phenomenological interpretation, which inextricably tie in with human senses and subjectivity. While human emotions and perspectives from both subjects and researchers are considered undesirable biases confounding results in quantitative research, the same elements are considered essential and inevitable, if not treasurable, in qualitative research, as they invariably add extra dimensions and color to enrich the corpus of findings. However, the issue of subjectivity and contextual ramifications has fueled incessant controversies regarding yardsticks for the quality and trustworthiness of qualitative research results for healthcare.

Impact of Qualitative Research upon Primary Care

In many ways, qualitative research contributes significantly, if not more so than quantitative research, to the field of primary care at various levels. Five qualitative studies are chosen to illustrate how various methodologies of qualitative research helped in advancing primary healthcare, from novel monitoring of chronic obstructive pulmonary disease (COPD) via mobile-health technology,[1] informed decision-making for colorectal cancer screening,[2] triaging out-of-hours GP services,[3] and evaluating care pathways for community psychiatry[4] to prioritization of healthcare initiatives for legislation purposes at the national level.[5]

With the recent advances in information technology and mobile connecting devices, self-monitoring and management of chronic diseases via tele-health technology may seem beneficial to both the patient and the healthcare provider. Recruiting COPD patients who were given tele-health devices that monitored lung function, Williams et al.[1] conducted phone interviews and analyzed the transcripts via a grounded theory approach; the themes they identified enabled them to conclude that such a mobile-health setup helped to engage patients, with better adherence to treatment and overall improvement in mood. Such positive findings were in contrast to previous studies, which opined that elderly patients were often challenged by operating computer tablets[6] or conversing with the tele-health software.[7] To explore the content of recommendations for colorectal cancer screening given out by family physicians, Wackerbarth et al.[2] conducted semi-structured interviews with subsequent content analysis and found that most physicians delivered information to enrich patient knowledge with little regard to patients’ true understanding, ideas, and preferences in the matter. These findings suggested room for improvement for family physicians to better engage their patients when recommending preventative care.

Faced with various models of out-of-hours triage services for GP consultations, Egbunike et al.[3] conducted thematic analysis on semi-structured telephone interviews with patients and doctors in various urban, rural and mixed settings. They found that the efficiency of triage services remained a prime concern of both users and providers, among issues of access to doctors and unfulfilled or mismatched expectations from users, which could arouse dissatisfaction and have legal implications. In the UK, a care pathways model for community psychiatry had been introduced, but its benefits were unclear. Khandaker et al.[4] hence conducted a qualitative study using semi-structured interviews with medical staff and other stakeholders; adopting a grounded-theory approach, major themes emerged that included improved equality of access, more focused logistics, increased work throughput and better accountability for community psychiatry provided under the care pathways model. Finally, at the US national level, Mangione-Smith et al.[5] employed a modified Delphi method to gather consensus from a panel of nominators who were recognized experts and stakeholders in their disciplines, and identified a core set of quality measures for children’s healthcare under the Medicaid and Children’s Health Insurance Program. These core measures were made transparent for public opinion and later passed on for full legislation, illustrating the impact of qualitative research upon social welfare and policy improvement.

Overall Criteria for Quality in Qualitative Research

Given the diverse genera and forms of qualitative research, there is no consensus for assessing any piece of qualitative research work. Various approaches have been suggested, the two leading schools of thought being that of Dixon-Woods et al.,[8] which emphasizes methodology, and that of Lincoln et al.,[9] which stresses rigor in the interpretation of results. By identifying commonalities of qualitative research, Dixon-Woods produced a checklist of questions for assessing the clarity and appropriateness of the research question; the description and appropriateness of sampling, data collection and data analysis; the levels of support and evidence for claims; the coherence between data, interpretation and conclusions; and finally the level of contribution of the paper. These criteria inform the 10 questions of the Critical Appraisal Skills Programme checklist for qualitative studies.[10] However, these methodology-weighted criteria may not do justice to qualitative studies that differ in epistemological and philosophical paradigms,[11,12] one classic example being positivistic versus interpretivistic.[13] Equally, without a robust methodological layout, the rigorous interpretation of results advocated by Lincoln et al.[9] will not be good either. Meyrick[14] argued from a different angle and proposed fulfillment of the dual core criteria of “transparency” and “systematicity” for good-quality qualitative research: in brief, every step of the research logistics (from theory formation, design of study, sampling, data acquisition and analysis to results and conclusions) has to be validated as sufficiently transparent or systematic. In this manner, both the research process and the results can be assured of high rigor and robustness.[14] Finally, Kitto et al.[15] epitomized six criteria for assessing the overall quality of qualitative research: (i) clarification and justification, (ii) procedural rigor, (iii) sample representativeness, (iv) interpretative rigor, (v) reflexive and evaluative rigor and (vi) transferability/generalizability, which also double as evaluative landmarks for manuscript review for the Medical Journal of Australia. As with quantitative research, the quality of qualitative research can be assessed in terms of validity, reliability, and generalizability.

Validity

Validity in qualitative research means “appropriateness” of the tools, processes, and data: whether the research question is valid for the desired outcome, the choice of methodology is appropriate for answering the research question, the design is valid for the methodology, the sampling and data analysis are appropriate, and finally the results and conclusions are valid for the sample and context. In assessing the validity of qualitative research, the challenge can start from the ontology and epistemology of the issue being studied. For example, the concept of the “individual” is seen differently by humanistic and positive psychologists due to differing philosophical perspectives:[16] where humanistic psychologists believe the “individual” is a product of existential awareness and social interaction, positive psychologists think the “individual” exists side-by-side with the formation of any human being. Setting off on different pathways, qualitative research regarding an individual’s wellbeing will be concluded with varying validity.

The choice of methodology must enable detection of findings/phenomena in the appropriate context for it to be valid, with due regard to culturally and contextually variable factors. For sampling, procedures and methods must be appropriate for the research paradigm and be distinguished between systematic,[17] purposeful[18] and theoretical (adaptive) sampling,[19,20] where systematic sampling has no a priori theory, purposeful sampling often has a certain aim or framework, and theoretical sampling is molded by the ongoing process of data collection and theory in evolution. For data extraction and analysis, several methods have been adopted to enhance validity, including first-tier triangulation (of researchers) and second-tier triangulation (of resources and theories),[17,21] a well-documented audit trail of materials and processes,[22,23,24] multidimensional analysis as concept- or case-orientated,[25,26] and respondent verification.[21,27]

Reliability

In quantitative research, reliability refers to the exact replicability of processes and results. In qualitative research, with its diverse paradigms, such a definition of reliability is challenging and epistemologically counter-intuitive. Hence, the essence of reliability for qualitative research lies in consistency.[24,28] A margin of variability in results is tolerated in qualitative research provided the methodology and epistemological logistics consistently yield data that are ontologically similar but may differ in richness and ambience within similar dimensions. Silverman[29] proposed five approaches to enhancing the reliability of process and results: refutational analysis, constant data comparison, comprehensive data use, inclusion of the deviant case, and use of tables. As data are extracted from the original sources, researchers must verify their accuracy in terms of form and context with constant comparison,[27] either alone or with peers (a form of triangulation).[30] The scope and analysis of the data included should be as comprehensive and inclusive as possible, with reference to quantitative aspects where feasible.[30] Adopting the Popperian dictum of falsifiability as the essence of truth and science, attempts to refute the qualitative data and analyses should be performed to assess reliability.[31]

Generalizability

Most qualitative research studies, if not all, are meant to study a specific issue or phenomenon in a certain population or ethnic group, of a focused locality in a particular context, hence generalizability of qualitative research findings is usually not an expected attribute. However, with the rising trend of knowledge synthesis from qualitative research via meta-synthesis, meta-narrative or meta-ethnography, evaluation of generalizability becomes pertinent. A pragmatic approach to assessing generalizability for qualitative studies is to adopt the same criteria as for validity: that is, use of systematic sampling, triangulation and constant comparison, proper audit and documentation, and multi-dimensional theory.[17] However, some researchers espouse the approach of analytical generalization,[32] where one judges the extent to which the findings in one study can be generalized to another under a similar theoretical frame, and the proximal similarity model, where the generalizability of one study to another is judged by similarities between the time, place, people and other social contexts.[33] Thus said, Zimmer[34] questioned the suitability of meta-synthesis in view of the basic tenets of grounded theory,[35] phenomenology[36] and ethnography.[37] He concluded that any valid meta-synthesis must retain the other two goals of theory development and higher-level abstraction while in search of generalizability, and must be executed as a third-level interpretation using Gadamer’s concepts of the hermeneutic circle,[38,39] dialogic process[38] and fusion of horizons.[39] Finally, Toye et al.[40] reported the practicality of using “conceptual clarity” and “interpretative rigor” as intuitive criteria for assessing quality in meta-ethnography, which somehow echoes Rolfe’s controversial aesthetic theory of research reports.[41]

Food for Thought

Despite various measures to enhance or ensure the quality of qualitative studies, some researchers have opined from a purist ontological and epistemological angle that qualitative research is not a unified but an ipso facto diverse field,[8] hence any attempt to synthesize or appraise different studies under one system is impossible and conceptually wrong. Barbour argued from a philosophical angle that these special measures or “technical fixes” (like purposive sampling, multiple coding, triangulation, and respondent validation) can never confer the rigor as conceived.[11] In extremis, Rolfe et al. opined, from the field of nursing research, that any set of formal criteria used to judge the quality of qualitative research is futile and without validity, and suggested that any qualitative report should be judged by the form in which it is written (aesthetic) and not by its contents (epistemic).[41] Rolfe’s novel view is rebutted by Porter,[42] who argued via logical premises that two of Rolfe’s fundamental statements were flawed: (i) “the content of research reports is determined by their forms” may not be a fact, and (ii) research appraisal being “subject to individual judgment based on insight and experience” would mean that those without sufficient experience of performing research are unable to judge adequately, hence an elitist principle. From a realism standpoint, Porter then proposes multiple and open approaches for validity in qualitative research that incorporate parallel perspectives[43,44] and diversification of meanings.[44] Any work of qualitative research, when read, is always a two-way interactive process, such that validity and quality have to be judged by the receiving end too, and not by the researcher end alone.

In summary, the three gold criteria of validity, reliability and generalizability apply in principle to assess quality for both quantitative and qualitative research; what differs is the nature and type of processes that ontologically and epistemologically distinguish between the two.

Source of Support: Nil.

Conflict of Interest: None declared.

Research-Methodology

Research validity in surveys relates to the extent to which the survey measures the right elements that need to be measured. In simple terms, validity refers to how well an instrument measures what it is intended to measure.

Reliability alone is not enough; measures need to be reliable as well as valid. For example, if a weight measuring scale is wrong by 4 kg (it deducts 4 kg from the actual weight), it can be described as reliable, because the scale displays the same weight every time we measure a specific item. However, the scale is not valid because it does not display the actual weight of the item.

Research validity can be divided into two groups: internal and external. It can be specified that “internal validity refers to how the research findings match reality, while external validity refers to the extent to which the research findings can be replicated to other environments” (Pelissier, 2008, p. 12).

Moreover, validity can also be divided into five types:

1. Face Validity is the most basic type of validity and is associated with the highest level of subjectivity, because it is not based on any scientific approach. In other words, in this case a test may be specified as valid by a researcher because it seems valid, without an in-depth scientific justification.

Example: a questionnaire designed for a study that analyses the issues of employee performance can be assessed as valid because each individual question seems to address specific and relevant aspects of employee performance.

2. Construct Validity relates to the assessment of the suitability of a measurement tool to measure the phenomenon being studied. Application of construct validity can be effectively facilitated with the involvement of a panel of ‘experts’ closely familiar with the measure and the phenomenon.

Example: with the application of construct validity, the level of leadership competency in any given organisation can be assessed by devising a questionnaire to be answered by operational-level employees, asking questions about the level of their motivation to do their duties on a daily basis.

3. Criterion-Related Validity involves comparison of test results with an outcome. This specific type of validity correlates the results of an assessment with another criterion of assessment.

Example: the nature of customer perception of the brand image of a specific company can be assessed via organising a focus group. The same issue can also be assessed through a questionnaire to be answered by current and potential customers of the brand. The higher the level of correlation between the focus group and questionnaire findings, the higher the level of criterion-related validity.

4. Formative Validity refers to assessment of effectiveness of the measure in terms of providing information that can be used to improve specific aspects of the phenomenon.

Example: when developing initiatives to increase the levels of effectiveness of organisational culture if the measure is able to identify specific weaknesses of organisational culture such as employee-manager communication barriers, then the level of formative validity of the measure can be assessed as adequate.

5. Sampling Validity (similar to content validity) ensures that the area of coverage of the measure within the research area is vast. No measure is able to cover all items and elements within the phenomenon, therefore, important items and elements are selected using a specific pattern of sampling method depending on aims and objectives of the study.

Example: when assessing a leadership style exercised in a specific organisation, assessment of decision-making style would not suffice, and other issues related to leadership style such as organisational culture, personality of leaders, the nature of the industry etc. need to be taken into account as well.

John Dudovskiy

Statistics By Jim

Making statistics intuitive

Reliability vs Validity: Differences & Examples

By Jim Frost

Reliability and validity are criteria by which researchers assess measurement quality. Measuring a person or item involves assigning scores to represent an attribute. This process creates the data that we analyze. However, to provide meaningful research results, that data must be good. And not all data are good!

For data to be good enough to allow you to draw meaningful conclusions from a research study, they must be reliable and valid. What are the properties of good measurements? In a nutshell, reliability relates to the consistency of measures, and validity addresses whether the measurements are quantifying the correct attribute.

In this post, learn about reliability vs. validity, their relationship, and the various ways to assess them.

Learn more about Experimental Design: Definition, Types, and Examples.

Reliability

Reliability refers to the consistency of the measure. High reliability indicates that the measurement system produces similar results under the same conditions. If you measure the same item or person multiple times, you want to obtain comparable values. They are reproducible.

If you take measurements multiple times and obtain very different values, your data are unreliable. Numbers are meaningless if repeated measures do not produce similar values. What’s the correct value? No one knows! This inconsistency hampers your ability to draw conclusions and understand relationships.

Suppose you have a bathroom scale that displays very inconsistent results from one time to the next. It’s very unreliable. It would be hard to use your scale to determine your correct weight and to know whether you are losing weight.

Inadequate data collection procedures and low-quality or defective data collection tools can produce unreliable data. Additionally, some characteristics are more challenging to measure reliably. For example, the length of an object is concrete. On the other hand, psychological constructs, such as conscientiousness, depression, and self-esteem, can be trickier to measure reliably.

When assessing studies, evaluate data collection methodologies and consider whether any issues undermine their reliability.

Validity

Validity refers to whether the measurements reflect what they’re supposed to measure. This concept is a broader issue than reliability. Researchers need to consider whether they’re measuring what they think they’re measuring. Or do the measurements reflect something else? Does the instrument measure what it says it measures? It’s a question that addresses the appropriateness of the data rather than whether measurements are repeatable.

Validity is a smaller concern for tangible measurements like height and weight. You might have a biased bathroom scale if it tends to read too high or too low—but it still measures weight. Validity is a bigger concern in the social sciences, where you can measure elusive concepts such as positive outlook and self-esteem. If you’re assessing the psychological construct of conscientiousness, you need to confirm that the instrument poses questions that appraise this attribute rather than, say, obedience.

Reliability vs Validity

A measurement must be reliable first before it has a chance of being valid. After all, if you don’t obtain consistent measurements for the same object or person under similar conditions, it can’t be valid. If your scale displays a different weight every time you step on it, it’s unreliable, and it is also invalid.

So, having reliable measurements is the first step towards having valid measures. Reliability is necessary for validity, but it is not sufficient by itself.

Suppose you have a reliable measurement. You step on your scale a few times in a short period, and it displays very similar weights. It’s reliable. But the weight might be incorrect.

Just because you can measure the same object multiple times and get consistent values, it does not necessarily indicate that the measurements reflect the desired characteristic.

How can you determine whether measurements are both valid and reliable? Assessing reliability vs. validity is the topic for the rest of this post!

Reliability | Validity
Similar measurements for the same person/item under the same conditions. | Measurements reflect what they’re supposed to measure.
Stability of results across time, between observers, and within the test. | Measures have appropriate relationships to theories, similar measures, and different measures.
Unreliable measurements typically cannot be valid. | Valid measurements are also reliable.

How to Assess Reliability

Reliability relates to measurement consistency. To evaluate reliability, analysts assess consistency over time, within the measurement instrument, and between different observers. These types of consistency are also known as test-retest, internal, and inter-rater reliability. Typically, appraising these forms of reliability involves taking multiple measures of the same person, object, or construct and assessing scatterplots and correlations of the measurements. Reliable measurements have high correlations because the scores are similar.

Test-Retest Reliability

Analysts often assume that measurements should be consistent across a short time. If you measure your height twice over a couple of days, you should obtain roughly the same measurements.

To assess test-retest reliability, the experimenters typically measure a group of participants on two occasions within a few days. Usually, you’ll evaluate the reliability of the repeated measures using scatterplots and correlation coefficients. You expect to see high correlations and tight lines on the scatterplot when the characteristic you measure is consistent over a short period, and you have a reliable measurement system.
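
A minimal sketch of that workflow, with hypothetical scores from two occasions (the data are invented; any plotting library would do):

```python
import matplotlib.pyplot as plt
import numpy as np

occasion1 = np.array([98, 102, 95, 110, 105, 99, 103, 108])
occasion2 = np.array([97, 104, 96, 108, 106, 100, 101, 110])

r = np.corrcoef(occasion1, occasion2)[0, 1]  # high r suggests good test-retest reliability

plt.scatter(occasion1, occasion2)  # a tight diagonal pattern indicates consistent scores
plt.xlabel("Score, occasion 1")
plt.ylabel("Score, occasion 2")
plt.title(f"Test-retest reliability (r = {r:.2f})")
plt.show()
```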

This type of reliability establishes the degree to which a test can produce stable, consistent scores across time. However, in practice, measurement instruments are never entirely consistent.

Keep in mind that some characteristics should not be consistent across time. A good example is your mood, which can change from moment to moment. A test-retest assessment of mood is not likely to produce a high correlation even though it might be a useful measurement instrument.

Internal Reliability

This type of reliability assesses consistency across items within a single instrument. Researchers evaluate internal reliability when they’re using instruments such as a survey or personality inventories. In these instruments, multiple items relate to a single construct. Questions that measure the same characteristic should have a high correlation. People who indicate they are risk-takers should also note that they participate in dangerous activities. If items that supposedly measure the same underlying construct have a low correlation, they are not consistent with each other and might not measure the same thing.

Inter-Rater Reliability

This type of reliability assesses consistency across different observers, judges, or evaluators. When various observers produce similar measurements for the same item or person, their scores are highly correlated. Inter-rater reliability is essential when the subjectivity or skill of the evaluator plays a role. For example, assessing the quality of a writing sample involves subjectivity. Researchers can employ rating guidelines to reduce subjectivity. Comparing the scores from different evaluators for the same writing sample helps establish the measure’s reliability. Learn more about inter-rater reliability.

Related post: Interpreting Correlation

Cronbach’s Alpha

Cronbach’s alpha measures the internal consistency, or reliability, of a set of survey items. Use this statistic to help determine whether a collection of items consistently measures the same characteristic. Learn more about Cronbach’s Alpha.
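
For reference, a minimal sketch using the third-party pingouin package, one of several Python libraries that implement Cronbach’s alpha (the survey data here are hypothetical):

```python
import pandas as pd
import pingouin as pg  # third-party package: pip install pingouin

# Hypothetical wide-format survey data: rows = respondents, columns = items
df = pd.DataFrame({
    "q1": [4, 2, 5, 3, 4, 1],
    "q2": [5, 2, 5, 3, 4, 2],
    "q3": [4, 3, 4, 3, 5, 1],
})

alpha, ci = pg.cronbach_alpha(data=df)  # returns the estimate and a 95% confidence interval
print(f"Cronbach's alpha = {alpha:.2f}, 95% CI = {ci}")
```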

Gage R&R Studies

These studies evaluate a measurement system’s reliability and identify sources of variation that can help you target improvement efforts effectively. Learn more about Gage R&R Studies.

How to Assess Validity

Validity is more difficult to evaluate than reliability. After all, with reliability, you only assess whether the measures are consistent across time, within the instrument, and between observers. On the other hand, evaluating validity involves determining whether the instrument measures the correct characteristic. This process frequently requires examining relationships between these measurements, other data, and theory. Validating a measurement instrument requires you to use a wide range of subject-area knowledge and different types of constructs to determine whether the measurements from your instrument fit in with the bigger picture!

An instrument with high validity produces measurements that correctly fit the larger picture with other constructs. Validity assesses whether the web of empirical relationships aligns with the theoretical relationships.

The measurements must have a positive relationship with other measures of the same construct. Additionally, they need to correlate in the correct direction (positively or negatively) with the theoretically correct constructs. Finally, the measures should have no relationship with unrelated constructs.

If you need more detailed information, read my post that focuses on Measurement Validity. In that post, I cover the various types, how to evaluate them, and provide examples.

Experimental validity relates to experimental designs and methods. To learn about that topic, read my post about Internal and External Validity.

Whew, that’s a lot of information about reliability vs. validity. Using these concepts, you can determine whether a measurement instrument produces good data!

Research Method

Reliability Vs Validity

Reliability and validity are two important concepts in research that are used to evaluate the quality of measurement instruments or research studies.

Reliability

Reliability refers to the degree to which a measurement instrument or research study produces consistent and stable results over time, across different observers or raters, or under different conditions.

In other words, reliability is the extent to which a measurement instrument or research study produces results that are free from random error. A reliable measurement instrument or research study should produce similar results each time it is used or conducted, regardless of who is using it or conducting it.

Validity

Validity, on the other hand, refers to the degree to which a measurement instrument or research study accurately measures what it is supposed to measure or tests what it is supposed to test.

In other words, validity is the extent to which a measurement instrument or research study measures or tests what it claims to measure or test. A valid measurement instrument or research study should produce results that accurately reflect the concept or construct being measured or tested.

Difference Between Reliability Vs Validity

Here’s a comparison table that highlights the differences between reliability and validity:

Reliability | Validity
The degree to which a measurement instrument or research study produces consistent and stable results over time, across different observers or raters, or under different conditions. | The degree to which a measurement instrument or research study accurately measures what it is supposed to measure or tests what it is supposed to test.
Consistency and stability of results | Accuracy and truthfulness of results
Test-retest reliability, inter-rater reliability, internal consistency reliability | Content validity, criterion validity, construct validity
Degree of agreement or correlation between repeated measures or observers | Degree of association between a measure and an external criterion, or degree to which a measure assesses the intended construct
A bathroom scale that consistently provides the same weight measurement when used multiple times in a row | A math test that measures only the math skills it is intended to test and not other factors, such as test-taking anxiety or language ability.

Internal Validity vs. External Validity in Research

What they tell us about the meaningfulness and trustworthiness of research

Arlin Cuncic, MA, is the author of The Anxiety Workbook and founder of the website About Social Anxiety. She has a Master's degree in clinical psychology.

Rachel Goldman, PhD FTOS, is a licensed psychologist, clinical assistant professor, speaker, wellness expert specializing in eating behaviors, stress management, and health behavior change.

How do you determine whether a psychology study is trustworthy and meaningful? Two characteristics that can help you assess research findings are internal and external validity.

  • Internal validity measures how well a study is conducted (its structure) and how accurately its results reflect the studied group.
  • External validity relates to how applicable the findings are in the real world.

These two concepts help researchers gauge if the results of a research study are trustworthy and meaningful.

Internal Validity | External Validity
Conclusions are warranted | Findings can be generalized
Controls extraneous variables | Outcomes apply to practical situations
Eliminates alternative explanations | Results apply to the world at large
Focus on accuracy and strong research methods | Results can be translated into another context

What Is Internal Validity in Research?

Internal validity is the extent to which a research study establishes a trustworthy cause-and-effect relationship. This type of validity depends largely on the study's procedures and how rigorously it is performed.

Internal validity is important because once established, it makes it possible to eliminate alternative explanations for a finding. If you implement a smoking cessation program, for instance, internal validity ensures that any improvement in the subjects is due to the treatment administered and not something else.

Internal validity is not a "yes or no" concept. Instead, we consider how confident we can be with study findings based on whether the research avoids traps that may make those findings questionable. The less chance there is for "confounding," the higher the internal validity and the more confident we can be.

Confounding refers to uncontrollable variables that come into play and can confuse the outcome of a study, making us unsure of whether we can trust that we have identified the cause-and-effect relationship.

In short, you can only be confident that a study is internally valid if you can rule out alternative explanations for the findings. Three criteria are required to assume cause and effect in a research study:

  • The cause preceded the effect in terms of time.
  • The cause and effect vary together.
  • There are no other likely explanations for the relationship observed.

Factors That Improve Internal Validity

To ensure the internal validity of a study, you want to consider aspects of the research design that will increase the likelihood that you can reject alternative hypotheses. Many factors can improve internal validity in research, including:

  • Blinding: Participants—and sometimes researchers—are unaware of what intervention they are receiving (such as using a placebo on some subjects in a medication study) to avoid having this knowledge bias their perceptions and behaviors, thus impacting the study's outcome
  • Experimental manipulation: Manipulating an independent variable in a study (for instance, giving smokers a cessation program) instead of just observing an association without conducting any intervention (examining the relationship between exercise and smoking behavior)
  • Random selection: Choosing participants at random or in a manner in which they are representative of the population that you wish to study
  • Randomization or random assignment: Randomly assigning participants to treatment and control groups, ensuring that there is no systematic bias between the research groups (see the sketch after this list)
  • Strict study protocol: Following specific procedures during the study so as not to introduce any unintended effects; for example, doing things differently with one group of study participants than you do with another group
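As a concrete illustration of the randomization point above, here is a minimal Python sketch of random assignment; the helper name and participant IDs are hypothetical, not taken from any cited study.

```python
import random

def randomly_assign(participants, seed=None):
    """Shuffle participants, then split them into two halves so that
    neither group is systematically different from the other."""
    rng = random.Random(seed)      # seeded for reproducibility
    shuffled = list(participants)  # copy so the input list is untouched
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]

treatment, control = randomly_assign(["P01", "P02", "P03", "P04", "P05", "P06"], seed=42)
print("Treatment:", treatment)
print("Control:", control)
```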

Internal Validity Threats

Just as there are many ways to ensure internal validity, a list of potential threats should be considered when planning a study.

  • Attrition: Participants dropping out or leaving a study, which means that the results are based on a biased sample of only the people who did not choose to leave (and who possibly all have something in common, such as higher motivation)
  • Confounding: A situation in which changes in an outcome variable can be thought to have resulted from some type of outside variable not measured or manipulated in the study
  • Diffusion: The results of one group transferring to another through the groups interacting, talking with, or observing one another; this can also lead to another issue called resentful demoralization, in which a control group tries less hard because its members resent the group they have been assigned to
  • Experimenter bias: An experimenter behaving in a different way with different groups in a study, which can impact the results (and is eliminated through blinding)
  • Historical events: Events such as a change in political leadership or a natural disaster may influence the outcome of studies that occur over a period of time by changing how study participants feel and act
  • Instrumentation: "Priming" participants in a study in certain ways with the measures used, causing them to react in a way that is different than they would have otherwise reacted
  • Maturation: The impact of time as a variable in a study; for example, if a study takes place over a period in which participants may naturally change in some way (i.e., they grow older or become tired), it may be impossible to rule out whether effects seen in the study were simply due to the passage of time
  • Statistical regression: The tendency of participants who score at the extreme ends of a measure to score closer to the mean when tested again, a natural effect that can be mistaken for a direct effect of an intervention
  • Testing: Repeatedly testing participants using the same measures influences outcomes; for example, if you give someone the same test three times, they are likely to do better as they learn the test or become used to the testing process, causing them to answer differently

What Is External Validity in Research?

External validity refers to how well the outcome of a research study can be expected to apply to other settings. This is important because, if external validity is established, it means that the findings can be generalizable to similar individuals or populations.

External validity affirmatively answers the question: Do the findings apply to similar people, settings, situations, and time periods?

Population validity and ecological validity are two types of external validity. Population validity refers to whether you can generalize the research outcomes to other populations or groups. Ecological validity refers to whether a study's findings can be generalized to additional situations or settings.

Another term, transferability, refers to whether results transfer to situations with similar characteristics. Transferability relates to external validity and is typically used in qualitative research designs.

Factors That Improve External Validity

If you want to improve the external validity of your study, there are many ways to achieve this goal. Factors that can enhance external validity include:

  • Field experiments: Conducting a study outside the laboratory, in a natural setting
  • Inclusion and exclusion criteria: Setting criteria as to who can be involved in the research, ensuring that the population being studied is clearly defined
  • Psychological realism: Making sure participants experience the events of the study as real, for example by telling them a "cover story" about the aim of the study; otherwise, knowing what to expect or knowing the study's goal may lead them to behave differently than they would in real life
  • Replication: Conducting the study again with different samples or in different settings to see if you get the same results; when many studies have been conducted on the same topic, a meta-analysis can also be used to determine whether the effect of an independent variable is replicable, and therefore more reliable
  • Reprocessing or calibration: Using statistical methods to adjust for external validity issues, such as reweighting groups if a study had uneven groups for a particular characteristic (such as age)

External Validity Threats

External validity is threatened when a study does not take into account the interaction of variables in the real world. Threats to external validity include:

  • Pre- and post-test effects: When the pre- or post-test is in some way related to the effect seen in the study, such that the cause-and-effect relationship disappears without these added tests
  • Sample features: When some feature of the sample used was responsible for the effect (or partially responsible), leading to limited generalizability of the findings
  • Selection bias: Also considered a threat to internal validity, selection bias describes differences between groups in a study that may relate to the independent variable—like motivation or willingness to take part in the study, or specific demographics of individuals being more likely to take part in an online survey
  • Situational factors: Factors such as the time of day of the study, its location, noise, researcher characteristics, and the number of measures used may affect the generalizability of findings

While rigorous research methods can ensure internal validity, external validity may be limited by these methods.

Internal Validity vs. External Validity

Internal validity and external validity are two research concepts that share a few similarities while also having several differences.

Similarities

One of the similarities between internal validity and external validity is that both factors should be considered when designing a study. This is because both have implications in terms of whether the results of a study have meaning.

Neither internal validity nor external validity is an "either/or" concept. Therefore, you always need to decide to what degree a study performs in terms of each type of validity.

Each of these concepts is also typically reported in research articles published in scholarly journals . This is so that other researchers can evaluate the study and make decisions about whether the results are useful and valid.

Differences

The essential difference between internal validity and external validity is that internal validity refers to the structure of a study (and its variables) while external validity refers to the universality of the results. But there are further differences between the two as well.

For instance, internal validity focuses on showing that a difference is due to the independent variable alone, whereas external validity focuses on showing that the results can be translated to the world at large.

Internal validity and external validity aren't mutually exclusive. You can have a study with good internal validity that is nevertheless irrelevant to the real world. You could also conduct a field study that is highly relevant to the real world but doesn't have trustworthy results in terms of knowing what variables caused the outcomes.

Examples of Validity

Perhaps the best way to understand internal validity and external validity is with examples.

Internal Validity Example

An example of a study with good internal validity would be if a researcher hypothesizes that using a particular mindfulness app will reduce negative mood. To test this hypothesis, the researcher randomly assigns a sample of participants to one of two groups: those who will use the app over a defined period and those who engage in a control task.

The researcher ensures that there is no systematic bias in how participants are assigned to the groups, and blinds the research assistants so they don't know which groups the subjects are in during the experiment.

A strict study protocol is also used to outline the procedures of the study. Potential confounding variables are measured along with mood, such as the participants' socioeconomic status, gender, age, and other factors. If participants drop out of the study, their characteristics are examined to make sure there is no systematic bias in terms of who stays in.

External Validity Example

An example of a study with good external validity would be if, in the above example, the participants used the mindfulness app at home rather than in the laboratory. This shows that results appear in a real-world setting.

To further ensure external validity, the researcher clearly defines the population of interest and chooses a representative sample . They might also replicate the study's results using different technological devices.

Setting up an experiment so that it has both sound internal validity and external validity involves being mindful from the start about factors that can influence each aspect of your research.

It's best to spend extra time designing a structurally sound study that has far-reaching implications rather than to quickly rush through the design phase only to discover problems later on. Only when both internal validity and external validity are high can strong conclusions be made about your results.


By Arlin Cuncic, MA. Arlin Cuncic is the author of The Anxiety Workbook and founder of the website About Social Anxiety. She has a Master's degree in clinical psychology.

Validity, Accuracy and Reliability Explained with Examples

This is part of the Working Scientifically skills in the NSW HSC science curriculum.

Part 1 – Validity

Part 2 – Accuracy

Part 3 – Reliability

Science experiments are an essential part of high school education, helping students understand key concepts and develop critical thinking skills. However, the value of an experiment lies in its validity, accuracy, and reliability. Let's break down these terms and explore how they can be improved and reduced, using simple experiments as examples.

Target Analogy to Understand Accuracy and Reliability

The target analogy is a classic way to understand the concepts of accuracy and reliability in scientific measurements and experiments. 

research validity meaning

Accuracy refers to how close a measurement is to the true or accepted value. In the analogy, it's how close the arrows come to hitting the bullseye (which represents the true or accepted value).

Reliability  refers to the consistency of a set of measurements. Reliable data can be reproduced under the same conditions. In the analogy, it's represented by how tightly the arrows are grouped together, regardless of whether they hit the bullseye. Therefore, we can have scientific results that are reliable but inaccurate.

Validity refers to how well an experiment investigates the aim or tests the underlying hypothesis. While validity is not represented in the target analogy, the validity of an experiment can sometimes be assessed by using the accuracy of results as a proxy: experiments that produce accurate results are likely to be valid, as invalid experiments usually do not yield accurate results.

Validity

Validity refers to how well an experiment measures what it is supposed to measure and investigates the aim.

Ask yourself the questions:

  • "Is my experimental method and design suitable?"
  • "Is my experiment testing or investigating what it's supposed to?"


For example, if you're investigating the effect of the volume of water (independent variable) on plant growth, your experiment would be valid if you measure growth factors like height or leaf size (these would be your dependent variables).

However, validity entails more than just what's being measured. When assessing validity, you should also examine how well the experimental methodology investigates the aim of the experiment.

Assessing Validity

An experiment's procedure, the subsequent methods of analysis of the data, the data itself, and the conclusion you draw from the data all have their own associated validities. It is important to understand this division because there are different factors to consider when assessing the validity of any single one of them. The validity of an experiment as a whole depends on the individual validities of these components.

When assessing the validity of the procedure , consider the following:

  • Does the procedure control all necessary variables except for the dependent and independent variables? That is, have you isolated the effect of the independent variable on the dependent variable?
  • Does this effect you have isolated actually address the aim and/or hypothesis?
  • Does your method include enough repetitions for a reliable result? (Read more about reliability below)

When assessing the validity of the method of analysis of the data , consider the following:

  • Does the analysis extrapolate or interpolate the experimental data? Generally, interpolation is valid, but extrapolation is invalid. This is because by extrapolating you are 'peering out into the darkness': just because your data showed a certain trend over a certain range does not mean that the trend holds outside that range.
  • Does the analysis use accepted laws and mathematical relationships? That is, do the equations used for analysis have a scientific or mathematical basis? For example, `F = ma` is an accepted law in physics, but if in the analysis you made up a relationship like `F = ma^2` that has no scientific or mathematical backing, the method of analysis is invalid.
  • Is the most appropriate method of analysis used? Consider the differences between using a table and a graph. In a graph, you can use the gradient to minimise the effects of systematic errors and can also reduce the effect of random errors. The visual nature of a graph also allows you to easily identify outliers and potentially exclude them from analysis. This is why graphical analysis is generally more valid than using values from tables.

When assessing the validity of your results , consider the following: 

  • Is your primary data (data you collected from your own experiment) BOTH accurate and reliable? If not, it is invalid.
  • Are the secondary sources you may have used BOTH reliable and accurate?

When assessing the validity of your conclusion , consider the following:

  • Does your conclusion relate directly to the aim or the hypothesis?

How to Improve Validity

Ways of improving validity will differ across experiments. You must first identify which area(s) of the experiment's validity are lacking (is it the procedure, analysis, results, or conclusion?). Then, you must come up with ways of overcoming the particular weakness.

Below are some examples of this.

Example – Validity in Chemistry Experiment 

Let's say we want to measure the mass of carbon dioxide in a can of soft drink.

Heating a can of soft drink

The following steps are followed:

  • Weigh an unopened can of soft drink on an electronic balance.
  • Open the can.
  • Place the can on a hot plate until it begins to boil.
  • When cool, re-weigh the can to determine the mass loss.

To ensure this experiment is valid, we must establish controlled variables:

  • type of soft drink used
  • temperature at which this experiment is conducted
  • period of time before soft drink is re-weighed

Despite these controlled variables, this experiment is invalid because it actually doesn't help us measure the mass of carbon dioxide in the soft drink. This is because by heating the soft drink until it boils, we are also losing water due to evaporation. As a result, the mass loss measured is not only due to the loss of carbon dioxide, but also water. A simple way to improve the validity of this experiment is to not heat it; by simply opening the can of soft drink, carbon dioxide in the can will escape without loss of water.

Example – Validity in Physics Experiment

Let's say we want to measure the value of gravitational acceleration `g` using a simple pendulum system, and the following equation:

$$T = 2\pi \sqrt{\frac{l}{g}}$$

  • `T` is the period of oscillation
  • `l` is the length of string attached to the mass
  • `g` is the acceleration due to gravity
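Squaring both sides and rearranging shows how `g` is recovered from the measured quantities, and why plotting `T^2` against `l` gives a straight line:

$$T^2 = \frac{4\pi^2}{g}\, l \qquad \Rightarrow \qquad g = \frac{4\pi^2 l}{T^2}$$

A graph of `T^2` versus `l` should therefore be linear through the origin with gradient `4\pi^2/g`, which underpins the graphical analysis discussed later in this section.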

Pendulum practical

  • Cut a piece of string or dental floss so that it is 1.0 m long.
  • Attach a 500.0 g mass of high density to the end of the string.
  • Attach the other end of the string to the retort stand using a clamp.
  • Starting at an angle of less than 10º, allow the pendulum to swing and measure the pendulum’s period for 10 oscillations using a stopwatch.
  • Repeat the experiment with 1.2 m, 1.5 m and 1.8 m strings.

The controlled variables we must establish in this experiment include:

  • mass used in the pendulum
  • location at which the experiment is conducted

The validity of this experiment depends on the starting angle of oscillation. The above equation (method of analysis) is only true for small angles (`\theta < 15^{\circ}`), such that `\sin \theta \approx \theta`. We also want to make sure the pendulum system has a small enough surface area to minimise the effect of air resistance on its oscillation.


In this instance, it would be invalid to use a single pair of values (length and period) to calculate the value of gravitational acceleration. A more appropriate method of analysis would be to plot period squared against length to obtain a linear relationship, then use the gradient of the line of best fit to determine the value of `g`.
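As a minimal Python sketch of this gradient method (the length and period values below are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical measurements: string lengths (m) and average periods (s)
lengths = np.array([1.0, 1.2, 1.5, 1.8])
periods = np.array([2.01, 2.20, 2.46, 2.69])

# T^2 = (4*pi^2/g) * l, so the gradient of T^2 vs l equals 4*pi^2/g
gradient, intercept = np.polyfit(lengths, periods**2, 1)
g = 4 * np.pi**2 / gradient
print(f"g = {g:.2f} m/s^2")
```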

Accuracy

Accuracy refers to how close the experimental measurements are to the true value.

Accuracy depends on

  • the validity of the experiment
  • the degree of error:
  • systematic errors are those that are systemic in your experiment. That is, they affect every single one of your data points consistently, meaning that the cause of the error is always present. For example, it could be a badly calibrated temperature gauge that reports every reading 5 °C above the true value.
  • random errors are errors that occur inconsistently. For example, the temperature gauge readings might be affected by random fluctuations in room temperature. Some readings might be above the true value, some might then be below the true value.
  • sensitivity of equipment used.

Assessing Accuracy 

The effect of errors and insensitive equipment can both be captured by calculating the percentage error:

$$\%\ \text{error} = \frac{\left|\text{experimental value} - \text{true value}\right|}{\text{true value}} \times 100\%$$

Generally, measurements are considered accurate when the percentage error is less than 5%. You should always take the context of the experiment into account when assessing accuracy.
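The calculation is simple enough to script; here is a minimal Python sketch (the measured and accepted values are hypothetical):

```python
def percentage_error(experimental, true_value):
    """Percentage error between an experimental and an accepted value."""
    return abs(experimental - true_value) / true_value * 100

# Hypothetical pendulum period: measured 2.05 s against an accepted 2.01 s
error = percentage_error(2.05, 2.01)
verdict = "accurate" if error < 5 else "inaccurate"
print(f"{error:.1f}% error -> {verdict} by the 5% rule of thumb")
```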

While accuracy and validity have different definitions, the two are closely related. Accurate results often suggest that the underlying experiment is valid, as invalid experiments are unlikely to produce accurate results.

In a simple pendulum experiment, if your measurements of the pendulum's period are close to the calculated value, your experiment is accurate. A table showing sample experimental measurements vs accepted values from using the equation above is shown below. 

[Table: sample measured pendulum periods vs. accepted (theoretical) values for each string length]

All experimental values in the table above are within 5% of the accepted (theoretical) values, and they are therefore considered accurate.

How to Improve Accuracy

  • Remove systematic errors: for example, if the experiment's measuring instruments are poorly calibrated, then you should correctly calibrate them before doing the experiment again.
  • Reduce the influence of random errors: this can be done by having more repetitions in the experiment and reporting the average values. If you have enough of these random errors – some above the true value and some below it – then averaging will make them cancel each other out. This brings your average value closer and closer to the true value.
  • Use more sensitive equipment: for example, use a recording to measure time by analysing the motion of an object frame by frame, instead of using a stopwatch. The sensitivity of a piece of equipment can be measured by its limit of reading. For example, stopwatches may only measure to the nearest millisecond – that is their limit of reading – but recordings can be analysed frame by frame and, depending on the frame rate of the camera, this could mean measuring to the nearest microsecond.
  • Obtain more measurements over a wider range: in some cases, the relationship between two variables can be more accurately determined by testing over a wider range. For example, in the pendulum experiment, periods can be measured using strings of various lengths. In this instance, repeating the experiment does not relate to reliability because we have changed the value of the independent variable tested.

Reliability

Reliability involves the consistency of your results over multiple trials.

Assessing Reliability

The reliability of an experiment can be broken down into the reliability of the procedure and the reliability of the final results.

The reliability of the procedure refers to how consistently the steps of your experiment produce similar results. For example, if an experiment produces the same values every time it is repeated, then it is highly reliable. This can be assessed quantitatively by looking at the spread of measurements, using statistics such as the greatest deviation from the mean, the standard deviation, or z-scores.
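A minimal Python sketch of these spread checks, using three hypothetical repeated measurements:

```python
import numpy as np

trials = np.array([0.52, 0.49, 0.51])  # hypothetical mass losses (g) over three trials

mean = trials.mean()
std_dev = trials.std(ddof=1)                      # sample standard deviation
greatest_deviation = np.abs(trials - mean).max()  # greatest deviation from the mean
z_scores = (trials - mean) / std_dev

print(f"mean = {mean:.3f} g, s = {std_dev:.3f} g, max deviation = {greatest_deviation:.3f} g")
print("z-scores:", np.round(z_scores, 2))
```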

Ask yourself: "Is my result reproducible?"

The reliability of results cannot be assessed if there is only one data point or measurement obtained in the experiment; there must be at least three. When you're repeating the experiment to assess the reliability of its results, you must follow the same steps and use the same value for the independent variable. Results obtained from methods with different steps cannot be assessed for their reliability.

Obtaining only one measurement in an experiment is not enough because it could be affected by errors and have been produced due to pure chance. Repeating the experiment and obtaining the same or similar results will increase your confidence that the results are reproducible (therefore reliable).

In the soft drink experiment, reliability can be assessed by repeating the steps at least three times:

[Table: mass loss measured across three repeated trials]

The mass losses measured in all three trials are fairly consistent, suggesting that the reliability of the underlying method is high.

The reliability of the final results refers to how consistently your final data points (e.g. the average values of repeated trials) point towards the same trend. That is, how close are they all to the trend line? This can be assessed quantitatively using the `R^2` value, which ranges between 0 and 1: a value of 0 suggests there is no correlation between data points, and a value of 1 suggests a perfect correlation with no variance from the trend line.

In the pendulum experiment, we can calculate the `R^2` value (done in Excel) by using the final average period values measured for each pendulum length.
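The same calculation can be scripted rather than done in Excel; here is a minimal sketch with hypothetical averages:

```python
from scipy.stats import linregress

lengths = [1.0, 1.2, 1.5, 1.8]              # m (hypothetical)
periods_squared = [4.04, 4.84, 6.05, 7.24]  # s^2, average T^2 for each length

fit = linregress(lengths, periods_squared)
print(f"R^2 = {fit.rvalue**2:.4f}")
```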

[Graph: average period squared vs. pendulum length with a linear trend line]

Here, an `R^2` value of 0.9758 suggests the four average values are fairly close to the overall linear trend line (low variance from the trend line). Thus, the results are fairly reliable.

How to Improve Reliability

A common misconception is that increasing the number of trials increases the reliability of the procedure . This is not true. The only way to increase the reliability of the procedure is to revise it. This could mean using instruments that are less susceptible to random errors, which cause measurements to be more variable.

Increasing the number of trials actually increases the reliability of the final results . This is because having more repetitions reduces the influence of random errors and brings the average values closer to the true values. Generally, the closer experimental values are to true values, the closer they are to the true trend. That is, accurate data points are generally reliable and all point towards the same trend.

Reliable but Inaccurate / Invalid

It is important to understand that results from an experiment can be reliable (consistent), yet inaccurate (deviating greatly from theoretical values) and/or invalid. In this case, your procedure is reliable, but your final results likely are not.

Examples of Reliability

Using the soft drink example again, if the mass losses measured for three soft drinks (same brand and type of drink) are consistent, then it's reliable. 

Using the pendulum example again, if you get similar period measurements every time you repeat the experiment, it’s reliable.  

However, in both cases, if the underlying methods are invalid, the consistent results would be invalid and inaccurate (despite being reliable).

  • Open access
  • Published: 08 July 2024

Evaluation of cross-cultural adaptation and validation of the Persian version of the critical thinking disposition scale: methodological study

  • Hossein Bakhtiari-Dovvombaygi 1 , 2 ,
  • Kosar Pourhasan 1 ,
  • Zahra Rahmaty 3 ,
  • Akbar Zare-Kaseb 1 ,
  • Abbas Abbaszadeh 2 ,
  • Amirreza Rashtbarzadeh 1 &
  • Fariba Borhani 2 , 4  

BMC Nursing volume 23, Article number: 463 (2024)


Introduction

Assessing critical thinking disposition is crucial in nursing education to foster analytical skills essential for effective healthcare practice. This study aimed to evaluate the cross-cultural adaptation and validation of the Persian version of the Critical Thinking Disposition Scale among Iranian nursing students.

A total of 390 nursing students (mean age = 21.74 (2.1) years; 64% female) participated in the study. Face and content validity were established through feedback from nursing students and expert specialists, respectively. Construct validity was assessed using exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). The EFA was used to explore the number of factors and the items loading on them. The CFA was used to confirm the findings of the EFA on the same sample. Convergent and discriminant validity were examined, along with reliability through internal consistency and test-retest reliability.

EFA revealed a two-factor structure, comprising “Critical Openness” and “Reflective Skepticism,” explaining 55% of the total variance. CFA confirmed the model’s fit (χ² = 117.37, df = 43, χ²/df = 2.73, p  < 0.001; RMSEA = 0.067; CFI = 0.95; TLI = 0.93, SRMR = 0.041). Convergent and discriminant validity were supported, with significant factor loadings ( p  < 0.001) ranging from 0.61 to 0.77. The CTDS exhibited strong internal consistency (α = 0.87) and excellent test-retest reliability (ICC = 0.96).

The validation of the CTDS in Persian language settings provides a reliable tool for assessing critical thinking disposition among Iranian nursing students. The two-factor structure aligns with previous research, reflecting students’ propensity towards critical openness and reflective skepticism. The study’s findings underscore the importance of nurturing critical thinking skills in nursing education.


Critical thinking can be seen as the ability to think logically, dynamically, comprehensively and practically when judging a situation to investigate and make appropriate decisions [ 1 , 2 ]. This ability helps to gain insight and examine an idea or concept from different perspectives [ 3 ]. Critical thinking has become an educational ideal, with most policy makers and educationists calling for the development of critical attitudes in students [ 4 ]. Critical thinking has been identified as one of the most important outcomes of higher education courses [ 5 ].

There is increasing evidence that critical thinking is an important part of the work of preregistered nursing students and registered nurses in various clinical practice settings [ 6 , 7 ]. Critical thinking is one of the basic skills that prepares nursing students to effectively manage patient problems, make the best clinical decisions, provide safe and high-quality care, and better control critical situations. Conversely, negative outcomes such as depression, failure to solve patient problems, and incomplete clinical reasoning can result from poor critical thinking [ 8 , 9 ].

Critical thinking is expected of nursing program graduates at the international level [ 10 , 11 ]. Therefore, it is important to evaluate and measure the levels of critical thinking of nursing students at different stages of education so that educators can adjust learning activities to ensure the desired results [ 2 , 12 , 13 ]. Educators are the ones who are responsible for, and have the opportunity to, shape this skill during the years of educating and training new generations [ 14 ]. Despite the importance of critical thinking in the nursing profession, studies have reported a lack of critical thinking skills among undergraduate students in the field [ 15 , 16 ].

Critical thinking has two main components: critical thinking skills and critical thinking disposition (CTD). The skills component refers to the cognitive processes of thinking, while the disposition component refers to personal desire and internal motivation for critical thinking [ 9 ]. Several studies have highlighted the need for reliable assessment tools for critical thinking, specifically in nursing, rather than in a general context [ 17 , 18 , 19 ].

To our knowledge, and based on our literature review, the CTDS is the only tool specifically designed for assessing the tendency to think critically. However, this tool has not been used or validated in the Iranian educational context, population, and language. Considering the lack of effective tools for evaluating CTD in undergraduate nursing programs in Iran, the purpose of this study was to translate and evaluate the psychometric properties of the Persian version of the CTDS among nursing students.

Study design

This was a cross-sectional study utilizing cross-cultural adaptation to translate the CTDS and investigate its validity and reliability for use among Iranian nursing students [ 20 ]. The translated scale was then examined through reliability and validity tests.

Study population and sampling

Convenience sampling was employed at the School of Nursing, Shahid Beheshti University in Tehran. This method involved selecting participants who were readily available and willing to take part in the study. Specifically, the study targeted all undergraduate nursing students, who were invited to participate in the research. Recruitment continued until the desired sample size was achieved. To maintain the integrity of the data, students who submitted incomplete questionnaires were excluded from the analysis. Undergraduate studies in Iran normally involve 4 years of education in general nursing, as well as clinical rotations in all hospital units and public health sectors.

There are two general recommendations concerning the minimum sample size necessary for conducting factorial analysis. The first recommendation emphasizes the significance of the absolute number of cases (N), while the second recommendation highlights the importance of the subject-to-variable ratio (p). Guilford suggested that N should be no less than 200 [ 21 ]. Additionally, MacCallum et al. recommended that the subject-to-variable ratio should be at least 5 [ 22 ]. A total of 390 nursing students voluntarily participated in the study.

Measurements

The CTDS, developed by Sosu, is an instrument used to measure the dispositional dimension of critical thinking [ 23 ]. Self-report questionnaires were given to the students. The demographic questionnaire collected participants' age, gender, education, and grade point average (GPA). The Critical Thinking Disposition Scale (CTDS) was used to measure the dispositional aspect of critical thinking. This scale comprises 11 items, employing a five-point Likert-type response format (1 = strongly disagree; 2 = disagree; 3 = neutral; 4 = agree; 5 = strongly agree). Total scores range from 11 to 55. The first seven items reflect the level of critical openness, with a subscore ranging from 7 to 35, and the last four items indicate the level of reflective skepticism, with a subscore ranging from 4 to 20. Higher CTDS scores indicate a greater degree of critical thinking [ 23 ].
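The scoring rules just described are mechanical enough to express in code; the following is a minimal Python sketch (the function name is ours for illustration, not part of the published scale):

```python
def score_ctds(responses):
    """Score the 11-item CTDS from Likert responses coded 1-5.

    Items 1-7 form Critical Openness (range 7-35); items 8-11 form
    Reflective Skepticism (range 4-20); the total ranges from 11 to 55.
    """
    if len(responses) != 11 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("Expected 11 Likert responses between 1 and 5.")
    critical_openness = sum(responses[:7])
    reflective_skepticism = sum(responses[7:])
    return {"critical_openness": critical_openness,
            "reflective_skepticism": reflective_skepticism,
            "total": critical_openness + reflective_skepticism}

print(score_ctds([4, 5, 4, 3, 4, 5, 4, 3, 4, 4, 5]))
```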

Translation of the CTD scale

Following correspondence with the instrument developer, Dr. Sosu, and obtaining permission, the scale underwent translation using the standard backward-forward method. Initially, the scale was independently and simultaneously translated from English to Persian by two translators proficient in both Farsi and English. In the subsequent phase, these translations were juxtaposed and merged into a unified translation. This facilitated the comparison and identification of discrepancies, which were then rectified based on feedback from a panel of experts, including two psychometric experts and two nursing professors. In the third stage, the resulting Persian version was given to two translators fluent in Persian and English (distinct from the initial translators) to translate it back to English, thereby completing the translation of the scale from Persian to English. In the fourth stage, the two English translations were compared, and any disparities were resolved by the experts, culminating in a single translation. Subsequently, the prefinal version was evaluated for content and face validity.

Face validity

Face validity refers to the degree to which a measurement method seems to accurately measure the intended construct [ 24 ]. In a qualitative assessment of face validity, 10 nursing students were asked to evaluate factors such as the clarity of phrases and words, the coherence and relevance of items, the potential for ambiguity, and the necessity of removing or combining items. Additionally, two nursing professors and two psychometric specialists scrutinized the scale to determine whether it indeed appeared to measure its intended construct.

Content validity

Content validity examines the extent to which a collection of scale items aligns with the pertinent content domain of the construct it aims to assess [ 24 ]. The qualitative evaluation of the content validity involved a panel consisting of two nursing professors in the field of nursing education and two statisticians who were experts in psychometric topics. Their input on item placement, word selection, grammar adherence, and scoring accuracy of the scale and its instructions was solicited, with their feedback serving as the foundation for any required adjustments.

Construct validity

Exploratory factor analysis

To explore the number of existing subscales and potential factors, exploratory factor analysis (EFA) was performed using principal component analysis (PCA) with varimax rotation. The scree plot and parallel analysis suggested the number of existing factors. The scree plot displays the eigenvalues of each factor extracted from the data in descending order. The number of factors to retain was determined by examining the slope of the curve: the point at which the sharply decreasing slope levels off indicates the number of factors that capture the most variance in the data.
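A minimal sketch of this EFA step in Python, assuming the third-party factor_analyzer package and a hypothetical CSV of item responses (the study itself used IBM SPSS, and the file name here is invented):

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Hypothetical data file: 390 rows (students) x 11 columns (CTDS items)
df = pd.read_csv("ctds_responses.csv")

fa = FactorAnalyzer(n_factors=2, rotation="varimax", method="principal")
fa.fit(df)

eigenvalues, _ = fa.get_eigenvalues()  # plot these in descending order for a scree plot
print("Eigenvalues:", eigenvalues.round(2))
print(pd.DataFrame(fa.loadings_, index=df.columns,
                   columns=["Factor 1", "Factor 2"]).round(2))
```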

Confirmatory factor analysis

To confirm the findings of the EFA, confirmatory factor analysis (CFA) was conducted using MPLUS to confirm the two-factor structure, with the items loaded on each factor as identified in the EFA. Model fit indices, including the comparative fit index (CFI), the standardized root mean square residual (SRMR), and the root mean square error of approximation (RMSEA), with cutoff points of > 0.95, < 0.08, and < 0.06, respectively, were used [ 25 ]. Factor loadings were reported using standardized beta coefficients to evaluate the strength of the relationships between items and factors, and p < 0.05 was considered significant for factor loadings.
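The cutoff logic is easy to make explicit; here is a minimal Python sketch that simply checks reported fit indices against the thresholds named above (the example values are hypothetical, and such cutoffs are guidelines rather than strict rules):

```python
def acceptable_fit(cfi, rmsea, srmr):
    """Check CFA fit indices against the cutoffs used here:
    CFI > 0.95, RMSEA < 0.06, SRMR < 0.08. Overall fit is usually
    judged across several indices rather than any single one."""
    return {"CFI": cfi > 0.95, "RMSEA": rmsea < 0.06, "SRMR": srmr < 0.08}

print(acceptable_fit(cfi=0.96, rmsea=0.05, srmr=0.04))
```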

Convergent and discriminant validity

The mean scores for Critical Openness and Reflective Skepticism were computed. Convergent and discriminant validity were checked via correlations between students' GPA, as an indicator of academic achievement, and the scores of the subscales of the CTDS.

Reliability

To assess reliability, internal consistency was tested using Cronbach's alpha coefficient calculations for the total score and subscores. To assess test-retest consistency, a group of 40 individuals from the target population completed the scale twice over a two-week period. Their scores from both sessions were analyzed to determine test-retest reliability, and the intraclass correlation coefficient (ICC) was calculated.
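Cronbach's alpha itself is a short formula; here is a minimal numpy sketch with hypothetical responses (the study used SPSS for this calculation):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x n_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

scores = [[4, 5, 4, 4],   # hypothetical: 5 respondents x 4 items
          [3, 3, 2, 3],
          [5, 5, 5, 4],
          [2, 3, 3, 2],
          [4, 4, 5, 4]]
print(f"alpha = {cronbach_alpha(scores):.2f}")
```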

Ethical considerations

The current study underwent assessment and received approval from the Research Ethics Committee at Shahid Beheshti University of Medical Sciences (Ethical code: IR.SBMU.RETECH.REC.1403.013). Permissions were duly acquired from the pertinent authorities at the research sites as well as the developer of the original scale. Nursing students were provided with comprehensive information regarding the study’s objectives, their right to withdraw from participation, and the confidentiality of their data. Informed consent was obtained from all participating students. All procedures adhered strictly to the appropriate guidelines and regulations.

Data analysis

Descriptive statistics, including the mean, standard deviation, median, range, and frequency, were used to describe the population and their critical thinking scores. Analysis of demographic characteristics, EFA, and reliability tests were performed using IBM SPSS Statistics (Version 27). CFA was performed using MPLUS (Program Copyright © 1998–2017 Muthén & Muthén Version 8) software.

Characteristics of the participants

A total of 390 participants (mean age = 21.74 (2.1) years; 64% female) completed the questionnaire.

Face validity and content validity

Face validity was established through the feedback of 10 nursing students, while content validity was assessed by four expert specialists. No alterations were made to the items in terms of their simplicity and clarity during the evaluation of both face and content validity.

The scree plot and parallel analysis suggested a two-factor solution (Fig.  1 ), which accounted for 55% of the total variance in the scores (Table  1 ). Factor loadings revealed a clear factor structure, with items loading on two factors. The rotated factor loadings are presented in Table  2 . The items clustered together on two distinct factors, with no cross-loadings observed.

Figure 1. Scree plot of the Persian version of the Critical Thinking Disposition Scale

The two factors were interpreted as follows: Factor 1, labeled "Critical Openness," comprised items related to the level of critical openness; Factor 2, labeled "Reflective Skepticism," included items reflecting the level of reflective skepticism. These factors align well with previously established dimensions.

CFA confirmed the model including the two factors with their respective indicators based on the EFA results. The CFA model demonstrated acceptable fit to the data: χ² (55) = 1500.38, p  < 0.001; RMSEA = 0.067; CFI = 0.95; TLI = 0.93, SRMR = 0.041. Although the chi-square test was significant, other fit indices indicated a reasonably good fit to the data.

The standardized factor loadings ranged from 0.61 to 0.77, Fig.  2 , all of which were statistically significant ( p  < 0.001) (Table  3 ). These loadings provided further support for the factor structure identified in the EFA.

Figure 2. Factor structure of the Persian version of the Critical Thinking Disposition Scale

Convergent and discriminant validity were examined by determining factor scores based on item allocation. The analysis revealed a sample mean of 28.65 (SD = 2.7) for the Critical Openness factor and a mean of 16.8 (SD = 1.8) for Reflective Skepticism. A weak yet statistically significant correlation was observed between students’ GPA and their level of critical openness ( r  = 0.15, p  = 0.003), indicating a slight association between academic performance and this aspect of cognitive disposition.

The reliability of the CTDS-P was assessed through rigorous statistical analysis. Internal consistency was robust, as indicated by a Cronbach’s alpha coefficient of 0.87 for the overall scale, demonstrating the coherence of the items within the measure. Subscale analysis revealed strong reliability, with values of 0.83 for critical openness and 0.80 for reflective skepticism, indicating the consistency of responses across different dimensions of the construct. Additionally, the scale exhibited excellent test-retest reliability, as evidenced by an Intraclass Correlation Coefficient (ICC) of 0.96, with a 95% confidence interval ranging from 0.93 to 0.98, suggesting high stability and consistency of the scores over time.

The CTDS has undergone translation and cross-validation in different populations across the USA [ 26 ], Norway [ 27 ], Brazil [ 28 ], Spain [ 29 ], Turkey [ 30 ], and Vietnam [ 31 ]. The reliability and validity of this scale have been demonstrated in studies conducted among high school students [ 29 ] and university students [ 26 , 27 , 28 , 30 , 31 ]. The CTDS was first introduced as a tool to measure critical thinking disposition in undergraduate and postgraduate students [ 4 ].

This study used comprehensive reliability and validity tests to validate the CTDS in the Persian language and in Iranian nursing student participation. This study revealed the existence of two factors, critical openness and reflective skepticism. These factors align well with previously established studies [ 4 , 27 , 30 , 31 ]. Conversely, Spanish, Brazilian, and US versions demonstrated that the one-factor model fit better for their population [ 26 , 28 , 29 ].

In the face validity and content validity tests, neither the simplicity nor the clarity of the items was altered. Content validity was assessed qualitatively. Like previous studies that measured it quantitatively, our study also confirmed the content validity [ 28 , 31 ].

The internal consistency of the CTDS demonstrated the coherence of the items within the measure. Several studies have reported similar internal consistency values for the CTDS, with Cronbach’s alpha measuring 0.88 according to Nguyen et al. 2023. Sosu et al. 2013, Akin et al. 2015, Yockey 2016, Bravo et al. 2020, and Gerdts-Andresen et al. 2022 also reported values of 0.79, 0.78, 0.79, 0.77, and 0.76, respectively. It is widely recognized that a Cronbach’s alpha coefficient of 0.70 or higher is acceptable [ 32 ]. Consequently, the CTDS has exhibited strong internal consistency across diverse linguistic and cultural contexts. In addition, the scale exhibited excellent test-retest reliability, thereby indicating a high level of stability and consistency in scores across time. Additionally, our study demonstrated an outstanding ICC [ 33 , 34 ].

There are several limitations to this study. Initially, the self-assessment survey may have been prone to social desirability bias, leading to potential overestimation of reported measures. To mitigate bias, this study utilized an anonymous survey. Moreover, the study used a cross-sectional design, which prevented the establishment of prospective predictive validity.

To conclude, our investigation establishes the CTDS as a reliable and valid tool for evaluating critical thinking disposition among Iranian nursing students. With its two-factor structure of “Critical Openness” and “Reflective Skepticism,” the scale offers valuable insights into cognitive disposition. Its robust psychometric properties underscore its potential for enhancing critical thinking skills in healthcare education and practice. Further research avenues may explore its nuanced applications in fostering analytical reasoning and problem-solving abilities.

Data availability

The datasets generated and/or analysed during the current study are not publicly available due to the necessity to ensure participant confidentiality under the policies and laws of the country, but are available from the corresponding author on reasonable request.

Susiani TS, Salimi M, Hidayah R, editors. Research based learning (RBL): How to improve critical thinking skills? SHS Web of Conferences; 2018: EDP Sciences.

Nes AAG, Riegel F, Martini JG, Zlamal J, Bresolin P, Mohallem AGC, et al. Brazilian undergraduate nursing students’ critical thinking need to be increased: a cross-sectional study. Revista brasileira de enfermagem. 2022;76:e20220315.


Wang Y, Nakamura T, Sanefuji W. The influence of parental rearing styles on university students’ critical thinking dispositions: the mediating role of self-esteem. Think Skills Creativity. 2020;37:100679.


Sosu EM. The development and psychometric validation of a critical thinking Disposition Scale. Think Skills Creativity. 2013;9:107–19.

Hart C, Da Costa C, D’Souza D, Kimpton A, Ljbusic J. Exploring higher education students’ critical thinking skills through content analysis. Think Skills Creativity. 2021;41:100877.

Zhang C, Fan H, Xia J, Guo H, Jiang X, Yan Y. The effects of reflective training on the disposition of critical thinking for nursing students in China: a controlled trial. Asian Nurs Res. 2017;11(3):194–200.

World Health Organization. Global standards for the initial education of professional nurses and midwives. World Health Organization; 2009.

World Health Organization. State of the world's nursing 2020: investing in education, jobs and leadership. 2020.

Dehghanzadeh S, Jafaraghaee F. Comparing the effects of traditional lecture and flipped classroom on nursing students’ critical thinking disposition: a quasi-experimental study. Nurse Educ Today. 2018;71:151–6.


Kermansaravi F, Navidian A, Kaykhaei A. Critical thinking dispositions among junior, senior and graduate nursing students in Iran. Procedia-Social Behav Sci. 2013;83:574–9.

Carvalho DPSRP, Vitor AF, Cogo ALP, Bittencourt GKGD, Santos VEP, Ferreira Júnior MA. Measurement of general critical thinking in undergraduate nursing students: experimental study. Texto Contexto-Enfermagem. 2019;29:e20180229.

Swing VK. Early identification of transformation in the proficiency level of critical thinking skills (CTS) for the first semester associate degree nursing (ADN) student. Capella University; 2014.

Dembitsky SL. Creating capable graduate nurses: a study of congruence of objectives and assessments of critical thinking in a southeastern state. Capella University; 2010.

López M, Jiménez JM, Martín-Gil B, Fernández-Castro M, Cao MJ, Frutos M, et al. The impact of an educational intervention on nursing students’ critical thinking skills: a quasi-experimental study. Nurse Educ Today. 2020;85:104305.

Wong SHV, Kowitlawakul Y. Exploring perceptions and barriers in developing critical thinking and clinical reasoning of nursing students: a qualitative study. Nurse Educ Today. 2020;95:104600.

Azizi-Fini I, Hajibagheri A, Adib-Hajbaghery M. Critical thinking skills in nursing students: a comparison between freshmen and senior students. Nurs Midwifery Stud. 2015;4(1).

Carter AG, Creedy DK, Sidebotham M. Measuring critical thinking in pre-registration midwifery students: a multi-method approach. Nurse Educ Today. 2018;61:169–74.

Scheffer BK, Rubenfeld MG. A consensus statement on critical thinking in nursing. SLACK Incorporated Thorofare, NJ; 2000. pp. 352–9.

Paul SA. Assessment of critical thinking: a Delphi study. Nurse Educ Today. 2014;34(11):1357–60.

Beaton DE, Bombardier C, Guillemin F, Ferraz MB. Guidelines for the process of cross-cultural adaptation of self-report measures. Spine. 2000;25(24):3186–91.


Guilford JP. Psychometric methods. 1954.

Zhang S, Hong S. Sample size in factor analysis. Psychol Methods. 1999;4(1):84–99.

Sosu E. The development and psychometric validation of a critical thinking disposition scale. Think Skills Creativity. 2013;9:107–19.

Wellington J, Szczerbinski M. Research methods for the social sciences. A&C Black; 2007.

Hancock G, Mueller R. Structural equation modeling: a second course (2nd ed). Information Age Publishing; 2020.

Yockey RD. Validation study of the critical thinking dispositions scale: a brief report. North Am J Psychol. 2016;18(1).

Gerdts-Andresen T, Hansen MT, Grøndahl VA. Educational Effectiveness: validation of an instrument to measure students’ critical thinking and Disposition. Int J Instruction. 2022;15(1):685–700.

Luiz FS, Leite ICG, Carvalho PHBd, Püschel VAA, Braga LM, Dutra HS, et al. Validity evidence of the critical thinking Disposition Scale, Brazilian version. Acta Paulista De Enfermagem. 2021;34:eAPE00413.

Bravo MJ, Galiana L, Rodrigo MF, Navarro-Perez JJ, Oliver A. An adaptation of the critical thinking disposition scale in Spanish youth. Think Skills Creativity. 2020;38:100748.

Akın A, Hamedoglu MA, Arslan S, Akın Ü, Çelik E, Kaya Ç, et al. The adaptation and validation of the Turkish version of the critical thinking Disposition Scale (CTDS). Int J Educational Researchers. 2015;6(1):31–5.


Nguyen TV, Kuo CL, Wang CY, Le NT, Nguyen MTT, Chuang YH. Assessment of the psychometric properties of the Vietnamese version of the critical thinking Disposition Scale. Nurse Educ Today. 2023;127:105848.

Ursachi G, Horodnic IA, Zait A. How reliable are measurement scales? External factors with indirect influence on reliability estimators. Procedia Econ Finance. 2015;20:679–86.

Weir JP. Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM. J Strength Conditioning Res. 2005;19(1):231–40.

Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15(2):155–63.


Acknowledgements

The authors would like to thank the nursing students who participated in this study.

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Student Research Committee, School of Nursing and Midwifery, Shahid Beheshti University of Medical Sciences, Tehran, Iran

Hossein Bakhtiari-Dovvombaygi, Kosar Pourhasan, Akbar Zare-Kaseb & Amirreza Rashtbarzadeh

Medical Ethics and Law Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran

Hossein Bakhtiari-Dovvombaygi, Abbas Abbaszadeh & Fariba Borhani

Institute of Higher Education and Research in Healthcare, Department of Biology and Medicine, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland

Zahra Rahmaty

School of Nursing & Midwifery, Shahid Beheshti University of Medical Sciences, Tehran, Iran

Fariba Borhani


Contributions

HBD and FB were involved in the conception and organization of the study. KP, AZK, and AR were involved in the execution and data collection of the study; ZR and AA participated in the design and/or execution of the statistical analysis. KP, HBD, and AZK prepared the first draft of the manuscript. All authors contributed to the preparation and critical review of the manuscript, and all approved the final version.

Corresponding author

Correspondence to Fariba Borhani .

Ethics declarations

Ethics approval and consent to participate, consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.



About this article

Cite this article

Bakhtiari-Dovvombaygi, H., Pourhasan, K., Rahmaty, Z. et al. Evaluation of cross-cultural adaptation and validation of the Persian version of the critical thinking disposition scale: methodological study. BMC Nurs 23 , 463 (2024). https://doi.org/10.1186/s12912-024-02129-y


Received : 14 May 2024

Accepted : 27 June 2024

Published : 08 July 2024



Keywords

  • Critical thinking
  • Nursing student
  • Psychometric properties



  • Open access
  • Published: 08 July 2024

Chinese version of the Tendency to Avoid Physical Activity and Sport (TAPAS) scale: testing unidimensionality, measurement invariance, concurrent validity, and known-group validity among Taiwanese youths

  • Yi-Ching Lin   ORCID: orcid.org/0000-0002-6835-7116 1 ,
  • Jung-Sheng Chen   ORCID: orcid.org/0000-0003-3187-9479 2 ,
  • Nadia Bevan   ORCID: orcid.org/0000-0002-5213-4057 3 ,
  • Kerry S. O’Brien   ORCID: orcid.org/0000-0003-3145-6038 3 ,
  • Carol Strong   ORCID: orcid.org/0000-0002-2934-5382 4 ,
  • Meng-Che Tsai   ORCID: orcid.org/0000-0003-1057-3315 5 , 6 ,
  • Xavier C. C. Fung   ORCID: orcid.org/0000-0002-0170-6203 7 ,
  • Ji-Kang Chen   ORCID: orcid.org/0000-0001-7762-3888 8 ,
  • I-Ching Lin   ORCID: orcid.org/0000-0002-1418-4977 9 , 10 , 11 ,
  • Janet D. Latner 12 &
  • Chung-Ying Lin   ORCID: orcid.org/0000-0002-2129-4242 13 , 14 , 15 , 16  

BMC Psychology volume 12, Article number: 381 (2024)

Background and objectives

Psychosocial factors affect individuals’ desire for physical activity. A newly developed instrument, the Tendency to Avoid Physical Activity and Sport (TAPAS) scale, has been designed to assess the avoidance of physical activity. Because cultural differences could be decisive factors, the present study aimed to translate the TAPAS into Chinese (Mandarin) and validate it among Taiwanese youths, with further cross-cultural comparisons anticipated.

Methods

A standard translation procedure (i.e., forward translation, back translation, and reconciliation) was used to translate the English TAPAS into the Chinese TAPAS. Following translation, 608 youths (mean [SD] age 29.10 [6.36] years; 333 [54.8%] women) were recruited via snowball sampling and completed an online survey. All participants completed the Chinese TAPAS and additional instruments assessing weight stigma and psychological distress. Confirmatory factor analysis (CFA) was used to examine the factor structure of the Chinese TAPAS, and multigroup CFA was used to examine measurement invariance across gender (men vs. women) and weight status (overweight vs. non-overweight). Pearson correlations were used to examine concurrent validity; independent t-tests between gender groups and weight status groups were used to examine known-group validity.

Results

Consistent with the English version, the Chinese TAPAS was found to have a one-factor structure, as evidenced by the CFA results. The structure was invariant across gender and weight status groups, as evidenced by the multigroup CFA results. Concurrent validity was supported by significant associations with the related constructs assessed (r = 0.326 to 0.676; p < 0.001). Known-group validity was supported by significant differences in TAPAS total scores between gender and weight status groups (p = 0.004 and < 0.001; Cohen’s d = 0.24 and 0.48).

Conclusions

The Chinese version of the TAPAS is a valid and reliable instrument for assessing individuals’ avoidance of physical activity and sports due to underlying psychosocial issues among Taiwanese youths. It is anticipated to be applicable to larger Asian populations and to cross-cultural comparisons, supporting further exploration in health, behavioral, and epidemiological research and practice.

Introduction

The benefits of physical activity engagement have been well documented: physical activity is one of the most cost-effective methods to improve general health and prevent physical health conditions [ 1 , 2 ] as well as chronic diseases, such as overweight (or obesity) and cardiovascular disease [ 3 , 4 ]. Further, physical activity enhances cognitive and mental health and decreases the risk of mental conditions such as depression, anxiety, and panic disorders [ 5 , 6 ]. Therefore, healthcare providers and governments need to focus on improving the physical activity levels of the general population.

Although the benefits of physical activity are well acknowledged worldwide by healthcare providers and governments, barriers to physical activity engagement have also been well documented [7, 8, 9]. For example, a lack of accessibility may decrease individuals’ capability to engage in exercise, or a lack of time may prevent individuals from participating in sports or exercise [7, 10]. Some barriers can be easily tackled by healthcare providers and related stakeholders (e.g., improving public facilities for physical activity and helping people manage their time to secure time for exercise). However, some psychosocial barriers may be harder to address, such as low motivation, reluctance, or outright avoidance of physical activity.

Individuals’ reluctance and avoidance of physical activity are suggested to be key barriers to greater participation [11]. As such, a better understanding of the cognitions underlying physical activity avoidance could help healthcare providers design more effective programs to enhance physical activity engagement [12] and address the actual factors causing the tendency to avoid physical activity and sports [11, 13], for example, weight or body image issues. Weight stigma and concerns about physical appearance are important factors associated with individuals’ physical activity engagement [12, 14, 15, 16]. Weight stigma is defined as weight-related discriminatory behaviors and ideologies directed toward people who have concerns about their weight and size, including weight-related self-stigma and perceived weight stigma [17]. Weight-stigmatized individuals frequently experience negative perceptions, stereotypes, and exclusion. An overweight body is often misjudged as reflecting personal health and physical activity inadequacies, potentially leading to decreased participation and motivation in physical activities [18, 19]. Given these factors, the prevalence of weight stigma and its notable impact on mental and physical health emphasize the need to investigate its occurrence, especially in sports and exercise environments known for weight discrimination [20, 21]. In addition, body dissatisfaction comprises negative cognitions and emotions toward one’s body image [22]. Body dissatisfaction has been positively associated with eating disorders, emotional disorders, suicide attempts, impaired well-being, and physical activity avoidance [23, 24, 25]. The correlation between body dissatisfaction and adverse health outcomes may be attributed to feelings of shame and experiences of social isolation [26]. Previous studies suggest that negative self-perceptions, such as body dissatisfaction, commonly originate in environments that emphasize physical evaluation and physique-centric physical pursuits, and the shame associated with body dissatisfaction has been linked with decreased engagement in physical activity and less-than-optimal athletic experiences [27]. However, there has been very little evidence on how weight stigma and body image issues may impact the tendency to avoid physical activity. To address this research gap, a new scale, the Tendency to Avoid Physical Activity and Sport (TAPAS) scale, was developed [11, 12] to assess the avoidance of physical activity due to psychosocial factors such as weight stigma and body image issues.

The Consensus-based Standards for selecting health status Measurement Instruments (COSMIN) guide [28] was applied to construct the TAPAS as a psychometric instrument. Weight stigma and body image experts followed traditional scale development guidelines [29, 30, 31] to generate items assessing this concept. After a rigorous mixed-method approach, the initial pool of the TAPAS included 21 items for psychometric testing, and the finalized TAPAS included 10 items with satisfactory item properties (e.g., high factor loadings). Moreover, the 10-item TAPAS was found to have a one-factor structure with satisfactory concurrent validity (with weight stigma; r = 0.49) and known-group validity (women had significantly higher TAPAS scores than men; Cohen’s d = 0.49) in an English-speaking (Australian) sample [12]. However, the psychometric properties of the TAPAS have never been examined in Taiwanese populations. The barriers to engaging in physical activity and sports might vary considerably depending on cultural factors [32, 33]. For example, people from the West engage in physical activities mainly for competition and skill improvement, while people from the East consider social affiliation and wellness the main reasons for participating in sports. Therefore, it is essential to establish whether the TAPAS retains good psychometric properties in other populations and languages, such as the Taiwanese Mandarin-speaking population in the present study.

The purpose of the present study was to translate the English version of the TAPAS into a Traditional Chinese version and to examine the psychometric properties of the Chinese TAPAS among Taiwanese youths. The following psychometric properties were examined: (i) the factor structure of the TAPAS, with its hypothesized one-factor structure; (ii) the internal consistency/reliability of the TAPAS; (iii) the measurement invariance of the TAPAS, with the assumption that the TAPAS is invariant across gender and weight status; (iv) the concurrent validity of the TAPAS, with hypothesized relationships between the TAPAS and associated constructs related to weight-related self-stigma, perceived weight stigma, and psychological distress; and (v) the known-group validity of the TAPAS, with the hypothesis that the TAPAS could effectively distinguish differences in the avoidance of physical activity and sport between genders (i.e., men vs. women) and between weight status groups (i.e., overweight vs. non-overweight). Once a reliable and validated Chinese version of the TAPAS is developed, it is anticipated to be applicable to larger Asian populations and to cross-cultural comparisons, furthering understanding in health, behavioral, and epidemiological research and practice.

Methods

Participants and recruitment procedure

The National Cheng Kung University Human Research Ethics Committee (Approval No. NCKU HREC-E-110-486-2) and the National Cheng Kung University Hospital Institutional Review Board (IRB No. A-ER-111-445) approved the study protocol. Data were collected using snowball sampling via the online survey platform SurveyMonkey, and at least 400 participants (200 overweight and 200 non-overweight) were required for the present study [34]. First, the online survey link was distributed with the assistance of faculty members at several universities in Taiwan (one university in Northern Taiwan, one in Central Taiwan, and two in Southern Taiwan).

The participants who received the survey link were encouraged to share it with their friends or colleagues. The inclusion criteria were: (i) aged between 20 and 40 years; (ii) able to read traditional Chinese characters; (iii) having access to complete the online survey; and (iv) providing e-form consent for participation. Data were collected between 31 August and 31 December 2022. SurveyMonkey identified the participants’ IP addresses to prevent the same participant from completing the online survey more than once. In addition, an experienced research assistant cleaned the data by checking (i) whether the same email address or phone number was used more than once and (ii) whether there were unreasonable values (e.g., a height over 200 cm).
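
As a rough sketch of how such cleaning might look in R (the software used for the analyses below), assuming a hypothetical data frame dat with columns email, phone, and height_cm; the authors’ actual variable names and scripts are not reported:

  # Keep only the first record for any repeated email address or phone number
  dat <- dat[!(duplicated(dat$email) | duplicated(dat$phone)), ]
  # Drop unreasonable values, e.g., heights over 200 cm
  dat <- dat[dat$height_cm <= 200, ]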

Translation of the Tendency to Avoid Physical Activity and Sport (TAPAS) scale

We used a standard translation procedure to translate the English TAPAS into the Chinese TAPAS with traditional Chinese characters [35]. In the first step, two bilingual translators who were native Chinese speakers with expertise in English independently conducted forward translation (i.e., from English to Chinese). In the second step, the two forward translators worked with the corresponding author to review the two forward-translated versions of the TAPAS and generated a reconciled Chinese version. In the third step, a bilingual translator who was a native Chinese speaker with expertise in English and was unaware of the original TAPAS back-translated the reconciled Chinese version into English. In the fourth step, the two forward translators, the back translator, and the corresponding author held a panel meeting with other researchers with expertise in pediatrics, public health, psychometrics, physical activity, and weight to generate a final draft of the Chinese version of the TAPAS. In the last step, three university students were invited to read the final draft (pre-survey) and offer wording suggestions to finalize the Chinese version for formal psychometric evaluation. The students were given instructions on how to evaluate the wording in a Taiwanese cultural context. All three students agreed that the final draft was readable without difficulty and suggested no changes, so none were made. Moreover, the original developers (who are also the present study’s coauthors) assessed the Chinese version of the TAPAS to ensure its equivalence to the original TAPAS at four levels: semantic, idiomatic, experiential, and conceptual.

Measures for concurrent validity (please see appendix for all the measures’ details)

Demographic measures.

The participants were asked for the following demographic information: age (reported in years), gender (reported as man, woman, or other), height (reported in cm), and weight (reported in kg). One additional item asked how many days per week the participants engaged in physical activity during which breathing was somewhat harder than normal.

Tendency to Avoid Physical Activity and Sport (TAPAS) scale

The TAPAS is a 10-item instrument assessing an individual’s tendency to avoid participating in physical activity and sport due to weight or appearance concerns. Each TAPAS item was rated on a five-point Likert scale (1 = strongly disagree; 5 = strongly agree), and the total TAPAS score sums the 10 item scores (range 10 to 50). A higher TAPAS total score indicates a stronger tendency to avoid physical activity and sport [11]. The original TAPAS is an English version with good internal consistency (Cronbach’s α = 0.94) [12].
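
As a minimal sketch of this scoring rule in R, assuming hypothetical item columns tapas1 to tapas10 (each scored 1 to 5) in a data frame dat:

  item_cols <- paste0("tapas", 1:10)  # hypothetical column names
  dat$tapas_total <- rowSums(dat[, item_cols])
  # By construction, totals must fall between 10 and 50
  stopifnot(all(dat$tapas_total >= 10 & dat$tapas_total <= 50, na.rm = TRUE))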

Weight Bias Internalization Scale (WBIS)

The WBIS is an 11-item instrument assessing the level of weight-related self-stigma. Each WBIS item was rated using a five-point Likert scale (1 = strongly disagree; 5 = strongly agree), and a total WBIS score summed the 11 WBIS item scores (total score between 11 and 55). A higher WBIS total score indicates a higher level of weight-related self-stigma [ 36 , 37 ]. The original WBIS is an English version with good internal consistency (Cronbach’s α = 0.90) [ 36 ]. The present study used the Chinese version of WBIS, which has good psychometric properties [ 38 , 39 ]. Moreover, the internal consistency of the WBIS used in the present sample was satisfactory: Cronbach’s α = 0.82; McDonald’s ω = 0.87.

Weight Self-Stigma Questionnaire (WSSQ)

The WSSQ is a 12-item instrument that also assesses the level of weight-related self-stigma. Each WSSQ item was rated using a five-point Likert scale (1 = strongly disagree; 5 = strongly agree), and a total WSSQ score summed the 12 WSSQ item scores (total score between 12 and 60). A higher WSSQ total score indicates a higher level of weight-related self-stigma [ 40 , 41 ]. The original WSSQ is an English version with good internal consistency (Cronbach’s α = 0.88) [ 40 ]. The present study used the Chinese version of WSSQ, which has good psychometric properties [ 39 , 42 ]. Moreover, the internal consistency of the WSSQ used in the present sample was satisfactory: Cronbach’s α = 0.92; McDonald’s ω = 0.92.

Perceived Weight Stigma Scale (PWSS)

The PWSS is a 10-item instrument assessing the level of perceived weight stigma. Each PWSS item was rated using a dichotomized scale (0 = no; 1 = yes), and a total PWSS score summed the 10 PWSS item scores (total score between 0 and 10). A higher PWSS total score indicates a higher level of perceived weight stigma [ 43 , 44 , 45 ]. The PWSS has been validated in different language versions with good internal consistency (e.g., Cronbach’s α = 0.85 in the Thai version; 0.90 in the Indonesian version) [ 17 , 46 , 47 , 48 ]. The present study used the Chinese version of PWSS, which has good psychometric properties [ 49 , 50 , 51 ]. Moreover, the internal consistency of the PWSS used in the present sample was satisfactory: Cronbach’s α = 0.86; McDonald’s ω = 0.87.

21-item Depression, Anxiety, Stress Scale (DASS-21)

The DASS-21 is a 21-item instrument assessing the level of psychological distress. Each DASS-21 item was rated using a four-point Likert scale (0 = did not apply to me at all; 3 = applied to me very much or most of the time) and a total DASS-21 score summed the 21 DASS-21 item scores (total score between 0 and 63). A higher DASS-21 total score indicates a higher level of psychological distress [ 52 ]. The original DASS-21 is an English version with good internal consistency (Cronbach’s α = 0.80 to 0.91) [ 53 ]. The present study used the Chinese version of DASS-21, which has good psychometric properties [ 54 ]. Moreover, the internal consistency of the DASS-21 used in the present sample was satisfactory: Cronbach’s α = 0.94; McDonald’s ω = 0.94.

Statistical analysis

For the participants’ weight status, their weight and height information was used to calculate body mass index (BMI; kg/m²) and classify the participants as overweight (> 24 kg/m²) or non-overweight according to Taiwanese norms [55]. For the psychometric testing, internal consistency, factor structure (i.e., whether the TAPAS has a one-factor structure), measurement invariance, concurrent validity, and known-group validity of the TAPAS were examined. In addition to the psychometric testing, a mediation model was performed to examine the mediator role of the TAPAS in the association between weight status and psychological distress. Internal consistency was computed using the psych package in R; factor structure and measurement invariance were tested using the lavaan package in R; concurrent validity and known-group validity were examined using IBM SPSS version 20.0. The mediation model was performed using Hayes’ PROCESS macro (Model 4) in SPSS version 20.0.
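
A minimal sketch of the weight-status classification in R, assuming hypothetical weight_kg and height_cm columns and the Taiwanese cut-off cited above:

  dat$bmi <- dat$weight_kg / (dat$height_cm / 100)^2  # BMI in kg/m^2
  dat$weight_status <- ifelse(dat$bmi > 24, "overweight", "non-overweight")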

Internal consistency was assessed using Cronbach’s α and McDonald’s ω; α and ω values larger than 0.7 indicate good internal consistency [56]. The one-factor structure of the TAPAS was examined using confirmatory factor analysis (CFA) with a diagonally weighted least squares (DWLS) estimator. The DWLS estimator was used because it is suitable for categorical scales, such as the Likert scale used in the TAPAS. We used CFA directly to test a unidimensional structure because the original TAPAS was found to have a one-factor structure in an exploratory factor analysis (EFA) [12]. Although some may argue that a translated instrument should first be tested using EFA, we considered direct CFA testing appropriate because the TAPAS has a theoretical background, was verified by experts in the field to be unidimensional, and has unidimensional evidence from the English version. In the CFA, the following fit indices were used to examine whether the present data fit the one-factor structure of the TAPAS: comparative fit index (CFI) and Tucker-Lewis index (TLI) > 0.9; root mean square error of approximation (RMSEA) and standardized root mean square residual (SRMR) < 0.08; and a non-significant χ² test [17, 57, 58].
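
A sketch of these reliability and CFA steps with the R packages named above (psych and lavaan), again assuming hypothetical item columns tapas1 to tapas10; this illustrates the general approach rather than the authors’ exact script:

  library(psych)
  library(lavaan)

  items <- dat[, paste0("tapas", 1:10)]
  psych::alpha(items)  # Cronbach's alpha
  psych::omega(items)  # McDonald's omega

  # One-factor CFA with the DWLS estimator for ordinal (Likert) items
  model <- 'TAPAS =~ tapas1 + tapas2 + tapas3 + tapas4 + tapas5 +
                     tapas6 + tapas7 + tapas8 + tapas9 + tapas10'
  fit <- cfa(model, data = dat, estimator = "DWLS", ordered = TRUE)
  fitMeasures(fit, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))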

Multigroup CFA was then used to examine whether the one-factor structure of the TAPAS is invariant across gender (i.e., men vs. women) and across weight status (i.e., overweight vs. non-overweight). Three nested models were constructed for the invariance test [59]: a configural model (neither factor loadings nor item intercepts constrained across groups), a metric invariance model (factor loadings, but not item intercepts, constrained to be equal across groups), and a scalar invariance model (both factor loadings and item intercepts constrained to be equal across groups). The fit indices ΔCFI, ΔTLI, ΔRMSEA, and ΔSRMR were then used to examine whether invariance is supported: invariance is supported when ΔCFI and ΔTLI > -0.01 together with ΔRMSEA and ΔSRMR < 0.01 [47, 60, 61]. However, ΔRMSEA < 0.03 has been proposed as acceptable for claiming metric invariance.
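
Continuing the sketch, the nested models can be fitted in lavaan by progressively adding equality constraints, and the Δ fit indices computed as simple differences between adjacent models (group = "gender" is a hypothetical column; with ordinal items, lavaan also handles threshold constraints, which this simplified sketch glosses over):

  fit_configural <- cfa(model, data = dat, group = "gender",
                        estimator = "DWLS", ordered = TRUE)
  fit_metric <- cfa(model, data = dat, group = "gender",
                    estimator = "DWLS", ordered = TRUE,
                    group.equal = "loadings")
  fit_scalar <- cfa(model, data = dat, group = "gender",
                    estimator = "DWLS", ordered = TRUE,
                    group.equal = c("loadings", "intercepts"))

  idx <- c("cfi", "tli", "rmsea", "srmr")
  fitMeasures(fit_metric, idx) - fitMeasures(fit_configural, idx)  # metric vs. configural
  fitMeasures(fit_scalar, idx) - fitMeasures(fit_metric, idx)      # scalar vs. metric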

The present study aimed to recruit more than 200 participants with overweight and more than 200 without because CFA requires a sample size of over 200 for unbiased estimation [34]. Because the present study tested measurement invariance across weight status using multigroup CFA, each subgroup (i.e., overweight and non-overweight) needed to satisfy this minimum requirement for CFA.

Because the original TAPAS was found to be associated with weight stigma [ 12 ], we hypothesized that the Chinese version of TAPAS would be associated with weight stigma measures (including WBIS, WSSQ, and PWSS). In addition, prior evidence shows that weight stigma is associated with psychological distress [ 62 ]. Therefore, we further hypothesized that the Chinese version of TAPAS would be associated with psychological distress measures (i.e., DASS-21).

For known-group validity, independent t-tests and Cohen’s d were used to examine whether the TAPAS total score could distinguish between genders (i.e., men and women) and weight status groups (i.e., overweight and non-overweight). We hypothesized that women would have higher TAPAS scores than men based on the findings from the original TAPAS [12], and that the overweight group would have higher TAPAS scores than the non-overweight group because the original TAPAS findings showed that BMI is associated with TAPAS scores [12].
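
A sketch of such a known-group comparison in R, with Cohen’s d computed manually from the pooled standard deviation (column names hypothetical; note that t.test() runs Welch’s t-test by default):

  t.test(tapas_total ~ gender, data = dat)

  cohens_d <- function(x, y) {
    nx <- length(x); ny <- length(y)
    pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
    (mean(x) - mean(y)) / pooled_sd
  }
  with(dat, cohens_d(tapas_total[gender == "woman"], tapas_total[gender == "man"]))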

Evidence suggests that weight status is associated with the TAPAS score, and the TAPAS score could be associated with psychological distress; therefore, a mediation model was tested to examine whether the TAPAS score mediates the association between weight status and psychological distress. In the mediation model, Model 4 of Hayes’ PROCESS macro was used with the following setup: weight status as the independent variable, TAPAS score as the mediator, and DASS-21 score as the dependent variable. A total of 5,000 bootstrap resamples were used to estimate the mediated effect of the TAPAS: when the 95% bootstrap confidence interval (CI) does not include 0, the mediated effect is supported.
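
The authors ran this model with the PROCESS macro in SPSS; as an illustrative alternative only, the same simple mediation can be specified in lavaan with bootstrapped confidence intervals (overweight assumed coded 0/1, dass21_total hypothetical):

  med_model <- '
    tapas_total ~ a * overweight                       # path a: weight status -> TAPAS
    dass21_total ~ b * tapas_total + cp * overweight   # path b and direct effect
    ab := a * b                                        # indirect (mediated) effect
  '
  med_fit <- sem(med_model, data = dat, se = "bootstrap", bootstrap = 5000)
  parameterEstimates(med_fit, boot.ci.type = "perc")  # supported if the ab CI excludes 0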

Results

Participants’ demographics

Table 1 presents the participants’ characteristics. Among the 608 participants (mean [SD] age 29.10 [6.36] years), over half were women (n = 333; 54.8%), and slightly less than half were men (n = 273; 44.9%). Two participants self-reported their gender as other (0.3%). On average, the present sample had a height of 166.43 (SD = 8.50) cm, a weight of 66.51 (SD = 14.96) kg, and a BMI of 23.89 (SD = 4.42) kg/m², and engaged in physical activity during which breathing was somewhat harder than normal on 2.08 (SD = 1.95) days per week. Accordingly, nearly half of the present sample was classified as overweight (n = 268; 44.1%). Table 1 additionally reports the mean total scores of the TAPAS, WBIS, WSSQ, PWSS, and DASS-21 for the present sample.

Score distribution and internal consistency results

Table 2 presents the score distributions for the TAPAS items. The distributions could be considered approximately normal: skewness ranged between −0.548 and 0.609, and kurtosis ranged between −1.126 and −0.487. Moreover, the mean scores of the 10 TAPAS items ranged between 2.29 and 3.45. Cronbach’s α (0.95) and McDonald’s ω (0.95) for the TAPAS indicated excellent internal consistency, and the corrected item-total correlations of the TAPAS items were relatively strong (0.511 to 0.881).

Confirmatory factor analysis with measurement invariance results

Except for the significant χ² test (χ² = 57.362; df = 35; p = 0.01), the CFA results supported the one-factor structure of the TAPAS: CFI = 0.998; TLI = 0.997; RMSEA = 0.032 (90% CI = 0.016, 0.047); SRMR = 0.042. Additionally, the factor loadings of the TAPAS items were relatively strong (0.519 to 0.912) (Table 2). The one-factor structure of the TAPAS was further found to be invariant across the two gender groups (i.e., men and women; participants reporting their gender as other were excluded from the invariance testing because of the very small sample size) and across the two weight status groups (i.e., overweight and non-overweight). All nested models showed satisfactory fit indices (CFI = 0.999 to 1.000; TLI = 1.000 to 1.001; RMSEA = 0.000 to 0.031; SRMR = 0.044 to 0.055).

Moreover, ΔCFI, ΔTLI, and ΔRMSEA were all 0.000 when comparing the metric invariance model with the configural model and the scalar invariance model with the metric invariance model across gender groups. ΔSRMR was 0.002 when comparing the metric invariance model with the configural model and −0.001 when comparing the scalar invariance model with the metric invariance model across gender groups. ΔCFI and ΔTLI were both −0.001 when comparing the metric invariance model with the configural model and the scalar invariance model with the metric invariance model across weight status groups. ΔRMSEA was 0.023 when comparing the metric invariance model with the configural model and 0.008 when comparing the scalar invariance model with the metric invariance model across weight status groups. ΔSRMR was 0.008 when comparing the metric invariance model with the configural model and 0.001 when comparing the scalar invariance model with the metric invariance model across weight status groups (Table 3).

Concurrent validity, known-group, and mediation results

Concurrent validity of the TAPAS was evidenced by significant correlations with other external measures: r  = 0.68 ( p  < 0.001) with WBIS; 0.66 ( p  < 0.001) with WSSQ; 0.33 ( p  < 0.001) with PWSS; and 0.37 ( p  < 0.001) with DASS-21 (Table  4 ).

Moreover, the known-group validity of the TAPAS was found in gender (men vs. women) and weight status (overweight vs. non-overweight) groups. Specifically, women reported significantly higher scores than men (Cohen’s d  = 0.24; p  = 0.004). People with overweight also reported significantly higher scores than those without overweight (Cohen’s d  = 0.48; p  < 0.001) (Table  5 ).

Lastly, the mediation model showed that TAPAS was a significant mediator in the association between weight status and psychological distress (unstandardized coefficient = 2.02; bootstrapping standard error = 0.40; 95% bootstrapping CI = 1.27, 2.83).

Discussion

The present paper sought to develop a Traditional Chinese version of the TAPAS and establish its psychometric properties in an Asian population. All hypotheses were supported. The one-factor structure (Hypothesis 1) and internal consistency (Hypothesis 2) of the Chinese version of the TAPAS were supported. The measurement invariance of the Chinese version of the TAPAS was verified (Hypothesis 3) across genders (women vs. men) and weight status groups (overweight vs. non-overweight). The concurrent validity of the TAPAS (Hypothesis 4) was supported via significant correlations with criterion instruments assessing related constructs. Moreover, the known-group validity of the TAPAS (Hypothesis 5) was supported by significant differences in TAPAS total scores between men and women and between individuals with and without overweight.

Similar to the findings from Bevan et al. [12], the present study found that the 10-item Chinese version of the TAPAS has a one-factor structure with good internal consistency. Moreover, the factor structure finding in the present study extends the exploratory factor analysis finding from Bevan et al. [12] to CFA. The one-factor structure also agrees with prior evidence using Rasch analysis on the TAPAS [63, 64]. The one-factor structure was further shown to be invariant across gender subgroups and weight status subgroups in the present study. Bevan et al. [12] found that the TAPAS was associated with weight stigma, and this association was replicated in the present findings (r = 0.68 with the WBIS, 0.66 with the WSSQ, and 0.33 with the PWSS). We extended the concurrent validity from weight stigma measures to psychological distress measures (i.e., r = 0.37 with the DASS-21 in the present findings). Consistent with the known-group validity findings from Bevan et al. [12], the present study found that women had significantly higher TAPAS scores than men (Cohen’s d = 0.24). The present study further extended the known-group findings in that individuals with overweight had significantly higher TAPAS scores than those without (Cohen’s d = 0.48).

Consistent with previous literature indicating that weight status is associated with the TAPAS score and that the TAPAS score could be associated with psychological distress, a mediation model was tested to examine whether the TAPAS score mediated the association between weight status and psychological distress. Applying Hayes’ PROCESS macro (Model 4), the present study demonstrated the mediator role of the TAPAS in the association between weight status and psychological distress. The addition of the mediation analysis strengthens the study and provides a base for further exploration. Because the TAPAS was found to be valid in the present Taiwanese sample, we compared the TAPAS scores of the present sample with Bevan et al.’s findings [12]. Bevan et al. [12] found that their Australian sample had a mean score of 19.63 (SD = 10.18), which is much lower than that of the present sample (mean [SD] score = 25.36 [9.54]). Moreover, a recent paper testing the TAPAS’s psychometric properties in a mainland Chinese sample reported a mean TAPAS score of 23.44 (SD = 8.70) [65], indicating similar TAPAS scores between the two Chinese populations (i.e., Taiwanese and mainland Chinese). In other words, Western people (e.g., the Australian sample from Bevan et al.’s study [12]) and Eastern people (e.g., the present Taiwanese sample and the mainland Chinese sample from Saffari et al.’s study [65]) may have different levels of physical activity avoidance. However, more evidence is needed to better understand how culture may impact physical activity avoidance.

The TAPAS provides healthcare professionals and government policy-makers with a new tool for understanding the reasons for non-participation in physical activity and sports across different population groups. It also offers insights into the psychosocial factors that might be addressed to promote physical activity. Previous studies have mostly focused on structural and environmental factors associated with physical activity engagement [7, 8, 9]. However, some individuals, especially those who are overweight, might have psychosocial reasons for avoiding physical activity and sports [11, 12]. Therefore, the TAPAS can capture the psychosocial factors underlying an individual’s tendency to avoid physical activity and sports. Accordingly, healthcare providers or government authorities can design appropriate programs that reduce psychosocial barriers (e.g., weight stigma or appearance concerns) to physical activity and use the TAPAS to evaluate a program’s effectiveness.

The present study has some limitations. First, the present study recruited Taiwanese youths aged between 20 and 40. Therefore, the findings might not generalize to other age groups (e.g., older people and adolescents). Second, the present study adopted a snowball sampling method, which decreased the representativeness of the sample; indeed, the demographic information was not comparable to that of university students in Taiwan. Third, we did not examine the test-retest reliability of the Chinese version of the TAPAS, so whether it possesses good reproducibility is unclear. Fourth, the responsiveness of the Chinese version of the TAPAS was not examined; future studies are needed to investigate whether it can detect meaningful changes in the tendency to avoid physical activity and sport. Fifth, given that this study applied a cross-sectional design, longitudinal invariance, an important psychometric property, could not be examined; future studies are needed to examine whether the TAPAS can be interpreted invariantly across time. Sixth, the present study did not test content validity after the TAPAS was translated into Chinese. Seventh, the measures assessed in the present study were self-reported; therefore, the findings, especially the concurrent validity findings, are subject to common method variance bias. Finally, the present study did not have information regarding the participants’ actual daily exercise and sports engagement. Although we found that people who were overweight had higher TAPAS scores, this cannot be interpreted as meaning that they completely avoided physical activity and sports in their daily lives. Therefore, future studies are needed to identify the association between the tendency to avoid physical activity and sport and actual participation in physical activity and sport.

Conclusions

The Chinese version of the TAPAS was found to be a valid and reliable instrument for assessing individuals’ avoidance of physical activity and sport, paving the way for future cross-cultural comparisons. Using the TAPAS, healthcare providers and government authorities in Taiwan can design programs to help Taiwanese youths reduce the impact of weight stigma and body image concerns on physical activity engagement. Specifically, the TAPAS can be used as an outcome measure to examine whether intervention programs are effective. Apart from evaluating program effectiveness, the TAPAS can be used to assess how psychosocial factors may contribute to the avoidance of physical activity and sport.

Data availability

The datasets generated and/or analyzed during the current study are not publicly available due to ethical restrictions but are available from the corresponding author, C-YL, upon reasonable request.

Janssen I, Leblanc AG. Systematic review of the health benefits of physical activity and fitness in school-aged children and youth. Int J Behav Nutr Phys Act. 2010;7:40. https://doi.org/10.1186/1479-5868-7-40 .

Warburton DE, Nicol CW, Bredin SS. Health benefits of physical activity: the evidence. CMAJ. 2006;174(6):801–9. https://doi.org/10.1503/cmaj.051351 .

Ford ES, Caspersen CJ. Sedentary behaviour and cardiovascular disease: a review of prospective studies. Int J Epidemiol. 2012;41(5):1338–53. https://doi.org/10.1093/ije/dys078 .

Katzmarzyk PT, Powell KE, Jakicic JM, Troiano RP, Piercy K, Tennant B; 2018 Physical Activity Guidelines Advisory Committee. Sedentary behavior and health: update from the 2018 Physical Activity Guidelines Advisory Committee. Med Sci Sports Exerc. 2019;51(6):1227–41. https://doi.org/10.1249/MSS.0000000000001935 .

Paluska SA, Schwenk TL. Physical activity and mental health: current concepts. Sports Med. 2000;29(3):167–80. https://doi.org/10.2165/00007256-200029030-00003 .

Rodriguez-Ayllon M, Cadenas-Sánchez C, Estévez-López F, Muñoz NE, Mora-Gonzalez J, Migueles JH, et al. Role of physical activity and sedentary behavior in the Mental Health of Preschoolers, children and adolescents: a systematic review and Meta-analysis. Sports Med. 2019;49(9):1383–410. https://doi.org/10.1007/s40279-019-01099-5 .

Humpel N, Owen N, Leslie E. Environmental factors associated with adults’ participation in physical activity: a review. Am J Prev Med. 2002;22(3):188–99. https://doi.org/10.1016/s0749-3797(01)00426-3 .

Lin YC, Lin CY, Fan CW, Liu CH, Ahorsu DK, Chen DR, et al. Changes of Health outcomes, Healthy Behaviors, Generalized Trust, and accessibility to Health Promotion Resources in Taiwan before and during COVID-19 pandemic: comparing 2011 and 2021 Taiwan Social Change Survey (TSCS) cohorts. Psychol Res Behav Manag. 2022;15:3379–89. https://doi.org/10.2147/PRBM.S386967 .

Salmon J, Owen N, Crawford D, Bauman A, Sallis JF. Physical activity and sedentary behavior: a population-based study of barriers, enjoyment, and preference. Health Psychol. 2003;22(2):178–88. https://doi.org/10.1037//0278-6133.22.2.178 .

Arzu D, Tuzun EH, Eker L. Perceived barriers to physical activity in university students. J Sports Sci Med. 2006;5(4):615–20.

Bevan N, O’Brien KS, Lin CY, Latner JD, Vandenberg B, Jeanes R, et al. The relationship between Weight Stigma, physical appearance concerns, and enjoyment and tendency to avoid physical activity and Sport. Int J Environ Res Public Health. 2021;18(19):9957. https://doi.org/10.3390/ijerph18199957 .

Bevan N, O’Brien KS, Latner JD, Lin CY, Vandenberg B, Jeanes R, et al. Weight Stigma and Avoidance of Physical Activity and Sport: development of a scale and establishment of correlates. Int J Environ Res Public Health. 2022;19(23):16370. https://doi.org/10.3390/ijerph192316370 .

Cheng OY, Yam CLY, Cheung NS, Lee PLP, Ngai MC, Lin CY. Extended theory of Planned Behavior on Eating and Physical Activity. Am J Health Behav. 2019;43(3):569–81. https://doi.org/10.5993/AJHB.43.3.11 .

Fung XCC, Pakpour AH, Wu YK, Fan CW, Lin CY, Tsang HWH. Psychosocial variables related to weight-related self-stigma in physical activity among young adults across Weight Status. Int J Environ Res Public Health. 2019;17(1):64. https://doi.org/10.3390/ijerph17010064 .

Liu W, Chen JS, Gan WY, Poon WC, Tung SEH, Lee LJ, et al. Associations of Problematic Internet Use, Weight-related Self-Stigma, and Nomophobia with physical activity: findings from Mainland China, Taiwan, and Malaysia. Int J Environ Res Public Health. 2022;19(19):12135. https://doi.org/10.3390/ijerph191912135 .

Saffari M, Chen JS, Wu HC, Fung XCC, Chang CC, Chang YL, et al. Effects of Weight-Related Self-Stigma and Smartphone Addiction on Female University Students’ physical activity levels. Int J Environ Res Public Health. 2022;19(5):2631. https://doi.org/10.3390/ijerph19052631 .

Nadhiroh SR, Nurmala I, Pramukti I, Tivany ST, Tyas LW, Zari AP, et al. Weight stigma in Indonesian young adults: validating the Indonesian versions of the weight self-stigma questionnaire and perceived weight stigma scale. Asian J Soc Health Behav. 2022;5(4):169. https://doi.org/10.4103/shb.shb_189_22 .

Jackson SE, Steptoe A. Association between perceived weight discrimination and physical activity: a population-based study among English middle-aged and older adults. BMJ Open. 2017;7(3):e014592. https://doi.org/10.1136/bmjopen-2016-014592 .

Vartanian LR, Novak SA. Internalized societal attitudes moderate the impact of weight stigma on avoidance of exercise. Obes (Silver Spring). 2011;19(4):757–62. https://doi.org/10.1038/oby.2010.234 .

Pickett AC, Cunningham GB. Body weight stigma in physical activity settings. Am J Health Stud. 2018;33(1):21–9. https://doi.org/10.47779/ajhs.2018.53 .

Wu YK, Berry DC. Impact of weight stigma on physiological and psychological health outcomes for overweight and obese adults: a systematic review. J Adv Nurs. 2018;74(5):1030–42. https://doi.org/10.1111/jan.13511 .

Grogan S. Body image: understanding body dissatisfaction in men, women and children. Routledge; 2021.

Ren Y, Barnhart WR, Cui T, Song J, Tang C, Cui S, et al. Exploring the longitudinal association between body dissatisfaction and body appreciation in Chinese adolescents: a four-wave, random intercept cross-lagged panel model. Body Image. 2023;46:32–40. https://doi.org/10.1016/j.bodyim.2023.04.011 .

Hartmann T, Zahner L, Pühse U, Schneider S, Puder JJ, Kriemler S. Physical activity, bodyweight, health and fear of negative evaluation in primary school children. Scand J Med Sci Sports. 2010;20(1):e27–34. https://doi.org/10.1111/j.1600-0838.2009.00888.x .

More KR, Phillips LA, Eisenberg Colman MH. Evaluating the potential roles of body dissatisfaction in exercise avoidance. Body Image. 2019;28:110–4. https://doi.org/10.1016/j.bodyim.2019.01.003 .

Tran BX, Dang KA, Le HT, Ha GH, Nguyen LH, Nguyen TH, Tran TH, Latkin CA, Ho CSH, Ho RCM. Global Evolution of Obesity Research in children and youths: setting priorities for interventions and policies. Obes Facts. 2019;12(2):137–49. https://doi.org/10.1159/000497121 .

Sabiston CM, Pila E, Vani M, Thogersen-Ntoumani C. Body image, physical activity, and sport: a scoping review. Psychol Sport Exerc. 2019;42:48–57. https://doi.org/10.1016/j.psychsport.2018.12.010 .

Clark LA, Watson D. Constructing validity: basic issues in objective scale development. Psychol Assess. 1995;7(3):309–19. https://doi.org/10.1037/1040-3590.7.3.309 .

Loewenthal KM, Lewis CA. An introduction to psychological tests and scales. 2nd ed. Psychology Press; 2018.

Smith GT, McCarthy DM. Methodological considerations in the refinement of clinical assessment instruments. Psychol Assess. 1995;7:300–8. https://doi.org/10.1037/1040-3590.7.3.300 .

Samara A, Nistrup A, Al-Rammah TY, Aro AR. Lack of facilities rather than sociocultural factors as the primary barrier to physical activity among female Saudi university students. Int J Women’s Health. 2015;7:279–86. https://doi.org/10.2147/IJWH.S80680 .

Kljajević V, Stanković M, Đorđević D, Trkulja-Petković D, Jovanović R, Plazibat K, et al. Physical activity and physical fitness among University Students-A Systematic Review. Int J Environ Res Public Health. 2021;19(1):158. https://doi.org/10.3390/ijerph19010158 .

Kline RB. Principles and practice of structural equation modeling. Guilford Press; 1998.

Maneesriwongul W, Dixon JK. Instrument translation process: a methods review. J Adv Nurs. 2004;48(2):175–86. https://doi.org/10.1111/j.1365-2648.2004.03185.x .

Durso LE, Latner JD. Understanding self-directed stigma: development of the weight bias internalization scale. Obes (Silver Spring). 2008;16(Suppl 2):S80–6. https://doi.org/10.1038/oby.2008.448 .

Kamolthip R, Fung XCC, Lin CY, Latner JD, O’Brien KS. Relationships among physical activity, Health-Related Quality of Life, and Weight Stigma in Children in Hong Kong. Am J Health Behav. 2021;45(5):828–42. https://doi.org/10.5993/AJHB.45.5.3 .

Chen H, Ye YD. Validation of the Weight Bias internalization scale for Mainland Chinese children and adolescents. Front Psychol. 2021;11:594949. https://doi.org/10.3389/fpsyg.2020.594949 .

Pakpour AH, Tsai MC, Lin YC, Strong C, Latner JD, Fung XCC, et al. Psychometric properties and measurement invariance of the Weight Self-Stigma Questionnaire and Weight Bias internalization scale in children and adolescents. Int J Clin Health Psychol. 2019;19(2):150–9. https://doi.org/10.1016/j.ijchp.2019.03.001 .

Lillis J, Luoma JB, Levin ME, Hayes SC. Measuring weight self-stigma: the weight self-stigma questionnaire. Obes (Silver Spring). 2010;18(5):971–6. https://doi.org/10.1038/oby.2009.353 .

Lin CY, Imani V, Cheung P, Pakpour AH. Psychometric testing on two weight stigma instruments in Iran: Weight Self-Stigma Questionnaire and Weight Bias internalized Scale. Eat Weight Disord. 2020;25(4):889–901. https://doi.org/10.1007/s40519-019-00699-4 .

Fan CW, Liu CH, Huang HH, Lin CY, Pakpour AH. Weight Stigma Model on Quality of Life among children in Hong Kong: a cross-sectional modeling study. Front Psychol. 2021;12:629786. https://doi.org/10.3389/fpsyg.2021.629786 .

Cheng MY, Wang SM, Lam YY, Luk HT, Man YC, Lin CY. The relationships between Weight Bias, Perceived Weight Stigma, eating Behavior, and psychological distress among undergraduate students in Hong Kong. J Nerv Ment Dis. 2018;206(9):705–10. https://doi.org/10.1097/NMD.0000000000000869 .

Gan WY, Tung SEH, Ruckwongpatr K, Ghavifekr S, Paratthakonkun C, Nurmala I, et al. Evaluation of two weight stigma scales in Malaysian university students: weight self-stigma questionnaire and perceived weight stigma scale. Eat Weight Disord. 2022;27(7):2595–604. https://doi.org/10.1007/s40519-022-01398-3 .

Huang PC, Lee CH, Griffiths MD, O’Brien KS, Lin YC, Gan WY, et al. Sequentially mediated effects of weight-related self-stigma and psychological distress in the association between perceived weight stigma and food addiction among Taiwanese university students: a cross-sectional study. J Eat Disord. 2022;10(1):177. https://doi.org/10.1186/s40337-022-00701-y .

Chen CY, Chen IH, O’Brien KS, Latner JD, Lin CY. Psychological distress and internet-related behaviors between schoolchildren with and without overweight during the COVID-19 outbreak. Int J Obes. 2021;45(3):677–86. https://doi.org/10.1038/s41366-021-00741-5 .

Chirawat P, Kamolthip R, Rattaprach R, Nadhiroh SR, Tung SEH, Gan WY, et al. Weight stigma among young adults in Thailand: reliability, validation, and Measurement Invariance of the thai-translated Weight Self Stigma Questionnaire and Perceived Weight Stigma Scale. Int J Environ Res Public Health. 2022;19(23):15868. https://doi.org/10.3390/ijerph192315868 .

Fung XCC, Siu AMH, Potenza MN, O’Brien KS, Latner JD, Chen CY, et al. Problematic use of internet-related activities and Perceived Weight Stigma in Schoolchildren: a longitudinal study across different epidemic periods of COVID-19 in China. Front Psychiatry. 2021;12:675839. https://doi.org/10.3389/fpsyt.2021.675839 .

Ruckwongpatr K, Saffari M, Fung XCC, O’Brien KS, Chang YL, Lin YC, et al. The mediation effect of perceived weight stigma in association between weight status and eating disturbances among university students: is there any gender difference? J Eat Disord. 2022;10(1):28. https://doi.org/10.1186/s40337-022-00552-7 .

Lin CY, Strong C, Latner JD, Lin YC, Tsai MC, Cheung P. Mediated effects of eating disturbances in the association of perceived weight stigma and emotional distress. Eat Weight Disord. 2020;25(2):509–18. https://doi.org/10.1007/s40519-019-00641-8 .

Xu P, Chen JS, Chang YL, Wang X, Jiang X, Griffiths MD, et al. Gender differences in the associations between Physical Activity, Smartphone Use, and Weight Stigma. Front Public Health. 2022;10:862829. https://doi.org/10.3389/fpubh.2022.862829 .

Lovibond SH, Lovibond PF. Manual for the depression anxiety stress scales. 2nd ed. Sydney: Psychology Foundation of Australia; 1995.

Sinclair SJ, Siefert CJ, Slavin-Mulford JM, Stein MB, Renna M, Blais MA. Psychometric evaluation and normative data for the depression, anxiety, and stress scales-21 (DASS-21) in a nonclinical sample of U.S. adults. Eval Health Prof. 2012;35(3):259–79. https://doi.org/10.1177/0163278711424282 .

Wang K, Shi HS, Geng FL, Zou LQ, Tan SP, Wang Y, et al. Cross-cultural validation of the Depression anxiety stress Scale-21 in China. Psychol Assess. 2016;28(5):e88–100. https://doi.org/10.1037/pas0000207 .

Hsieh TH, Lee JJ, Yu EW, Hu HY, Lin SY, Ho CY. Association between obesity and education level among the elderly in Taipei, Taiwan between 2013 and 2015: a cross-sectional study. Sci Rep. 2020;10(1):20285. https://doi.org/10.1038/s41598-020-77306-5 .

Nunnally JC. Psychometric theory. 2nd ed. New York: McGraw-Hill; 1978.

Lin CY, Broström A, Griffiths MD, Pakpour AH. Psychometric evaluation of the persian eHealth literacy scale (eHEALS) among elder iranians with Heart failure. Eval Health Prof. 2020;43(4):222–9. https://doi.org/10.1177/0163278719827997 .

Nejati B, Fan CW, Boone WJ, Griffiths MD, Lin CY, Pakpour AH. Validating the persian intuitive eating Scale-2 among breast Cancer survivors who are Overweight/Obese. Eval Health Prof. 2021;44(4):385–94. https://doi.org/10.1177/0163278720965688 .

Cheung GW, Rensvold RB. Evaluating goodness-of-fit indexes for testing measurement invariance. Struct Equ Model. 2002;9(2):233–55. https://doi.org/10.1207/S15328007SEM0902_5 .

Chen IH, Huang PC, Lin YC, Gan WY, Fan CW, Yang WC, et al. The Yale Food Addiction Scale 2.0 and the modified Yale Food Addiction Scale 2.0 in Taiwan: factor structure and concurrent validity. Front Psychiatry. 2022;13:1014447. https://doi.org/10.3389/fpsyt.2022.1014447 .

Rutkowski L, Svetina D. Assessing the hypothesis of measurement invariance in the context of large-scale international surveys. Educ Psychol Meas. 2014;74(1):31–57. https://doi.org/10.1177/0013164413498257 .

Alimoradi Z, Golboni F, Griffiths MD, Broström A, Lin CY, Pakpour AH. Weight-related stigma and psychological distress: a systematic review and meta-analysis. Clin Nutr. 2020;39(7):2001–13. https://doi.org/10.1016/j.clnu.2019.10.016 .

Fan CW, Chang YL, Huang PC, Fung XCC, Chen JK, Bevan N, et al. The tendency to avoid physical activity and sport scale (TAPAS): Rasch analysis with differential item functioning testing among a Chinese sample. BMC Psychol. 2023;11(1):369. https://doi.org/10.1186/s40359-023-01377-y .

Fan CW, Huang PC, Chen IH, Huang YT, Chen JS, Fung XCC, et al. Differential item functioning for the tendency of avoiding physical activity and Sport Scale across two subculture samples: Taiwanese and mainland Chinese university students. Heliyon. 2023;9(12):e22583. https://doi.org/10.1016/j.heliyon.2023.e22583 .

Saffari M, Chen IH, Huang PC, O’Brien KS, Hsieh YP, Chen JK, et al. Measurement invariance and psychometric evaluation of the tendency to avoid physical activity and Sport Scale (TAPAS) among Mainland Chinese University students. Psychol Res Behav Manag. 2023;16:3821–36. https://doi.org/10.2147/PRBM.S425804 .

Mokkink LB, Terwee CB, Knol DL, Stratford PW, Alonso J, Patrick DL, et al. The COSMIN checklist for evaluating the methodological quality of studies on measurement properties: a clarification of its content. BMC Med Res Methodol. 2010;10:22. https://doi.org/10.1186/1471-2288-10-22 .

Acknowledgements

This research was funded by Asia University Hospital, grant number AUH-11351003, the Ministry of Science and Technology, Taiwan (MOST 110-2410-H-006-115; MOST 111-2410-H-006-100), the National Science and Technology Council, Taiwan (NSTC 112-2410-H-006-089-SS2), and the Higher Education Sprout Project, Ministry of Education to the Headquarters of University Advancement at National Cheng Kung University (NCKU).

Author information

Authors and affiliations.

Department of Early Childhood and Family Education, National Taipei University of Education, Taipei, Taiwan

Yi-Ching Lin

Department of Medical Research, E-Da Hospital, Kaohsiung, Taiwan

Jung-Sheng Chen

School of Social Sciences, Monash University, Melbourne, 3800, Australia

Nadia Bevan & Kerry S. O’Brien

Department of Public Health, College of Medicine, National Cheng Kung University, Tainan, Taiwan

Carol Strong

Department of Pediatrics, College of Medicine, National Cheng Kung University Hospital, National Cheng Kung University, Tainan, Taiwan

Meng-Che Tsai

Department of Medical Humanities and Social Medicine, College of Medicine, National Cheng Kung University, Tainan, Taiwan

Department of Rehabilitation Sciences, Faculty of Health and Social Sciences, The Hong Kong Polytechnic University, Hung Hom, Hong Kong, China

Xavier C. C. Fung

Department of Social Work, Chinese University of Hong Kong, New Territories, Hong Kong, China

Ji-Kang Chen

Department of Healthcare Administration, Department of Family Medicine, Asia University, Taichung, 41354, Taiwan

I-Ching Lin

Department of Family Medicine, Asia University Hospital, Taichung, 41354, Taiwan

Department of Kinesiology, Health, and Leisure, Chienkuo Technology University, Changhua, 50094, Taiwan

Department of Psychology, University of Hawaii, Honolulu, HI, 96822, USA

Janet D. Latner

Institute of Allied Health Sciences, Departments of Occupational Therapy and Public Health, and Biostatistics Consulting Center, College of Medicine, National Cheng Kung University Hospital, National Cheng Kung University, 1 University Rd, Tainan, 701401, Taiwan

Chung-Ying Lin

University of Religions and Denominations, Qom, Iran

Biostatistics Consulting Center, College of Medicine, National Cheng Kung University Hospital, National Cheng Kung University, Tainan, Taiwan

Faculty of Nursing, Universitas Padjadjaran, Sumedang, Indonesia

Contributions

Conceptualization: Y-CL, J-SC, NB, KSO, I-CL, JDL, C-YL; Data curation: Y-CL, I-CL, C-YL; Formal analysis: C-YL; Funding acquisition: Y-CL, I-CL, C-YL; Investigation: Y-CL, J-SC, I-CL, C-YL; Methodology: CS, M-CT, XCCF, J-KC, C-YL; Project administration: C-YL; Resources: Y-CL, J-SC, I-CL, C-YL; Software: C-YL; Supervision: C-YL; Validation: NB, KSO, CS, M-CT, XCCF, J-KC, I-CL, JDL, C-YL; Visualization: C-YL; Roles/Writing - original draft: Y-CL, C-YL; Writing - review & editing: J-SC, NB, KSO, CS, M-CT, XCCF, J-KC, I-CL, JDL, C-YL. All authors have reviewed and agreed to their individual contribution(s) before submission.

Corresponding authors

Correspondence to I-Ching Lin or Chung-Ying Lin .

Ethics declarations

Ethics approval and consent to participate.

All participants received detailed information regarding the study purpose, and informed consent was obtained. The study protocol was approved by the National Cheng Kung University Human Research Ethics Committee (Approval No. NCKU HREC-E-110-486-2) and the National Cheng Kung University Hospital Institutional Review Board (IRB No. A-ER-111-445). Online informed consent was obtained from all participants before data collection, and study methods were carried out in accordance with the Declaration of Helsinki.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Cite this article.

Lin, YC., Chen, JS., Bevan, N. et al. Chinese version of the Tendency to Avoid Physical Activity and Sport (TAPAS) scale: testing unidimensionality, measurement invariance, concurrent validity, and known-group validity among Taiwanese youths. BMC Psychol 12 , 381 (2024). https://doi.org/10.1186/s40359-024-01870-y

Received : 24 December 2023

Accepted : 24 June 2024

Published : 08 July 2024

Keywords

  • Factor analysis
  • Physical activity
  • Reliability
  • Young adults

What Is Criterion Validity? | Definition & Examples

Published on September 2, 2022 by Kassiani Nikolopoulou . Revised on June 22, 2023.

Criterion validity (or criterion-related validity ) evaluates how accurately a test measures the outcome it was designed to measure. An outcome can be a disease, behavior, or performance. Concurrent validity measures tests and criterion variables in the present, while predictive validity measures those in the future.

To establish criterion validity, you need to compare your test results to criterion variables . Criterion variables are often referred to as a “gold standard” measurement. They comprise other tests that are widely accepted as valid measures of a construct .

For example, suppose a researcher wants to validate a college entry exam, using students’ grade point average (GPA) as the criterion variable. The researcher can then compare the college entry exam scores of 100 students to their GPA after one semester in college. If the two sets of scores correlate strongly, then the college entry exam has criterion validity.

When your test agrees with the criterion variable, it has high criterion validity. However, criterion variables can be difficult to find.

Table of contents

  • What is criterion validity?
  • Types of criterion validity
  • Criterion validity example
  • How to measure criterion validity
  • Other interesting articles
  • Frequently asked questions about criterion validity

Criterion validity shows you how well a test correlates with an established standard of comparison called a criterion.

A measurement instrument, like a questionnaire, has criterion validity if its results converge with those of some other, accepted instrument, commonly called a “gold standard.”

A gold standard (or criterion variable) measures:

  • The same construct
  • Conceptually relevant constructs
  • Conceptually relevant behavior or performance

When a gold standard exists, evaluating criterion validity is a straightforward process. For example, you can compare a new questionnaire with an established one. In medical research, you can compare test scores with clinical assessments.

However, in many cases, there is no existing gold standard. If you want to measure pain, for example, there is no objective standard to do so. You must rely on what respondents tell you. In such cases, you can’t achieve criterion validity.

It’s important to keep in mind that criterion validity is only as good as the validity of the gold standard or reference measure. If the reference measure suffers from some sort of research bias, it can impact an otherwise valid measure. In other words, a valid measure tested against a biased gold standard may fail to achieve criterion validity.

Similarly, two biased measures will confirm one another. Thus, criterion validity is no guarantee that a measure is in fact valid. It’s best used in tandem with the other types of validity.

There are two types of criterion validity. Which type you use depends on the time at which the two measures (the criterion and your test) are obtained.

  • Concurrent validity is used when the scores of a test and the criterion variables are obtained at the same time.
  • Predictive validity is used when the criterion variables are measured after the scores of the test are obtained.

Concurrent validity

Concurrent validity is demonstrated when a new test correlates with another test that is already considered valid, called the criterion test. A high correlation between the new test and the criterion indicates concurrent validity.

Establishing concurrent validity is particularly important when a new measure is created that claims to be better in some way than its predecessors: more objective, faster, cheaper, etc.

Remember that this form of validity can only be used if another criterion or validated instrument already exists.

Predictive validity

Predictive validity is demonstrated when a test can predict future performance. In other words, the test must correlate with a variable that can only be assessed at some point in the future, after the test has been administered.

For predictive criterion validity, researchers often examine how the results of a test predict a relevant future outcome. For example, the results of an IQ test can be used to predict future educational achievement. The outcome is, by design, assessed at some point in the future.

A high correlation provides evidence of predictive validity. It indicates that a test can correctly predict something that you hypothesize it should.
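To make this concrete, here is a minimal sketch in Python of how predictive validity might be checked. All scores and variable names below are hypothetical, invented purely for illustration; the calculation uses the standard Pearson correlation from SciPy.

    import numpy as np
    from scipy import stats

    # Hypothetical data: entry-exam scores and first-semester GPAs
    # for the same ten students (invented for illustration).
    exam_scores = np.array([52, 61, 58, 75, 80, 66, 90, 71, 85, 63])
    first_semester_gpa = np.array([2.1, 2.6, 2.4, 3.0, 3.3, 2.7, 3.8, 2.9, 3.5, 2.5])

    # Pearson's r between the test (exam) and the future criterion (GPA).
    r, p_value = stats.pearsonr(exam_scores, first_semester_gpa)
    print(f"r = {r:.2f}, p = {p_value:.4f}")

    # A strong positive r is evidence of predictive validity: the exam
    # predicts an outcome that can only be measured later in time.

Note that the criterion (GPA) can only be collected after the test has been administered, which is what distinguishes this from a concurrent design.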

Criterion validity is often used when a researcher wishes to replace an established test with a different version of the same test, particularly one that is more objective, shorter, or cheaper.

For example, suppose you develop a shorter scale to replace a lengthy, well-established procrastination questionnaire. Although the original test is widely accepted as a valid measure of procrastination, it is very long and takes a lot of time to complete. As a result, many students fill it in without carefully considering their answers. If the new, shorter scale correlates strongly with the original, it can be used in its place.

Criterion validity is assessed in two ways:

  • By statistically testing a new measurement technique against an independent criterion or standard to establish concurrent validity
  • By statistically testing it against future performance or outcomes to establish predictive validity

The measure to be validated, such as a test, should be correlated with a measure considered to be a well-established indication of the construct under study. This is your criterion variable.

Correlations between the scores on the test and the criterion variable are calculated using a correlation coefficient, such as Pearson’s r. A correlation coefficient expresses the strength of the relationship between two variables in a single value between −1 and +1.

Correlation coefficient values can be interpreted as follows:

  • r = 1: There is a perfect positive correlation.
  • r = 0: There is no correlation at all.
  • r = −1: There is a perfect negative correlation.

You can automatically calculate Pearson’s r in Excel, R, SPSS, or other statistical software.

A positive correlation between a test and the criterion variable provides evidence that the test is valid. No correlation, or a negative correlation, indicates that the test and the criterion variable do not measure the same concept.

For example, suppose you have developed a new scale and want to establish its concurrent validity against an existing, well-validated scale that measures the same construct. You give the two scales to the same sample of respondents. The extent of agreement between the results of the two scales is expressed through a correlation coefficient.

You calculate the correlation coefficient between the results of the two tests and find out that your scale correlates with the existing scale ( r = 0.80). This value shows that there is a strong positive correlation between the two scales.
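As a minimal sketch of this example, the same calculation can be done by hand in Python from the definition of Pearson’s r: the covariance of the two score sets divided by the product of their standard deviations. The scales and scores below are hypothetical:

    import numpy as np

    # Hypothetical scores from the same respondents on the new scale
    # and on an established scale (invented data for illustration).
    new_scale = np.array([12, 18, 25, 9, 30, 22, 15, 27, 11, 20])
    established_scale = np.array([14, 20, 27, 11, 33, 21, 18, 29, 10, 24])

    # Pearson's r from its definition: covariance divided by the
    # product of the two sample standard deviations.
    covariance = np.cov(new_scale, established_scale)[0, 1]
    r = covariance / (new_scale.std(ddof=1) * established_scale.std(ddof=1))
    print(f"r = {r:.2f}")

    # A strong positive r (e.g., around 0.80 or above) indicates good
    # agreement between the scales, supporting concurrent validity.

Because both scales are administered to the same respondents at the same time, this is a concurrent design; the same calculation with a later-measured criterion would assess predictive validity.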

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Degrees of freedom
  • Null hypothesis
  • Discourse analysis
  • Control groups
  • Mixed methods research
  • Non-probability sampling
  • Quantitative research
  • Ecological validity

Research bias

  • Rosenthal effect
  • Implicit bias
  • Cognitive bias
  • Selection bias
  • Negativity bias
  • Status quo bias

Criterion validity and construct validity are both types of measurement validity. In other words, they both show you how accurately a method measures something.

While construct validity is the degree to which a test or other measurement method measures what it claims to measure, criterion validity is the degree to which a test can predictively (in the future) or concurrently (in the present) measure something.

Construct validity is often considered the overarching type of measurement validity, because it covers all of the other types. You need to have face validity, content validity, and criterion validity in order to achieve construct validity.

When designing or evaluating a measure, construct validity helps you ensure you’re actually measuring the construct you’re interested in. If you don’t have construct validity, you may inadvertently measure unrelated or distinct constructs and lose precision in your research.


Reliability and validity are both about how well a method measures something:

  • Reliability refers to the consistency of a measure (whether the results can be reproduced under the same conditions).
  • Validity refers to the accuracy of a measure (whether the results really do represent what they are supposed to measure).

If you are doing experimental research, you also have to consider the internal and external validity of your experiment.

Face validity is important because it’s a simple first step to measuring the overall validity of a test or technique. It’s a relatively intuitive, quick, and easy way to start checking whether a new measure seems useful at first glance.

Good face validity means that anyone who reviews your measure says that it seems to be measuring what it’s supposed to. With poor face validity, someone reviewing your measure may be left confused about what you’re measuring and why you’re using this method.

Cite this Scribbr article


Nikolopoulou, K. (2023, June 22). What Is Criterion Validity? | Definition & Examples. Scribbr. Retrieved July 8, 2024, from https://www.scribbr.com/methodology/criterion-validity/


