Explain the concept of reliability. Explain types of reliability and methods used to calculate each type.
Course: Introduction to Educational Statistics
Course Code 8614
Topics
Concept of Reliability
- Write down the types of reliability and explain them.
- Inter-rater reliability, test-retest reliability, parallel forms reliability, internal consistency reliability
Answer:
The term reliability in psychological research refers to the consistency of a research study or measuring test. For example, if a person weighs themselves several times during a day, they would expect to see similar readings. Scales that measured weight differently each time would be of little use. The same applies to a tape measure that gives a different reading in inches each time it is used; it would not be considered reliable.
If findings from research are replicated consistently, they are reliable. A correlation coefficient can be used to assess the degree of reliability. If a test is reliable, it should show a high positive correlation. Of course, it is unlikely the exact same results will be obtained each time, as participants and situations vary, but a strong positive correlation between the results of the same test indicates reliability. There are two broad types of reliability: internal and external reliability.
Internal reliability assesses the consistency of results across items within a test. External reliability refers to the extent to which a measure varies from one use to another.
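To make "reliability as correlation" concrete, here is a minimal sketch in Python (with invented scores, purely for illustration) that correlates two administrations of the same test; a strong positive coefficient would indicate a reliable measure.

```python
# Minimal sketch: reliability expressed as the correlation between two
# administrations of the same test. Scores are invented for illustration.
import numpy as np

first_administration = np.array([72, 85, 60, 90, 78, 66, 81, 74])
second_administration = np.array([70, 88, 63, 87, 80, 64, 83, 71])

# Pearson correlation coefficient: values near +1 suggest consistent
# (reliable) measurement; values near 0 suggest inconsistency.
r = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Reliability estimate (Pearson r): {r:.2f}")
```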
Assessing Reliability
Split-half method
The split-half method assesses the internal consistency of a test, such as psychometric tests and questionnaires. That is, it measures the extent to which all parts of the test contribute equally to what is being measured. This is done by comparing the results of one half of a test with the results of the other half. A test can be split in half in several ways, e.g. first half and second half, or by odd- and even-numbered items. If the two halves of the test provide similar results, this suggests that the test has internal reliability. The method can also be used to improve a test's reliability: for example, any items on separate halves of a test that have a low correlation (e.g. r = .25) should either be removed or rewritten. The split-half method is a quick and easy way to establish reliability. However, it can only be effective with large questionnaires in which all questions measure the same construct, so it would not be appropriate for tests that measure different constructs.
For example, the Minnesota Multiphasic Personality Inventory has sub-scales measuring different behaviors such as depression, schizophrenia, and social introversion. Therefore, the split-half method would not be an appropriate way to assess the reliability of this personality test as a whole.
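To show the mechanics, here is a minimal sketch of the split-half method in Python, assuming a small invented data set of item responses. The half-test correlation is then stepped up with the Spearman-Brown formula, the conventional adjustment for estimating full-length reliability from a half-length test (not mentioned above, but standard practice).

```python
# Minimal sketch of the split-half method on an invented data set:
# each row is a respondent, each column an item scored 1-5.
import numpy as np

responses = np.array([
    [4, 5, 4, 4, 5, 4, 5, 4],
    [2, 1, 2, 2, 1, 2, 1, 2],
    [3, 3, 4, 3, 3, 4, 3, 3],
    [5, 4, 5, 5, 4, 5, 5, 4],
    [1, 2, 1, 1, 2, 1, 2, 1],
])

# Split the test by odd- and even-numbered items and total each half.
odd_half = responses[:, 0::2].sum(axis=1)
even_half = responses[:, 1::2].sum(axis=1)

# Correlate the two half-test scores.
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# The Spearman-Brown formula adjusts the half-test correlation upward
# to estimate the reliability of the full-length test.
full_test_reliability = (2 * r_half) / (1 + r_half)
print(f"Half-test r = {r_half:.2f}, "
      f"Spearman-Brown estimate = {full_test_reliability:.2f}")
```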
Types of reliability and methods used to calculate each type:
Reliability is a measure of the consistency of a metric or a method. Every metric or method we use, including things like methods for uncovering usability problems in an interface and expert judgment, must be assessed for reliability. In fact, before you can establish validity, you need to establish reliability. Here are the four most common ways of measuring reliability for any empirical method or metric:
- inter-rater reliability
- test-retest reliability
- parallel forms reliability
- internal consistency reliability
Because reliability testing has its roots in educational measurement (think standardized tests), many of the terms used to assess reliability come from the testing lexicon. But don't let bad memories of testing lead you to dismiss their relevance to measuring the customer experience.
Inter-Rater Reliability
The extent to which raters or observers respond the same way to a given phenomenon is one measure of reliability. Where there’s judgment there’s disagreement. Even highly trained experts disagree among themselves when observing the same phenomenon. Kappa and the correlation coefficient are two common measures of inter-rater reliability. Some examples include:
- Evaluators identifying interface problems
- Experts rating the severity of a problem
For example, we found that the average inter-rater reliability of usability experts rating the severity of usability problems was r = .52. You can also measure intra-rater reliability, whereby you correlate multiple scores from one observer. In that same study, we found that the average intra-rater reliability when judging problem severity was r = .58 (which is still generally considered low).
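As an illustration of the kappa statistic mentioned above, here is a minimal sketch in Python, assuming two hypothetical raters assigning severity categories to the same set of usability problems (all ratings invented):

```python
# Minimal sketch of Cohen's kappa for two raters categorizing the same
# set of problems. Ratings are invented for illustration.
from collections import Counter

rater_a = ["minor", "major", "major", "minor", "critical", "major", "minor", "minor"]
rater_b = ["minor", "major", "minor", "minor", "critical", "major", "major", "minor"]

n = len(rater_a)

# Observed agreement: proportion of problems where the raters match.
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Expected chance agreement, from each rater's marginal proportions.
counts_a, counts_b = Counter(rater_a), Counter(rater_b)
categories = set(rater_a) | set(rater_b)
p_expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)

# Kappa corrects observed agreement for the agreement expected by chance.
kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"Observed agreement = {p_observed:.2f}, Cohen's kappa = {kappa:.2f}")
```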
Test-Retest Reliability
Do customers provide the same set of responses when nothing about their experience or their attitudes has changed? You don't want your measurement system to fluctuate when all other things are static. Have a set of participants answer a set of questions (or perform a set of tasks). Later (by at least a few days, typically), have them answer the same questions again. When you correlate the two sets of measures, look for very high correlations (r > 0.7) to establish retest reliability. As you can see, there's some effort and planning involved: you need participants to agree to answer the same questions twice. Few questionnaires measure test-retest reliability (mostly because of the logistics), but with the proliferation of online research, we should encourage more of this type of measure.
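Here is a minimal sketch of a test-retest check in Python, assuming two invented waves of responses keyed by a hypothetical participant ID so that each person's first and second answers can be paired before correlating:

```python
# Minimal sketch of a test-retest correlation on two invented waves of
# responses collected a few days apart.
import pandas as pd

wave_1 = pd.DataFrame({"participant": [1, 2, 3, 4, 5, 6],
                       "score":       [68, 82, 75, 90, 59, 77]})
wave_2 = pd.DataFrame({"participant": [1, 2, 3, 4, 5, 6],
                       "score":       [70, 80, 78, 88, 61, 74]})

# Align the two waves on participant ID so each row pairs a person's
# first and second responses.
paired = wave_1.merge(wave_2, on="participant", suffixes=("_t1", "_t2"))

# Correlate the two administrations; look for r above roughly 0.7.
r = paired["score_t1"].corr(paired["score_t2"])
verdict = "acceptable" if r > 0.7 else "questionable"
print(f"Test-retest reliability: r = {r:.2f} ({verdict})")
```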
Parallel Forms Reliability
Getting the same or very similar results from slight variations on the question or evaluation method also establishes reliability. One way to achieve this is to have, say, 20 items that measure one construct (satisfaction, loyalty, usability) and to administer 10 of the items to one group and the other 10 to another group, and then correlate the results. You're looking for high correlations and no systematic difference in scores between the groups.
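Here is a minimal sketch of a parallel-forms check in Python. Note that it assumes a common variant of the design in which both 10-item forms are completed by the same simulated participants, so the two sets of scores can be correlated directly; all data are invented.

```python
# Minimal sketch of a parallel-forms check on simulated data.
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulate a latent trait and 20 noisy items that each reflect it, so
# two forms built from those items should correlate if they measure
# the same construct.
trait = rng.normal(0, 1, size=30)                    # 30 participants
items = trait[:, None] + rng.normal(0, 1, size=(30, 20))

# Randomly assign the 20 items to two 10-item forms and total each form.
item_order = rng.permutation(20)
form_a = items[:, item_order[:10]].sum(axis=1)
form_b = items[:, item_order[10:]].sum(axis=1)

# High correlation and similar mean scores suggest the forms are parallel.
r = np.corrcoef(form_a, form_b)[0, 1]
print(f"Form A mean = {form_a.mean():.1f}, "
      f"Form B mean = {form_b.mean():.1f}, r = {r:.2f}")
```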
Internal Consistency Reliability
This is by far the most commonly used measure of reliability in applied settings. It's popular because it's the easiest to compute using software: it requires only one sample of data to estimate the internal consistency reliability. This measure of reliability is described most often using Cronbach's alpha (sometimes called coefficient alpha). It measures how consistently participants respond to one set of items. You can think of it as a sort of average of the correlations between items. Cronbach's alpha ranges from 0.0 to 1.0 (a negative alpha means you probably need to reverse some items). Since the late 1960s, the minimally acceptable measure of reliability has been 0.70; in practice, though, for high-stakes questionnaires, aim for greater than 0.90. For example, the SUS has a Cronbach's alpha of 0.92.
The more items you have, the more internally reliable the instrument, so to increase internal consistency reliability, you would add items to your questionnaire. Since there's often a strong need to have few items, however, internal reliability usually suffers. When you have only a few items, and therefore usually lower internal reliability, having a larger sample size helps offset the loss in reliability.
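Here is a minimal sketch of Cronbach's alpha computed directly from its standard formula, using a small invented questionnaire data set (rows are respondents, columns are items):

```python
# Minimal sketch of Cronbach's alpha on an invented 5-item questionnaire.
import numpy as np

items = np.array([
    [4, 5, 4, 4, 5],
    [2, 1, 2, 2, 1],
    [3, 3, 4, 3, 3],
    [5, 4, 5, 5, 4],
    [1, 2, 1, 1, 2],
    [4, 4, 3, 4, 4],
])

k = items.shape[1]                               # number of items
item_variances = items.var(axis=0, ddof=1)       # variance of each item
total_variance = items.sum(axis=1).var(ddof=1)   # variance of total scores

# alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")  # aim for > 0.70 (> 0.90 for high stakes)
```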
In Summary
Here are a few things to keep in mind about measuring reliability:
- Reliability is the consistency of a measure or method over time.
- Reliability is necessary but not sufficient for establishing a method or metric as valid.
- There isn't a single measure of reliability; instead, there are four common measures of consistent responses.
- You'll want to use as many measures of reliability as you can, although in most cases one is sufficient to understand the reliability of your measurement system.
- Even if you can't collect reliability data, be aware of how low reliability may affect the validity of your measures, and ultimately the veracity of your decisions.