to measure.

      Reliability and validity are related in a variety of ways (discussed in depth in subsequent chapters). Even on the surface, though, it makes intuitive sense that validity is probably the first order of business when designing an assessment; if a test doesn’t measure what it is supposed to measure, it is of little use. However, even if a test is designed with great attention to its validity, poor reliability can render that validity moot.

      An assessment’s validity can be limited or mediated by its reliability (Bonner, 2013; Parkes, 2013). For example, imagine you were trying to develop an instrument that measures weight. This is a pretty straightforward construct, in that weight is defined as the amount of gravitational pull on an object or the force on an object due to gravity. With this clear goal in mind, you create your own version of a scale, but unfortunately, it gives different measurements each time an object is placed on it. You put an object on it, and it indicates that the object weighs one pound. You take it off and put it on again, and it reads one and a half pounds. The third time, it reads three-quarters of a pound, and so on. Even though the measurement device was focused on weight, the score derived from the measurement process is so inaccurate (imprecise or unreliable) that it cannot be a true measure of weight. Hence, your scale cannot produce valid measures of weight even though you designed it for that specific purpose. Its reliability has limited its validity. This is probably the reason that reliability seems to receive the majority of the attention in discussions of CA. If a test is not reliable, its validity is negated.

      For CAs to take their rightful place in the assessment triad depicted in the figure, they must be both valid and reliable. This is not a new or shocking idea; however, reliability and validity for CAs must be thought of differently from how they are conceived for large-scale assessments.

      Large-scale assessments are so different from CAs in structure and function that the paradigms for validity and reliability developed for large-scale assessments do not apply well to CAs. Indeed, some argue that CAs are so distinct that they should be held to a different standard altogether. For example, Jay Parkes (2013) notes, “There have also been those who argue that CAs … have such strong validity that we should tolerate low reliability” (p. 113).

      While I believe this is a defensible perspective, in this book, I take the position that we should not simply ignore psychometric concepts related to validity and reliability. Rather, we should hold CAs accountable to high standards relative to both validity and reliability, but educators should reconceptualize the standards and psychometric constructs on which these standards are based in order to fit the unique environment of the classroom. I also believe that technical advances in CA have been hindered because of the unquestioned adherence to the measurement paradigms developed for large-scale assessments.

      Even though validity is the first order of business when designing an assessment, I begin with a discussion of reliability because of the emphasis it receives in the literature on CAs. At its core, reliability refers to the accuracy of a measurement, where accuracy refers to how much or how little error exists in an individual score from an assessment. In practice, though, large-scale assessments represent reliability in terms of scores for groups of students as opposed to individual students. (For ease of discussion, I will use the terms large-scale and traditional as synonyms throughout the text.) As we shall see in chapter 4 (page 83), the conceptual formula for reliability in the large-scale assessment paradigm is based on differences in scores across multiple administrations of a test. Consider table I.1 to illustrate the traditional concept of reliability.

      The column Initial Administration reports the scores of ten students for the first administration of a specific test. (For ease of discussion, the scores are listed in rank order.) The next column, Second Administration (A), together with the first, represents a pattern of scores that indicates relatively high reliability for the test in question.

      To understand this pattern, one must imagine that the second administration happened right after the initial administration, but somehow students forgot how they answered the items the first time. In fact, it’s best to imagine that students forgot they took the test in the first place. Although this is impossible in real life, it is a basic theoretical underpinning of the traditional concept of reliability—the pattern of scores that would occur across students over multiple replications of the same assessment. Lee J. Cronbach and Richard J. Shavelson (2004) explain this unusual assumption in the following way:

      If, hypothetically, we could apply the instrument twice and on the second occasion have the person unchanged and without memory of his first experience, then the consistency of the two identical measurements would indicate the uncertainty due to measurement error. (p. 394)

      If a test is reliable, one would expect students to get close to the same scores on the second administration of the test as they did on the first. As depicted in Second Administration (A), this is basically the case. Even though only two students received exactly the same score, all scores in the second administration were very close to their counterparts in the first.

      If a test is unreliable, however, one would expect students to receive scores on the second administration that are substantially different from those they received on the first. This is depicted in the column Second Administration (B). Notice that in this hypothetical administration, students’ scores vary greatly from their scores on the first.

      Table I.1 demonstrates, at a conceptual level, the general process of determining reliability from a traditional perspective. If the pattern of variation in scores among students is the same from one administration of a test to another, the test is deemed reliable. If the pattern of variation changes from administration to administration, the test is not considered reliable. Of course, multiple administrations of the same test to the same students, with the students remembering none of their previous answers, don’t occur in real life. Consequently, measurement experts (called psychometricians) have developed formulas that provide reliability estimates from a single administration of a test. I discuss this in chapter 3 (page 59).
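
      To make the replication idea concrete, the following is a minimal sketch in Python (the scores are hypothetical, not the actual values in table I.1) that treats the correlation between two administrations as a conceptual test-retest reliability estimate:

```python
# Hypothetical scores for ten students, mimicking the patterns the text
# describes: administration (A) stays close to the initial administration,
# while administration (B) departs from it substantially.
initial  = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]
second_a = [93, 91, 84, 81, 74, 71, 66, 59, 56, 50]
second_b = [60, 85, 95, 55, 90, 50, 80, 75, 65, 70]

def pearson_r(x, y):
    """Pearson correlation between two sets of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

print(pearson_r(initial, second_a))  # near 1.0: stable pattern, high reliability
print(pearson_r(initial, second_b))  # about 0.14: unstable pattern, low reliability
```

      The correlation summarizes whether students keep roughly the same relative standing across administrations, which is exactly the pattern stability the table is meant to illustrate.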

      Next, we consider the equation for a single score, as well as the reliability coefficient.

      While the large-scale paradigm considers reliability from the perspective of a pattern of scores for groups of students across multiple test administrations, it is also based on the assumption that scores for individual students contain some amount of error. Error may be due to careless mistakes on the part of students, on the part of those administering and scoring the test, or both. Such error is referred to as random measurement error and is an anticipated part of any assessment (Frisbie, 1988). Random error can either increase or decrease the score a student receives (referred to as the observed score). To represent this, the conceptual equation for an individual score within the traditional paradigm is:

      Observed score = true score + error score

      The true score is the score a test taker would receive if there were no random errors from the test or the test taker. In effect, the equation implies that when anyone receives a score on any type of assessment, there is no guarantee that the observed score the test taker receives is the true score. The true score might be slightly or substantially higher or lower than the observed score.
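
      The following small Python sketch (not from the book; all values are hypothetical) simulates this equation by adding random error to a set of true scores. It also illustrates the classical test theory result that reliability can be expressed as the proportion of observed-score variance attributable to true-score variance:

```python
import random

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical true scores for a group of students.
true_scores = [55, 60, 65, 70, 75, 80, 85, 90]

# Observed score = true score + error score: random measurement error
# can push each observed score above or below the true score.
error_sd = 5.0
observed = [t + random.gauss(0, error_sd) for t in true_scores]

for t, o in zip(true_scores, observed):
    print(f"true {t}, observed {o:.1f}")

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Classical test theory: reliability is true-score variance divided by
# observed-score variance (true-score variance plus error variance).
reliability = variance(true_scores) / (variance(true_scores) + error_sd ** 2)
print(f"theoretical reliability: {reliability:.2f}")
```

      As the error variance grows relative to the spread of true scores, the ratio shrinks toward zero, which is the formal sense in which error degrades reliability.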

      The reliability of an assessment from the traditional perspective is commonly expressed as an index of reliability—also referred to as the reliability coefficient.