Bertram C. Bruce
University of Illinois, Champaign, IL 61820
Andee Rubin
TERC, Cambridge, MA 02140
A central goal of this book is to contribute to an understanding of educational change. To that end we have examined how QUILL was realized in different ways in diverse settings. In the preceding chapters on purpose, revision, and network communication we studied the details of the process, because we wanted to understand how the realizations reflected the unique characteristics of QUILL, as well as the particular classrooms in which QUILL was used.
Nevertheless, the general form of the realization process occurs in the introduction of any innovation, whatever the domain. The parameters, constraints, and issues related to change are in large part the same across innovations, even those built around older technologies such as books, paper and pencil, or the blackboard. What should be of greatest interest to those interested in educational change is not the fine details of the technology, but rather, the ways in which the goals, presuppositions, and attitudes of the developers of the innovation interact with those of the people who use it.
We explore in this chapter some implications of the QUILL study for the broader field of educational evaluation. We are concerned with questions such as: Why do educational systems resist change or change in unexpected ways? What is the role of innovations in encouraging change? How can we best analyze the process of change when it does occur? What are the implications of this view for the evaluation of innovations?
Implementing an innovation means introducing something new into an existing system. If the innovation is significant, it will trigger changes in the system, some of which may be easily predictable and others of which may be surprising. People involved with the system naturally want to know what those changes may be and what they mean. The notion of change that is implied by an innovation thus calls for an evaluation. We are led to a number of questions about the innovation, the two most basic being these: (a) How well does it work? and (b) How can it be improved?
In answering these questions, an evaluator must keep in mind the purpose and audience of an evaluation. Just as a student in a classroom has a variety of audiences and purposes, an evaluator has several possible audiences and purposes for an evaluation. Audiences can include teachers, administrators, parents, researchers, and developers. Purposes include helping people make informed decisions about adoption of an innovation, helping them modify an innovation, and helping them understand how an innovation might be used in a new setting. In thinking about the methods and goals of evaluation we need to take these varieties of purposes and audiences into account.
In the next section we describe summative evaluation, which addresses the first of these questions, and formative evaluation, which addresses the second. We use the evaluations of QUILL to make these descriptions more concrete. The types of evaluation discussed are useful but have many limitations. To extend their scope, researchers have proposed a variety of alternative evaluation methods, some of which we discuss later in the chapter. Each of these methods makes a contribution to the study of educational innovation and change. But even these methods fail to answer a basic question for a potential user: How can the innovation be re-created in one's own setting? This leads us to raise a fundamental issue about the nature of evaluation: What is the "it" being evaluated? An exploration of that issue leads us to call for a new type of evaluation, situated evaluation.
Walberg and Haertel (1990, p. xvii) defined evaluation, and two major types of evaluation, as follows:
The term evaluation refers to a careful, rigorous examination of an educational curriculum, program, institution, organizational variable, or policy. The primary purpose of this examination is to learn about the particular entity studied, although more generalizable knowledge may also be obtained. The focus is on understanding and improving the thing evaluated (formative evaluation), on summarizing, describing, or judging its planned and unplanned outcomes (summative evaluation), or both.
Summative evaluation focuses on the impact an innovation has in terms of predefined measures, such as scores on a writing sample. It might, for example, report a substantial increase in the writing scores of students who used the innovation. As such, it addresses the potential user's need to decide whether to adopt an innovation. Formative evaluation focuses on the innovation directly, and addresses the developer's need to learn how to improve an innovation. It might detail comments from users and list changes to be made to the software.
In this section we describe the summative evaluation of QUILL, a rather standard, quantitative, objective, outcome-based assessment of changes in student writing attributable to the use of QUILL. We then discuss the formative evaluation of QUILL. These evaluations are described and then compared in some detail, both to complete our story of the QUILL experience, and to establish a baseline from which to discuss alternative methods of evaluation.
Summative evaluations frequently involve any of a wide range of quantitative methods, but they are not limited to these. In this section we focus on the quantitative methods used in the summative evaluation of QUILL, as an example of standard summative evaluation.
QUILL Field Test. A formal summative evaluation was carried out on QUILL during the 1982-1983 academic year (Bruce & Rubin, 1984). The purpose was to determine whether QUILL could be certified as effective for teaching writing and, if so, for which grade levels and which types of writing. This evaluation, done at the NETWORK, was based on data collected in a field test in Massachusetts (a rural site), Connecticut (an urban site), and New Jersey (a suburban site). The classrooms ranged from third through sixth grade. At each site there were two experimental (QUILL) classrooms and two control classrooms. The data were samples of three types of student writing: exposition (e.g., a description of the procedure to follow for a fire drill), persuasion (e.g., a letter to the principal about obtaining more computers), and expression (e.g., a story about a picture presented to the students).2
The goal of the evaluation was to determine for what grade levels and types of writing QUILL was effective. We judged effectiveness by assessing student writing to see whether there was a significant improvement after using QUILL for 6 months. Naturally, one would expect some improvement in student writing over the course of a school year with or without QUILL. For that reason, we needed to find writing improvement for students in QUILL classrooms that was substantially greater than we could reasonably expect to find in classrooms without QUILL. To do that, we also assessed writing for students in six control classrooms, each matching a QUILL class by grade level and in the same site.
We scored pretest and posttest writing samples from QUILL classes and the matched control classes using a primary-trait scoring system. This system measures the effectiveness of writing in terms of the primary goal or trait of the writing assignment (Mullis, 1980). Each writing sample was scored by two people using a scale from 1 to 4, on the basis of its success in achieving the primary goal of the task. For example, a persuasive piece should include reasons for the argument it makes. If no reasons were given, the piece would score 1; if several well-articulated reasons were given, it would score 4. Raters were trained for 2 days. Disagreements in the ratings, which occurred for approximately 20% of the samples, were resolved by a third rater. In every case of disagreement, the first two ratings differed by no more than one unit, and the third rating matched one of the first two. Raters did not know the class from which any piece of writing came (see Bruce & Rubin, 1984, for more details).
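The scoring rule described above can be sketched in a few lines of Python. This is an illustration only, not the procedure used in the field test: the function name `resolve_score` is hypothetical, and the sketch assumes that the third rater's score was taken as the final score whenever the first two raters disagreed, which the description implies but does not state outright.

```python
from typing import Optional


def resolve_score(first: int, second: int, third: Optional[int] = None) -> int:
    """Return a final primary-trait score (1-4) from two or three ratings.

    Assumption (not stated in the source): when the first two raters
    disagree, the third rater's score becomes the final score.
    """
    for rating in (first, second, third):
        if rating is not None and not 1 <= rating <= 4:
            raise ValueError("ratings must be on the 1-4 scale")
    if first == second:
        return first  # the two raters agree; no third rating is needed
    if third is None:
        raise ValueError("a third rating is required to resolve a disagreement")
    return third  # assumed tie-breaking rule: the third rating decides
```

For instance, `resolve_score(2, 3, 3)` yields 3 under this assumed rule, while `resolve_score(4, 4)` needs no third rating.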
This rather straightforward design yielded, as we show below, useful results within the framework for which it was intended. QUILL was shown to be effective for teaching writing at three grade levels and for two of the functions. But it is important to interpret these results carefully. There were problems with the data collection. Moreover, as we discuss later in this section, the measures captured only a small aspect of the effect of QUILL, and entirely missed unanticipated changes in the classroom.
One problem was that the data from the sixth-grade classroom pair in Connecticut could not be considered because the teacher for the control class did not have her class complete the posttest writing sample. (As a "control group" teacher she may have felt that the posttest had little direct benefit for her class.) This was particularly unfortunate, because the corresponding experimental class was one we felt made especially interesting use of QUILL. We, of course, had other things to report about the class, but our summative evaluation paradigm could not include this classroom.
Preliminary analyses of the data showed greater improvement in expressive writing for the QUILL classes than for the control classes, but the difference was not statistically significant. Because the purpose of the evaluation was to certify QUILL's effectiveness and to specify precisely the areas in which it was demonstrated to be effective in the field test, we performed subsequent analyses on only the persuasive and the expository scores.
The first analysis of the data was done to determine which classrooms showed significant growth between pretest and posttest (see Table 8.1). On the expository writing samples, for each of the grades three, four, and five, QUILL students achieved gains between the pretest and posttest of 0.6 to 1.3 primary-trait scoring units (0.89 to 2.3 standard deviations). All of these gains were statistically significant (p < .01). In contrast, only the grade four control group showed a statistically significant gain (0.5 units; 0.75 SD). Similar results were obtained for persuasive writing. All three grade levels of QUILL classes showed statistically significant (p < .01) gains (0.58 to 0.74 units; 0.82 to 1.45 SD), whereas only the grade four control group did (0.52 units; 1.14 SD).
A second analysis of the data was done comparing the posttest scores of the QUILL and control groups for both persuasive and expository writing and for each grade level, to see whether QUILL students were writing significantly better than comparable control group students. A correlated t-test analysis of the data demonstrated that the differences were statistically significant (p < .01) in every case, including the two grade four comparisons. The pretest scores were also analyzed to verify comparability between QUILL and control classes. Only one of the classroom pairs (third grade, expository) showed a significant difference on the pretest. For that case, the control group dropped from 2.08 to 2.04 between pretest and posttest, whereas the QUILL group gained from 1.33 to 2.63. On the basis of this summative evaluation, the National Diffusion Network certified the use of QUILL for grades 3-5.
TABLE 8.1
T-test Results for Pre to Post Gains in Writing
| Grade | Class | Expository Pre | Expository Post | Persuasive Pre | Persuasive Post |
|---|---|---|---|---|---|
| 3 | Q | 1.33 | 2.63* | 1.54 | 2.28* |
| 4 | Q | 1.95 | 2.55* | 1.97 | 2.64* |
| 5 | Q | 2.27 | 3.00* | 2.29 | 2.87* |
| 3 | C | 2.08 | 2.04 | 1.50 | 1.88 |
| 4 | C | 1.60 | 2.10* | 1.72 | 2.24* |
| 5 | C | 2.20 | 2.44 | 1.92 | 2.11 |

Note. Q = QUILL class; C = control class. * = statistically significant pre-to-post gain.
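As a check on how the gains reported in the text follow from Table 8.1, the pre-to-post differences can be recomputed directly from the tabled means. The sketch below is a reader's aid under stated assumptions, not part of the original analysis: the names `TABLE_8_1` and `gains_for` are hypothetical, "Q" and "C" are read as QUILL and control classes, and the code computes only raw differences in scoring units. It does not reproduce the significance tests, which require the student-level data.

```python
# Mean primary-trait scores transcribed from Table 8.1 as (pretest, posttest).
# Keys are (class type, grade); "Q" = QUILL, "C" = control (assumed reading).
TABLE_8_1 = {
    ("Q", 3): {"expository": (1.33, 2.63), "persuasive": (1.54, 2.28)},
    ("Q", 4): {"expository": (1.95, 2.55), "persuasive": (1.97, 2.64)},
    ("Q", 5): {"expository": (2.27, 3.00), "persuasive": (2.29, 2.87)},
    ("C", 3): {"expository": (2.08, 2.04), "persuasive": (1.50, 1.88)},
    ("C", 4): {"expository": (1.60, 2.10), "persuasive": (1.72, 2.24)},
    ("C", 5): {"expository": (2.20, 2.44), "persuasive": (1.92, 2.11)},
}


def gains_for(class_type: str, writing_type: str) -> dict:
    """Map grade -> posttest minus pretest, rounded to two decimals."""
    return {
        grade: round(scores[writing_type][1] - scores[writing_type][0], 2)
        for (kind, grade), scores in TABLE_8_1.items()
        if kind == class_type
    }


if __name__ == "__main__":
    for class_type in ("Q", "C"):
        for writing_type in ("expository", "persuasive"):
            print(class_type, writing_type, gains_for(class_type, writing_type))
```

Run as a script, this gives QUILL expository gains of 1.3, 0.6, and 0.73 units and persuasive gains of 0.74, 0.67, and 0.58 units for grades three to five, matching the ranges quoted in the text, along with the grade four control gains of 0.5 and 0.52 units.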
Other Summative Evaluations. The QUILL field test provided quantitative evidence that students in grades three to five who used QUILL improved their ability to write well for both exposition and persuasion. Other summative evaluations were also done. A study in a classroom (grades 4-6) in Shungnak also showed statistically significant improvement in students' writing using the primary-trait scoring. In contrast to the other summative evaluation, the greatest improvement was in expressive writing. Moreover, the students' posttest writing samples were almost twice as long as their pretest samples. Perhaps most significantly, at the end of the year students asked to have their writing samples back so that they could continue working on them. They no longer thought of writing as a school task to be completed in 20 minutes, but as a creative process that they had the right and responsibility to complete. This change in attitude towards writing was not, however, assessed in the summative evaluation.
Writer's Assistant (published by Interlearn), the text editor used in QUILL, was itself the subject of a summative evaluation. Levin, Boruta, and Vasconcellos (1983) worked with two classrooms, one using Writer's Assistant for 4 months, and one not. Students generated samples of writing using pencil and paper both before and after using the computer. In the experimental (Writer's Assistant) class, there was a 64% increase in writing sample length versus 4% in the control class. On a 4-point holistic scale, the experimental group's scores increased from 2.00 to 3.09 between pretest and posttest, whereas the control group decreased slightly, from 2.27 to 2.24.
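The logic of contrasting an experimental group's change with a control group's change can be made concrete with the holistic scores just cited. The sketch below is purely descriptive arithmetic, not the analysis that Levin, Boruta, and Vasconcellos performed, and the function name `difference_in_gains` is hypothetical.

```python
def difference_in_gains(exp_pre: float, exp_post: float,
                        ctl_pre: float, ctl_post: float) -> float:
    """Experimental gain minus control gain, rounded to two decimals."""
    experimental_gain = exp_post - exp_pre
    control_gain = ctl_post - ctl_pre
    return round(experimental_gain - control_gain, 2)


# Holistic scores reported for the Writer's Assistant study (4-point scale).
writers_assistant_contrast = difference_in_gains(2.00, 3.09, 2.27, 2.24)
print(writers_assistant_contrast)  # 1.12
```

A contrast of this kind says how much more the experimental class changed than the control class, but by itself it says nothing about statistical significance or about why the change occurred.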
Summary. These assessments are representative of summative evaluation in general. Given a well-defined measure, such as a primary-trait score for a writing sample, one can structure a study to assess whether there was a significant change following the introduction of the innovation, and contrast that change with the change found in a control group. The summative evaluation of QUILL provided evidence that QUILL "worked"; that is, for specified grade levels and types of writing, students in QUILL classrooms improved their writing skills more than did students in comparable non-QUILL classrooms. These results were useful to those making decisions about using QUILL or supporting its dissemination.
Formative evaluation is a second widely used method for looking at innovations. Here, the audience is not the end user, but rather the developer of the innovation. The developers introduce the innovation into a suitable context, or a small number of such contexts. They then monitor its use to determine how different features work, with the goal being to make appropriate modifications to the innovation. The methods are typically observations and interviews.
For example, suppose the developers observe that one student has difficulty deciphering a particular screen display. In the formative evaluation process this would probably be taken as a sign that the display should be examined and possibly modified. Since the developers are still engaged in shaping the innovation, they cannot afford to ignore any indicators of how the innovation functions, even without a formal statistical analysis. In contrast, in a summative evaluation, the point is to assess how the innovation as a whole achieves its goals. One takes the innovation as fixed, ignores the details, and looks for overall effects. A single student's difficulty in understanding the screen might never be noticed in a summative evaluation. In what follows, we relate several episodes in QUILL's development/formative evaluation cycle. Each of these represents a change we made in QUILL as a result of experiences with its use. Many of these changes seem obvious in retrospect but only became clear in the process of formative evaluation.
The Process Tree Interface. Early in the development of QUILL we experimented with an interface in which aspects of the writing process were made explicit (see Fig. 8.1). The user would specify where he or she was in the writing process and the computer would provide the appropriate tools. We thought this might (a) help the students develop a metacognitive awareness of the writing task, and (b) keep the students focused on writing goals rather than on computer procedures.
What we found was that our "process tree" became too bushy. The distinctions it called for were often too fine or inappropriate for the writer at a given point in the writing and multiple paths to the same computer procedure created confusion. These problems were apparent in the initial trial period and led us to revise our conception of the interface. In the end we still maintained an aspect of the original multilevel design by asking users to specify the writing environment they wanted to use (PLANNER/LIBRARY/MAILBAG) rather than the tool itself (editor, mail system, keywords, etc.), but the final design was simpler to use and better in that it did not imply a fixed writing process.
FIG. 8.1. The writing process tree interface used in an early version of QUILL.
Distinction Between Revision and Editing. Another shift in emphasis came about with experience in the use of QUILL for revision. As we saw revision being reduced to copyediting in many classrooms, we realized that we had not provided a clear model for encouraging revision in either the QUILL Teacher's Guide or the teacher training sessions. This led us to emphasize revision more in later work with teachers. It is noteworthy that the rejected process tree had made distinctions among various aspects of revising and editing. These contradictory modifications in QUILL's view of revision and editing indicate why it is useful to think of development and evaluation as a continuous process, in which the "freezing" of an innovation occurs at a point dictated more by timing and economics than by any sense of complete knowledge about its effects.
Form Letters. Another cycle of formative evaluation occurred in our development of MAILBAG. We originally thought that a library of forms for letters would be a useful resource for students. We considered including memos, personal letters, business letters, and perhaps, finer subdivisions of letter types. The library would then be both a handy tool to assist in the letter writing process, and a tool for instruction, since it would illustrate the ideal schemata for different writing purposes. The computer would automatically generate portions of a letter once the student had specified his or her goals.
This idea turned out to be unsuccessful. Within the confines of the Apple II computer and the need to have a simple, easy-to-use interface, we offered only a few choices of letter types. For inexperienced writers, even this selection was confusing. In order to choose the appropriate letter type, the students had to know almost as much as the computer provided. Thus many students were either confused by the options or felt them to be an unnecessary burden. Worse still, the forms constrained writers in unfortunate ways. In some cases, writers wanted to contradict the form, and their efforts often resulted in bizarre texts: Figure 8.2 shows part of a personal letter with the student's own closing ("Love, "), followed by the computer generated closing ("Sincerely, ").
FIG. 8.2. A closing produced by an early version of MAILBAG.
Based on these experiences, we abandoned the form letter approach. In deciding to make this change, we relied on the reactions of only a few users. There was no need for a formal statistical analysis, since the difficulties revealed in practice served as a convincing existence proof that our design needed reworking. This kind of tinkering is typical of the development/formative evaluation cycle. If our goal had instead been to explore general interface issues, we might have wanted to determine the reactions of a larger number of users.
Printer Form-Feeds. The heuristic driving the original design of QUILL was that computer tools and environments that were useful in the workplace for people who write professionally might also be useful for students learning to write. In general, this proved to be a productive approach. But there were surprises.
We had set the default for the printer to do automatic form-feeds after printing each text entry. This meant that the printer turned its platen until the top of the next page of a continuous roll of paper was in place after printing each composition. Thus, users would get their texts on separate sheets of paper. In the workplace context this is done without question. Users want their products on separate pieces of paper and like the convenience of having the computer advance the pages automatically.
But in some schools there was a critical shortage of paper. When the average student text in the early grades was only a few lines long, automatic form-feeds resulted in printouts with a few lines at the top and 6 or 7 inches of blank space. This meant that the school would be using several times as much paper as absolutely necessary. Although the default setting could be overridden, several people in the school wanted no default automatic feed at all, so that paper use could be minimized without any special action being taken.
There was an additional reason for changing this default setting. Some teachers wanted to print out all the texts on one disk, or all the pieces by one student. They found it convenient to have the texts printed successively with minimal separation. Given this and the cost considerations, we changed the default to be "no automatic form-feed," something no industrial or commercial computer printing operation would have ever done.
Number of Copies. In an early version of QUILL we had another surprise that in retrospect seems predictable, but at the time was not. When a student finished working on a piece, QUILL simply asked whether he or she wanted a copy of the piece to be printed out. Students complained. We had encouraged collaboration in writing, including working in pairs at the computer, but provided no natural way for students to get multiple copies of a collaborative piece. Although students could always make extra listings of a text, the process was more cumbersome than it should have been. We quickly changed the program so that multiple copies could be printed easily. Then, teachers complained. They discovered students were printing out many more copies than they needed, thus wasting precious paper. We had to find an appropriate compromise between cost and convenience. The "final" version of the program allowed students to print up to 10 copies at a time.
Prominence of MAILBAG. Formative evaluation can include much more than just the acceptance or rejection of particular design ideas. It can also result in qualitative changes in how aspects of the innovation fit together. This was true of the history of MAILBAG. Originally, we had viewed MAILBAG as just one among many methods for publishing or distributing writing. As such, it played a rather minor role in our early conceptions of QUILL, not even appearing on the first three levels of the process tree. As we observed students using the program and received feedback from teachers, we came to give MAILBAG an ever more central role in both the software and the activities. In the end, it was one of the three major programs in QUILL, had a prominent role in the QUILL Teacher's Guide, and was introduced in teacher training sessions as an appropriate first activity for both teachers and students.
Summary. These stories are but a small sample of those pertaining to changes that we made to QUILL as a result of early observations of its use in classrooms. They show how formative evaluation typically proceeds, basically as a trial-and-error process in which the innovation is repeatedly revised in response to experiences with its use. The emphasis on experience with use and the concern for modifying details of the innovation means that formative evaluation usually reveals more about the process of use than does summative evaluation. But because the focus in formative evaluation is on improving the innovation, there is little attention paid to variations in use, nor is there a concern with long-term changes in the social context of use or in the ways the innovation is assimilated. These issues cannot be ignored if we want to understand how an innovation would be realized in a given context.
Some of the differences between formative and summative evaluation as they are usually carried out are summarized in Table 8.3. The categories listed in the first column are useful ones for examining any kind of evaluation; we will return to them when we discuss situated evaluation.
Focus. Summative evaluations are concerned with the effects of using an innovation. Thus, a summative evaluation assesses changes in, say, students' learning of a new concept, and treats the technical details of the innovation as if they were in a black box. In contrast, formative evaluations tend to focus on the innovation per se, and particularly on the innovation as a set of new technologies to be debugged. Although the ultimate goal may be to bring about some change in the users or the setting of use, the immediate focus is on the technology per se.
TABLE 8.3
Differences Between Formative and Summative Evaluation
| | Formative | Summative |
|---|---|---|
| Focus | Innovation | Effects of the innovation |
| Audience | Developer | User |
| Purpose | Improve the innovation | Decide whether to adopt innovation |
| Variability of Settings | Minimized to highlight technology | Controlled by balanced design or random sampling |
| Measurement Tools | Observation/interview/survey | Experiment |
| Time of Assessment | During development | After development |
| Results | List of changes to the innovation | Table of measures contrasting groups |
Audience. One key distinction between the two types of evaluation pertains to the audience for the results. Summative evaluation results are often published so that any of a large number of potential users can make informed decisions about the innovation. In contrast, formative evaluation is done for (and by) the developers or implementors so that they can make improvements to the innovation. They typically make changes as needed and do not report the results outside of a small community.
Purpose. Evaluations are done for some purpose, usually one that includes a specific action with respect to the innovation. This action is of course dependent on the audience for the evaluation. For summative evaluation, the action is the potential user's: decide whether to adopt or continue use of the innovation. In the case of formative evaluation, the audience is the developers and the action is to improve the innovation based on experiments with its use.
Variability of Settings. In doing a summative evaluation, one focuses on the value of the innovation. Thus, one looks for controlled variation in the settings in which the innovation is implemented. If the settings are not all the same, there is nevertheless a preference for, say, a balance (e.g., among rural, urban, and suburban settings). One needs to assume that variations in use can either be attributed to fairly well-understood causal factors, or that random variation will be of no consequence with a sufficiently large sample size. The study is structured to constrain the effects of context in order to say more about the effects of the innovation itself. In doing a formative evaluation, there is a similar concern for controlled variation. Because the primary concern is to improve the innovation, one wants contexts that are typical, representative, or that at least reveal meaningful strengths and weaknesses of the innovation, that is, contexts in which the innovation is used as intended by the developers. "Nonstandard" uses are not particularly informative at this stage, and could even induce changes that are inappropriate for the majority of users.
Measurement Tools. A variety of tools can be employed for either type of evaluation, so it is simplistic to imply that there is a one-to-one correspondence between measurement tools and evaluation types. Nevertheless, certain tools are typically associated with particular types of evaluation. Because summative evaluations often seek quantitative, statistically significant results, they are usually conducted within a formal experimental design. In contrast, formative evaluation does not often call for quantitative results. Instead, the personal reactions elicited by interviews and observations are usually the most useful.
Time of Assessment. The evaluation types can also be distinguished by their time of application. Summative evaluation is performed after the development has reached a stopping point, whereas formative evaluation, by definition, is carried out during development.
Results. Summative evaluations typically yield quantitative results with quantitative bounds on the possible "error of measurement." These results can be stated concisely, and are often represented by a table or graph. Formative evaluations typically produce qualitative results, such as a list of changes to be made to the innovation.
Despite these differences, the distinction between summative and formative evaluation is not always clear in practice. Suppose, as is usually the case, that a summative evaluation identifies some strengths as well as some weaknesses of the innovation. A potential user might simply weigh these strengths and weaknesses in order to decide whether to adopt the innovation. But if the developers had insights into the reasons for the measured effects, they could use the same results to guide a revision of the innovation. Thus, what for the user was a summative evaluation could be viewed by the developer as a part of a cycle in which each evaluation points to areas of needed improvement. One could even think of formative evaluation as a series of microsummative evaluations of portions of the innovation, with the aim of identifying the areas in which revision is most needed. Viewed this way, summative evaluation is a feedback mechanism for formative evaluation, but only if one understands why the innovation performed as it did.
In other situations, formative evaluation can yield summative-type results. Data collected in order to guide revisions of the innovation can also be integrated for the purposes of a summative assessment. For example, the number of times a feature of the software was used would be relevant to a formative evaluation; it might be used later in a summative evaluation to stratify a sample into frequent and infrequent users.3 One of the distinctions, then, between formative and summative evaluations is simply whether the data are interpreted as feedback to the developer for changes or as a final assessment addressed to the user.
R. M. Wolf (1990) described three key limitations. First, most evaluations do not identify the reasons for the observed phenomena. Thus, they do not say how the innovation can be improved, nor what aspect of it produced the measured effects. Second, not being able to account for why changes occur means that it is questionable to generalize to other settings in which the innovation might be used. Third, the development process often continues after the evaluation, so that most evaluations are effectively of innovations that no longer exist. Again, without knowing more about the situation and process of use one cannot say whether initial results are still valid for the changed innovation.
Consider, for instance, the results in the QUILL study showing differential improvement across writing functions. The principal summative evaluation found "effectiveness" for expository and persuasive writing, but not for expressive writing. Within the framework in which QUILL was being certified, these writing function distinctions were pertinent and easily quantifiable. It is plausible that they resulted from the emphasis within QUILL activities on a variety of purposes and audiences. Students who wrote newspaper articles, letters, reviews, brochures, editorials, and other types of texts may have learned how to write appropriately for different functions, whereas students who wrote in only one genre might not have developed a sensitivity to writing function differences. And if, as we believe, writing activities of standard classrooms at that time emphasized expressive writing, then there would be less likelihood of significant improvement for that function.4
It is likely that the types of writing for which students show greatest improvement are those they practice. The writing function distinctions found in the QUILL evaluation may thus reflect the distribution of writing across functions in QUILL classes. If this is so, one might conclude that QUILL was generally useful for learning writing and that the areas of greatest improvement would be those the students practiced. This suggests actively promoting writing using the computer with functions in which one wants to see specific improvement.
On the other hand, the results may simply reflect an overall improvement in writing ability. QUILL students probably wrote more and this alone may have made them better writers, regardless of the function. If that is the case, then active promotion of a function of writing might not produce any greater difference for that type of writing. If contriving activities to exercise that function diminished students' roles in selecting topics, their thinking about differences among purposes and audiences, or the overall amount of writing, it could even have negative effects on writing development. The summative results alone do not support a choice between these or other conflicting hypotheses, which have important practical implications. Although they highlight differences among classrooms, they do nothing to clarify why the differences might exist. A similar point holds for the grade distinctions found in the summative evaluation. There, the vagaries of data collection caused us to discard data from one of our most interesting classes, one in which the greatest amount of writing occurred. Even if we had included this classroom, the summative evaluation methodology would have provided little insight into grade-level differences.
A related point is that in order to assess before/after changes the evaluator needs to know the measure at the beginning of the evaluation period. This means that many of the most intriguing effects cannot be measured because they are unanticipated. For example, revision in some QUILL classrooms occurred not just because Writer's Assistant facilitated the mechanical act of editing, but because QUILL catalyzed changes in the social organization of writing, for example, by stimulating more collaboration (see chapter 6; also, Bruce, Michaels, & Watson-Gegeo, 1985). Yet, we did not measure the degree of collaboration in classrooms before they used QUILL, so we could not evaluate the changes that occurred.
Most of these limitations have been recognized by others, and various solutions have been proposed. These solutions are typically put forth as alternative methods of evaluation. They represent variations on the values for the categories in the comparison chart given in the previous section. For example, adversary evaluation (Clyne, 1990) and judicial evaluation (R. L. Wolf, 1990) entail that the audience for summative evaluation is not only the user, but other evaluators presenting an opposing viewpoint. Decision-oriented evaluation (Borich, 1990), goal-free evaluation (Stecher, 1990), and illuminative evaluation (Parlett, 1990) vary the purpose for the evaluation, from responding to the potential user's stated criteria to revealing whatever one can find about the innovation. Naturalistic evaluation (Dorr-Bremme, 1990) and case study methods (Stenhouse, 1990) allow for a greater variability of settings. Other methods similarly vary the types of results produced, the time of assessment, or the measurement tools.
We discuss only a few of these alternative evaluation methods here,5 relating them to R. M. Wolf's (1990) three key limitations described above. We will look first at some methods that attempt to assess why changes occurred, as well as to document that they occurred. Second, we will consider the issue of generalization, looking at an approach for studying the use of innovations across settings. We will pass over the third issue, that innovations themselves change after evaluation, because little has been done to address it. Each of the methods discussed makes a valuable contribution to the evaluation problem but, when used within the standard frameworks, cannot escape their inherent limitations.
It is noteworthy that although there is considerable disagreement among all these methods over how to evaluate an innovation, there is a general consensus about what is to be evaluated, namely, that the evaluation should be of the innovation, and that "innovation" is a meaningful, well-defined term. We return to this issue in the next section.
Others have argued for broadening the range of measurement tools used for summative evaluation, specifically to include qualitative measures and results. Miles and Huberman (1984), for example, presented a variety of qualitative methods for use in summative evaluation. These methods include interviews, observations, surveys, and self-reports. They typically result in verbal descriptions of effects of the innovation; or, sometimes, visual displays such as networks to show causal relationships between factors in the situation and the implementation of the innovation; or diagrams that show variations in use along two dimensions. With these methods both the measures and the results can be qualitative.
Nevertheless, for many qualitative researchers, it is still the commonalities across cases or settings that are of interest, as it is for standard summative evaluation. Miles and Huberman (1984) stated:
More and more qualitative researchers are using multisite, multicase designs, often with multiple methods. The aim is to increase generalizability, reassuring oneself that the events and processes in one well-described setting are not wholly idiosyncratic.... The researcher uses multiple comparison groups to find out the kinds of social structures to which a theory or subtheory may be applicable. Having multiple sites increases the scope of the study and, thereby, the degrees of freedom. By comparing sites or cases, one can establish the range of generality of a finding or explanation, and, at the same time, pin down the conditions under which that finding will occur. (p. 151)
The overall goal is the same as for strictly quantitative summative evaluations: to assess the usefulness of the innovation. These qualitative approaches maintain the standard summative evaluation goals, audience, and overall methodology. There is still an emphasis on generalizations rather than on contrasts, on "effects" of the innovation rather than on identifying its realizations, and a minimal concern for the details of the innovation.
In fact, many proponents of qualitative methods for evaluation (Miles & Huberman, 1984; Patton, 1980; Van Maanen, 1983) argued that the use of qualitative methods (observations, survey, interview, etc.) simply enlarges the scope of relevant data rather than changing the fundamental structure and purpose of evaluation. Both qualitative and quantitative researchers, they argue, must be concerned with data reduction, display, and drawing conclusions. The general goal in either case is to establish "findings" that are generalizable. With this general approach, if a finding then holds up across many cases it can be deemed solid or true. Idiosyncratic results can be more easily dismissed. For these reasons, qualitative methods are a useful addition to the summative evaluation framework, but they still fail to address many of its limitations.
Another alternative method is responsive evaluation6 (Stake, 1990), a method that attempts to achieve a better understanding of the process of change by being more sensitive to the perspective of the users of the innovation:
Responsive evaluation is an approach to the evaluation of educational and other programs. Compared to most other approaches it is oriented more to the activity, the uniqueness, and the social plurality of the program.
The essential feature of the approach is a responsiveness to key issues, especially those held by people at the site. It requires a delay and continuing adaptation of evaluation goal setting and data gathering while the people responsible for the evaluation become acquainted with the program and the evaluation context.
Issues are suggested as conceptual organizers for the evaluation study, rather than hypotheses, objectives, or regression equations. The reason for this is that the term "issues" draws thinking toward the complexity, particularity, and subjective valuing already felt by persons associated with the program. (p. 76)
Responsive evaluation is thus particularly sensitive to the interests and values of the variety of participants involved with the innovation. Formative evaluation, for example, can be done in a way that brings the users of the innovation into the development process. Their issues can then be made central to the activity of (re)designing the innovation. Similarly, summative evaluations can be made more responsive by focusing on desired educational results identified by the users of the innovation.
The need to look at variations in situations is essentially the same point as that made by Dukes (1971) in a famous article ("N = 1") on the value of psychological experiments with only one subject. Dukes argued that because situations vary greatly, a researcher may learn as much or more by observing one subject in many situations as by observing many subjects in one situation. In effect, representative sampling is applied to problems or situations rather than to subjects:
In fact, proper sampling of situations and problems may in the end be more important than proper sampling of subjects, considering the fact that individuals are probably on the whole more alike than are situations among one another. [Brunswik, 1956, p. 39]
In the QUILL work we conducted case studies in a number of focal classrooms (Loucks-Horsley, French, Rubin, & Starr, 1985). One measurement tool we used was a component checklist (Loucks & Crandall, 1982; Loucks-Horsley & Hergert, 1985). The checklist defined 17 components of QUILL's use that we judged to be useful indicators of its implementation. We group these components in terms of QUILL's pedagogical goals, with an additional category for "classroom management" (see Table 8.4).
For each component, likely variations in classroom settings were identified. The component checklist scheme then called for designating which of the variations represented "ideal" implementations, which were "acceptable," and which were "unacceptable." The assignment of variations to these categories represented our judgment about which types of use were faithful to the idealization of QUILL.
For example, in terms of "frequency of use," we thought it ideal to use QUILL every day. It was "unacceptable" from the point of view of using QUILL successfully for students to use it once a week or less. More than once, but less than daily, was "acceptable," but not "ideal." Thus, this component was defined as shown at the top of Fig. 8.3. The vertical grey line in the figure separates ideal from acceptable uses. The vertical black line separates acceptable from unacceptable uses. Note that the last variation includes daily writing by students, but no use of QUILL. In a larger context, most people would judge the daily writing to be desirable, but for the purpose of measuring QUILL's implementation, the non-use of QUILL would have to be as shown, to the right of the black line and thus unacceptable.
TABLE 8.4
Components of QUILL
_________________________________________________________________
QUILL Pedagogical Goal            Component
-----------------------------------------------------------------
1. Planning                       Use of PLANNER
2. Integration of reading         Integration with content areas
   and writing
3. Publishing                     Sharing writing
                                  Writing for different audiences
4. Meaningful communication       Use of LIBRARY and MAILBAG
                                  Writing in different genres
5. Collaboration                  Working in pairs
6. Revision                       Teacher's comments
                                  Teaching revision
                                  Conferencing
                                  Frequency of student revision
                                  Nature of student revision
Classroom management              Frequency of use
                                  Scheduling of QUILL
                                  Composing at the computer
                                  Students using QUILL
                                  Classroom structure
_________________________________________________________________
Component checklists can be used to generate a profile of the practices within a classroom. A teacher, for example, might use the checklist to obtain a profile of how her use of QUILL compared to the "ideal." A difference, perhaps in the amount of revision, would then stand out as an area for further work. The checklists can also be used to assess changes over time. QUILL classrooms, for example, might move from no use of PLANNER to student creation of planners. When the checklists are used to assess the overall level of implementation or to describe the innovation's impact on classrooms, they serve an essentially summative evaluation role.
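The logic of a component checklist can be sketched in a few lines of code. The sketch below is an illustration under stated assumptions, not the instrument used in the study: the rating rule covers only the "frequency of use" component as described in the text, "daily" is assumed to mean five school days a week, and the function names and the sample classroom ratings are hypothetical.

```python
from enum import Enum


class Rating(Enum):
    IDEAL = "ideal"
    ACCEPTABLE = "acceptable"
    UNACCEPTABLE = "unacceptable"


def rate_frequency_of_use(days_per_week, uses_quill=True):
    """Rate the 'frequency of use' component following the text's definition."""
    if not uses_quill:
        # Writing without QUILL counts as non-use, however often students write.
        return Rating.UNACCEPTABLE
    if days_per_week >= 5:  # assumption: "daily" means five school days
        return Rating.IDEAL
    if days_per_week > 1:   # more than once a week, but less than daily
        return Rating.ACCEPTABLE
    return Rating.UNACCEPTABLE  # once a week or less


def areas_for_further_work(profile):
    """Return the components of a practice profile that fall short of ideal."""
    return {name: r for name, r in profile.items() if r is not Rating.IDEAL}


# Hypothetical practice profile for one classroom.
profile = {
    "Frequency of use": rate_frequency_of_use(3),
    "Use of PLANNER": Rating.IDEAL,
    "Student revision": Rating.UNACCEPTABLE,
}
gaps = areas_for_further_work(profile)
print(sorted(gaps))
```

Note what the sketch makes explicit: every rating is a judgment of distance from the idealization, which is why a classroom with daily writing but no QUILL is scored as unacceptable.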
In the Loucks-Horsley et al. (1985) study, practice profiles were produced for 10 teachers. Using these profiles along with other data, researchers were able to categorize teachers into four groups: "problematic," "superficial," "solid," and "super" users. This categorization was then used in an analysis of incentives and barriers to implementation. For example, they found that "support and assistance from others can not only eliminate disincentives but can serve to maintain the influence of incentives over the course of the implementation process" (p. 74).
Frequency of Use
  Students use QUILL daily.
  Students use QUILL several times a week.
  Students use QUILL once a week or less.
  Students do not use QUILL, but do write
    a. daily
    b. several times a week
    c. once a week or less

Use of PLANNER
  Teacher uses PLANNER in a variety of ways: e.g., creating pre-writing activities for students; having students create PLANNERS for themselves or other students.
  Students create PLANNERS for their own writing assignments or for each other.
  Teacher uses PLANNER to create pre-writing activities for students.
  Teacher does not use PLANNER,
    a. but includes planning activities prior to writing
    b. and does not use other pre-writing activities

Writing in Different Genres
  Teacher gives students QUILL writing assignments in several different genres.
  Teacher gives students QUILL writing assignments in one or two genres.
  Teacher does not use QUILL, but
    a. students typically write in several different genres
    b. students only write in one or two genres

Writing for Different Audiences
  Students use QUILL to write to a variety of real audiences.
  Students rarely use QUILL to write to different audiences.
  Students rarely use QUILL to write to real audiences.
  Teacher does not use QUILL, but
    a. students write to different audiences
    b. students write to real audiences

Student Revision
  Student's revision reflects a balance between content and mechanics.
  Student revision focuses only on content.
  Student revision focuses only on mechanics.
  Students do not revise.
The standard evaluation paradigm often does not support showing why changes occur, how changes are different across settings, or how they relate to changes in the innovation. Alternative methods of evaluation address these problems to a certain extent, but as long as they are used within the standard paradigm they inherit its intrinsic limitations. For example, a set of case studies done within the summative framework often entails the need to express conclusions in terms of a summary statement about "the effects" of using the innovation. Much of the richness of the case studies is lost as users are categorized and aggregate statements are formulated. As long as the focus is on the innovation, it is difficult to circumvent this problem.
The standard evaluation paradigm presupposes the setting in which the innovation is used to be a passive system. It focuses on the innovation per se, on its properties, in the case of formative evaluation, or on its effects, in the case of summative evaluation. Papert (1987b) described this focus as "technocentrism." He related it to the child's early focus on the self:
Egocentrism for Piaget does not, of course, mean "selfishness"-it means that the child has difficulty understanding anything independently of the self. Technocentrism refers to the tendency to give a similar centrality to a technical object-for example computers or Logo. This tendency shows up in questions like "What is THE effect of THE computer on cognitive development?" or "Does Logo work?" (p. 23)
One consequence of technocentrism is that the process of change is conceptualized as a function of the innovation alone, or else it is effectively ignored. What is needed is a different focus entirely for the evaluation process, one which we call situated evaluation. Before discussing it in detail, though, we need to step back and ask some fundamental questions about what it is that is being evaluated.
Examples such as those given in earlier chapters make it difficult to maintain a view of innovations as fixed objects that get applied to produce changes in social systems. Instead, they lead us to see innovations as processes, ongoing manifestations of social relations. This calls for an historical perspective in which we follow social changes over time, including those related to the development of innovations. We need to conceive of the adoption of an innovation as a process in which innovations are incorporated into a social system in a complex fashion that may lead to changes in the innovation, the social system, both, or neither.
It is important to make a distinction between what the developers of an innovation intend and what happens when the innovation is realized in a particular social setting. The developers may intend that the innovation modify the social system so that certain desirable characteristics are achieved. They see the innovation set in an idealized context and used in an idealized way. Their vision of the changed social system is thus an idealization. What happens in practice is that the social system may or may not change at all, and if it does change, it may not do so in accord with the developers' goals. Each resulting social system is a realization. The distinction between ideal and real suggests a process, the realization process, whereby the innovation leads to practices potentially different from those intended by the developers.
One possible way to think of the relationships between idealization and realization is to see the idealization as what Plato called (for reasons that do not concern us here) the "fifth entity." By fifth entity, Plato meant the real essence of an object, or the ideal form that lay behind any actual manifestation. For example, any circle one sees is for Plato a mere object with "particular qualities." It imperfectly represents the "real circle," because it has minute straight segments. Thus it is the "opposite" of the fifth entity. The real circle has no straight segments:
Every circle that is drawn or turned on a lathe in actual operations abounds in the opposite of the fifth entity, for it everywhere touches the straight, while the real circle, I maintain, contains in itself neither much nor little of the opposite character.... The important thing is that, as I said a little earlier, there are two things, the essential reality and the particular quality.... (Hamilton & Cairns, 1961, p. 1590)
For Plato, then, the idealization would be to its realizations as the real circle is to its manifestations. From this perspective, the realization process would be seen as generating various distortions, partial maps, images, or shadows of the idealization. Realizations would then be somewhat ephemeral and inconsequential, valuable primarily as possible clues to the true structure of the ideal. This view is represented in Fig. 8.4. The solid circle on the left represents the effect of the innovation in an ideal world; the lens represents the realization process, which in this view distorts the ideal form, and the dotted figure on the right represents a particular realization that matches more or less well to the idealization.

FIG. 8.4. A Platonic view of the realization process
It should come as no surprise that we consider this Platonic view to be untenable. Social practices related to the use of an innovation are not imperfect attempts to mimic some ideal form, but are rather the thing itself. Whereas we may contrast the use of an innovation with its idealization, we do not assume that users are imperfectly following preset rules. The situation instead is more akin to Wittgenstein's (1974) language games:
In philosophy we often compare the use of words with games and calculi which have fixed rules, but cannot say that someone who is using language must be playing such a game.-But if you say that our languages only approximate to such calculi you are standing on the brink of a misunderstanding. For then it may look as if what we were talking about were an ideal language. (p. 81)
Wittgenstein goes on to show how language use, not some rigid set of rules, determines meaning. Nevertheless, many continue to search for the vacuum bottle ideal for language: "We think it [the ideal] must be in reality; for we think we already see it there" (Ibid, p. 101). In a similar way, we cannot specify the pure, or ideal, case for the use of an innovation, only its idealization in the minds of the developers. Users inevitably interpret an innovation in distinctive ways, apply it idiosyncratically in their own contexts, and even re-create it to satisfy their own needs.
Again, Wittgenstein's discussion of games is apropos:
We can easily imagine people amusing themselves in a field by playing with a ball so as to start various existing games, but playing many without finishing them and in between throwing the ball aimlessly into the air, chasing one another with the ball and bombarding one another for a joke and so on. And now someone says: The whole time they are playing a ball-game and following definite rules at every throw.
And is there not also the case where we play and-make up the rules as we go along? And there is even one where we alter them-as we go along. (p. 83)
The innovation-in-use, like the actions of people playing with a ball, is the phenomenon we want to understand. Thus, a better view of the realization process is that shown in Fig. 8.5. There, the solid shape on the right represents the social practices that emerge after the introduction of an innovation. Its characteristics reflect a history of interacting social processes, of which the innovation is only a latecomer, and one whose effects are shaped by layers and layers of previous events. The dotted circle on the left is the idealization, an imagined system, whose correspondence to the given realization depends as much on the developers' understanding of the context of use as upon the inherent power of the innovation to effect change. In other words, its similarity to the realization depends on the developers' assessment of the underlying social processes in the context of use.

FIG. 8.5. A Wittgensteinian view of the realization process.
The diversity of the realization process is revealed as we examine what happens when an innovation is introduced into various settings. Since the realization of an innovation is different in each setting, one idealization can spawn an indefinite number of realizations. Continuing our optics metaphor, we might say that instead of the realization process being a lens, it is a prism that produces a wide spectrum of different realizations (Fig. 8.6). As an innovation comes into being in real settings, it acquires new and unexpected shapes because of the differences between its idealization and its various realizations. It is not only used differently, it is re-created to conform with the goals and norms of the people who use it. (It may be helpful to think of the prism instead as a collection of context lenses, each of which focuses the idealization into a different realization.)
Attending to the use of an entity can open up our perception to new views. This can be seen in the case of a much simpler problem, viewing the Necker cube (Fig. 8.7). This is a visual illusion that is usually thought of as being perceivable in either of two ways. From one perspective it appears as a cube whose nearest face is in the lower left corner of the figure; from another perspective the nearest face is in the upper right corner. Most people tend to see the cube in one way at first, and then, with varying degrees of difficulty, can "flip" it so that the other way becomes apparent.
FIG. 8.6. Alternate realizations of an innovation produced by a prism or collection of context lenses.
Using what he calls "experimental phenomenology," Ihde (1977) argued that most studies of the Necker cube assume in advance that there are only two ways to perceive the cube, and thus close off any possibility of understanding alternate perceptions. For example, one could also see the cube as a truncated pyramid, being viewed from the top, or alternatively, from the inside looking up. If one is interested in understanding the phenomenon, in this case, different ways of perceiving the Necker cube, then it is essential to adopt a methodology that reveals different ways of perceiving, rather than one that assumes the existence of only two ways. What is needed is, in Ihde's words, an a priori science, a mode of investigation done prior to more formal hypothesis testing. The methodology does not measure variables, but instead is a means of identifying new ones.

FIG. 8.7. The Necker cube.
A similar example was used by Wittgenstein:
You could imagine the illustration appearing in several places in a book, a textbook for instance. In the relevant text something different is in question every time: here a glass cube, there an inverted open box, there a wire frame of that shape, there three boards forming a solid angle. Each time the text supplies the interpretation of the illustration. But we can also see the illustration now as one thing, now as another.--So we interpret it, and see it as we interpret it. (1974, p. 193)
Thus, the "same" object has different meanings in use, and our interpretations of those meanings shape what we see.7
We are interested in an a priori type of evaluation that is open to new variables and sensitive to alternate uses and interpretations. As Hymes (1974) said in describing "functional" linguistics, the "organization of use discloses additional features and relations [within language structure]" (p. 79). We should understand "discourse structures as situated, that is, pertaining to cultural and personal occasions which invest discourse [structures] with part of their meaning and structure" (p. 100).
A similar approach is needed for the study of innovations and change in which we recognize the situatedness of any realization of an innovation. A situated study of the uses of an innovation can disclose relations-contradictions, missing elements, patterns of sensitivity to context-within the idealization. Thus, analyses of the idealization (structure) and the realizations (function, or use) serve in a dialectical relation to each other; as we study one aspect we come to understand the other better as well. Without such an approach, one can never know if unanticipated changes occur. And these unanticipated changes may turn out to be the most significant for education.
New realizations of an innovation arise in each setting in which it is used. This leads us to conceive of innovations and the technologies within them in an entirely new way. Moreover, our basic evaluation questions (How well does it work? How can it be improved?) need to be reformulated. The "it" is no longer the innovation (or even what we now call the idealization), but the innovation-in-use, a situation-specific set of social practices. The fundamental question then becomes:
Similarly, other questions one might ask about innovations and social change need to be reformulated. Above, we asked questions such as those on the left in Table 8.5. Recognizing the richness and the importance of the realization process leads us to ask new sorts of questions such as those on the right in Table 8.5.
This book is an evaluation of QUILL, but it is neither a summative, nor a formative evaluation. Instead, it is a situated evaluation, one that analyzes the varieties of use of the innovation across contexts. The evaluation is focused on the innovation-in-use, and its primary purpose is to understand the different ways in which the innovation is realized. We use the term situated evaluation to emphasize the unique characteristics of each situation in which the innovation is used. Our guiding assumption is that the innovation comes into being through use. The object of interest is not the idealized form in the developer's head, but rather, the realization through use. Situated evaluation seeks to characterize alternate realizations of the innovation and to identify new variables. It assumes that measuring predetermined variables is insufficient, no matter how well those measurements are made.
TABLE 8.5
Questions About Innovations and Change
_____________________________________________________________________________________________
Old Questions                                 New Questions
---------------------------------------------------------------------------------------------
What can the innovation do?                   What do people do as they use the
                                              innovation?
To what extent are the innovation's           How do social practices change, in
goals achieved?                               whatever direction?
What constitutes proper, or successful,       What are the various forms of the
use of the innovation?                        innovation-in-use?
How should people or the context of use       How should the innovation be changed and
change in order to use the innovation         how can people interact differently with
most effectively?                             it in order to achieve educational goals?
How does the innovation change the            How does the community fit the innovation
people using it?                              into its ongoing history?
_____________________________________________________________________________________________
Explain Why the Innovation Was Used the Way It Was. A situated evaluation can help explain what happened, as opposed to just describing effects.
Predict the Results of Using the Innovation. This explanation can in turn provide the basis for predicting the realization of the innovation in similar contexts, providing the new context is well understood.
Identify Dimensions of Similarity and Difference Among Settings. Examination of a realization of an innovation can reveal characteristics of a setting, such as a teacher's underlying pedagogical philosophy, that might be less visible otherwise.
Improve the Use of the Innovation. Users of the innovation can refer to the situated evaluation as they work on improving the use of the innovation. They might find a realization whose setting has similar aspects to their own and specifically adopt practices of that setting. For example, a teacher might have students with low interest in writing start with the QUILL MAILBAG, if that strategy was found to be successful in a similar setting.
Improve the Technology. Developers, likewise, can refer to the situated evaluation as they try to improve the innovation in terms of its interaction with different contexts. In this way, situated evaluation serves as a sort of formative evaluation.
Identify Variables for Later Evaluation. Finally, a situated evaluation can help structure future observations of an innovation's use. One way it does this is by focusing attention on the most salient dimensions of the innovation with respect to particular contexts. This can be used to guide a complementary summative evaluation.
Situated evaluation cannot be proceduralized; it is a process of discovering relationships. Nevertheless, we saw patterns in the discovery process emerge as we performed the situated evaluation of QUILL. There were three major aspects of this process. We looked first at the idealization of QUILL, in order to delineate as fully as possible what was intended by the developers. This included analyzing the theoretical underpinnings, the technology, the suggested activities, and the support system for its use. Second, we examined the settings in which QUILL was to be used. Setting characteristics included the cultural backgrounds, institutional resources and constraints, the teachers' goals and practices, the students' roles, the nature of academic tasks, and other elements of the social environment. Third, we analyzed QUILL's realizations in different settings and generated hypotheses about how and why these realizations developed as they did. In what follows, we elaborate on these aspects of situated evaluation.
The Idealization of the Innovation
A thorough analysis of the elements of the innovation independent of its use within real settings is part of a situated evaluation because it serves to characterize how participants in the setting of use might have perceived the innovation. It is also an index of the intentions of the developers, people who are often important participants not only in the initial creation of the innovation, but in its re-creation in context.
In contrast to the priorities for summative evaluation, the innovation is not privileged over any of its realizations; similarity to the idealization does not count as more successful, and non-use can be as important to consider as "faithful" use. Moreover, the innovation is not seen as an agent that acts upon the users or the setting, but rather as one more element added to a complex and dynamic system. It would be more correct to say that the users act upon the innovation, shaping it to fit their beliefs, values, goals, and current practices. Of course, in that process they may themselves change, and their changes as well as those to the innovation need to be understood as part of the system.
There are several aspects of the innovation that need to be analyzed critically (see Fig. 8.8). First, each innovation emerges from a theory, articulated to varying degrees in documents about the innovation. Any educational innovation has a theory of both learning and teaching. For QUILL, this was presented in chapter 2 and in earlier articles about QUILL (Rubin & Bruce, 1985, 1986, 1990). The learning theory incorporated ideas about communication and its relation to education and community. The teaching theory had specific commitments to pedagogical principles such as collaboration and purposeful writing. These were summarized in terms of concepts such as functional learning environments, and very specifically, QUILL's six pedagogical goals.
The idealization of an innovation also includes new technologies, if only in the form of texts that imply changes in practices. We conceive of the technology broadly. First of all, it includes various tools, artifacts, or apparatus; in the case of QUILL, a new software system. Second, it includes prescriptions for use of the new tools, in this case, the QUILL activities as articulated in the Teacher's Guide. Third, there is a support system for users, for those who are to carry out the new procedures or activities. Obviously, the elaboration of these elements varies greatly among technologies. For QUILL, they are described in chapter 3: new technology in the form of computer software and hardware, a set of recommended reading and writing activities, and a support system for teachers and students. These elements reify the theory; for the users, they are the innovation.8

FIG. 8.8. Elements of an innovation that need to be analyzed.
For example, QUILL called for "meaningful communication with real audiences" (PG 4). This meant that the function of "communication" in writing should take precedence over the function of "exhibiting skills for the purpose of evaluation." This was a part of the QUILL theory. The software, for example, MAILBAG, provided a technological environment in which communication in writing was not only facilitated, but seen as an appropriate activity for both teachers and students. The QUILL Teacher's Guide, in particular its description of specific activities such as "Confidential Chat," showed procedures and activities that emphasized communication in writing. Finally, the support system around QUILL included specific elements intended to foster these changes: The training workshop included discussions of illustrative examples from other classrooms, and follow-up help in the classroom included communication-based activities specific to each classroom.
The Settings in Which the Innovation Appears
The shift in perspective from the view that realizations are distortions of the ideal to one in which realizations are creations that result from active problem-solving has implications for the sorts of questions researchers need to ask in evaluating innovations. With this perspective, the social context in which the innovation is used becomes central. Questions relating to cultural, institutional, and pedagogical contexts need to be addressed. To answer these questions in full is a formidable task, but focusing on a few specific aspects may go far in providing what is needed for a situated evaluation. In the QUILL study we found that cultural, institutional, and pedagogical contexts were all critical in shaping realizations. Of these, the pedagogical context was probably the most important.
The cultural context is another important factor in shaping how an innovation is used. That was one reason why we examined rural, urban, and suburban settings in the QUILL field test. In the Alaska project we saw a city/village distinction, and also a variety of languages and cultural traditions (as described in chapter 4). We also examined some specific factors related to socioeconomic status, home life, and previous schooling, such as:
There was a large linguistic diversity across QUILL classrooms. In addition, some classes were bilingual and students were able to write in two languages. Despite this variation, we did not have evidence that differences among QUILL classrooms could be attributed to linguistic differences. We did find that the topics and audiences of the writing were greatly influenced by the culture of the community. In addition, the city/village distinction appeared to have a major impact on how communication was viewed within the classroom. The village classrooms seemed much more receptive to the possibilities of communication as opposed to just "composition."
A second type of context that needs to be examined is the institutional. Here one needs to examine the ways in which goals and practices in the institution shape, constrain, or direct the use of the innovation, and to look at the resources available to support the innovation's use. Often, the availability of resources or the imposition of constraints at the building or district level has a significant impact on classroom practices. There are several levels of institution to consider: the school district, the school, and the classroom as a mini-institution.
In the QUILL study we looked at a variety of factors at different institutional levels, including:
A third category of context turned out to be the most important: how the teachers' goals and practices related to the way they incorporated the innovation into their classrooms. A recent study (Anderson, 1989) suggests that five dimensions can be used to characterize both instructional programs and the classrooms in which these programs are implemented. These dimensions are the following:
For example, in one classroom, a teacher may have adopted as a goal the improvement of scores on a basic skills test. Her own role in relation to this goal might be to convey information to students or to manage practice on these basic skills. The students' roles might be to receive this information and to apply it in daily practice activities. Tasks in such a classroom might include worksheets and short-answer quizzes that correspond to the basic skills test. Finally, the social environment might be one in which students work independently on the worksheets or respond to teacher questions, and the teacher provides feedback to the students on the correctness of their work.
In contrast, in another classroom the goals might include self-regulated learning and the use of language for communication, rather than evaluation. The teacher's role would be as a facilitator of student projects. Students would work alone or in collaboration on tasks whose functions were clear and meaningful for them. The tasks might require transformations and extensions of existing knowledge. Accordingly, the social environment would be one in which failure was accepted and stretching beyond the given was valued. Clearly, the incorporation of an innovation like QUILL would have different results in either of these extreme characterizations. But even subtle variations on these dimensions can have major effects on the realization, many of which have been described in chapters 5-7.
Characteristics of a social setting, including the cultural background of students and teachers, the institutional practices, constraints, and resources, and the classroom instructional environment, contribute to the different realizations of an innovation. In order to understand these realizations, we need to understand these settings in detail. In the QUILL study we collected information on these characteristics in various ways, including observations, interviews, and written reports by teachers. The information we gathered augmented the subsequent interpretations we made of QUILL's use. Analyses of QUILL's use in turn led us to rethink our initial assessment of the settings, even of the categories themselves.
The Realizations of the Innovation
The third aspect of a situated evaluation is to study the realizations of the innovation in different settings. The study of the realizations should attend to the three limitations of the standard paradigm described earlier. First, one should examine the ways the innovation was used and search for the reasons that changes occur. This includes examining whether the idealization was consonant or dissonant with existing social practices. It also includes analyzing how the innovation's use led to new social organizations, as in the emergence of a teacher community around the use of QUILL. Second, one should look at the variety of uses across settings, treating each of these as an independent re-creation of the innovation, rather than as a data point for an aggregate statement about the innovation. Third, one should examine changes in the design of the innovation brought about by its use and the ways these changes relate to new practices.
In the QUILL studies, we had access to a rich, intertextual corpus of materials for assessing realizations. These included our own field notes, writings by teachers about their classrooms, electronic mail discussing the implementations, student writing, interviews with students and teachers, practice profiles using the QUILL component checklists, and some videotapes. Thus we relied on direct observations, but to a large extent, also on what was already written. This is the typical situation one would find in doing a situated evaluation (cf. Clifford, 1986; Clifford & Marcus, 1986). But even with large amounts of text available, observations are essential to doing a situated evaluation.
Understanding the Reasons for Change. Extreme variations among realizations may lead one to feel that no valid generalizations about the innovation are possible. But the variations in use are actually beneficial for a situated evaluation. The reason is that our goal is not context-free summaries, but rather, hypotheses about how and why the innovation was realized in different ways in different contexts; in other words, the beginnings of understanding the reasons for change. Thus, situated evaluation seeks to identify new relevant variables to study. Through this process the evaluators may reach a deeper understanding of the idealization, elements of the settings, or the realizations, thus obtaining successively more refined analyses of the use of the innovation.
In some cases, we may observe realizations that are similar across diverse settings. This warrants the hypothesis that for that range of settings, the particular use is a shared one. An example of this was discussed in chapter 6 on revision. Recall that in virtually all of the QUILL classrooms, Writer's Assistant was used for copyediting. It was a tool for low-level correcting of texts, even when it was also used for higher level revision. Essentially, this reflects the convenient match between the text-editing capabilities of the computer and teachers' interest in having students learn spelling, grammar, punctuation, and capitalization and having them apply this knowledge in their writing. It also reflects the fact that copyediting is not inconsistent with either an emphasis on the mechanics of writing or an emphasis on its communicative functions. The practices we observed thus support an hypothesis about a common practice: All of the Alaska classrooms used the computer for copyediting.
In contrast, the use of the computer for true revision varied greatly. And when revision did occur the reasons for it also varied. Its appearance was often related to the social organization of the classroom, which in turn was affected by the presence of previous innovations (the Alaska Writing Project, Alexander's "Notes"), the availability of computers, cross-age grouping, or other situation-specific factors. Thus, to say that "some" or "many" classrooms used the computer for revising is empty without a characterization of why the revision occurred.
The characterizations produced by these analyses can be used in various ways. For users in different contexts they provide reasonable expectations about how the innovation might be realized. They also suggest what to change in the context in order to achieve particular results. For developers, the characterizations can be used in a formative way, to revise the innovation, perhaps by including more explicit ways to alter the context or to make the innovation more adaptable to different contexts.
Differences Across Settings. As we look for the reasons for change, we describe, then compare and contrast each of the realizations. The purpose is not to rank the effectiveness of the innovation across settings, nor is it to identify problem cases that must be discarded or analyzed separately, as they would need to be in the standard framework. Instead, the variations become the objects of study.
In the chapter on purpose, for example, we discussed alternate realizations of purposeful writing using MAILBAG. We saw how teachers' pedagogical goals led to different amounts and types of in-class message writing. There were also significant variations due to the presence of other innovations, classroom management issues, and students' own goals.
We thus focus on differences in use. This leads us to identify differential aspects of the settings that lead to different uses. Variation in use may suggest a functional relationship. We saw in chapter 6, for example, that the degree to which students' writing showed attention to real audiences and purposes reflected their teacher's philosophy about the importance of that attention. In classrooms in which the teacher focused on activities involving real communication, there were abundant examples of attention being paid to purpose and audience in writing. For instance, the writing of the Holy Cross brochure (described in chapter 5) required attention to a particular type of audience: outsiders who wrote for information about Holy Cross. In some other classrooms true communicative writing was rare. In one (non-Alaskan) classroom, we saw continuous use of QUILL with multiple writing assignments, but every text had only one audience, the teacher. These variations in use lead us to characterize the effect of the innovation as a function of elements of the setting in which it is used.
In some cases, one can identify an entire set of classroom practices as a separate realization. This makes sense when the practices are significantly different from other classrooms on several dimensions, as when, for example, a change in topic is consistently associated with a change in student collaboration patterns, a new role for the teacher, and new goals for a writing activity.
An example of this is the situated evaluation of Electronic Networks for Interaction (ENFI), an approach to teaching writing in which students communicate via terminals linked in a local area network (Bruce & Peyton, 1990). Although there is a consistent philosophy underlying ENFI and even a model for its use (at Gallaudet University), in a brief period it was realized in many different ways. Sixteen distinct realizations of what some might refer to as "the same innovation" were identified. At one site students reenacted dramatic literature using the network. At another, a professor used it for a Socratic tutoring approach to develop thinking skills. At yet another site it was used to support aspects of the writing process such as brainstorming and peer critiques. Each realization formed a coherent whole and arose from identifiable elements within the setting of use. The status of the realization was evident in the discourse of students and teachers, in the physical layout of the room, in the types of activities students engaged in, as well as in observational and interview data.
Changes in the Innovation. A situated evaluation should make it easier to describe not only differences across settings, but differences across time as the innovation changes. We have seen such changes in QUILL and in many other innovations. Change is a normal part of the process of implementation, a process described by Berman and McLaughlin (1975) as a mutual adaptation between an innovation and its social setting. The adaptation can be to any aspect of the innovation: its technological apparatus, the procedures for its use, or the support system. Even the underlying theory may be revised.
The focus in situated evaluation is on the setting as a complex, historically and culturally defined system, in which the innovation is one element. Thus, differences in versions of the innovation do not shape the design of the evaluation, but simply provide more variation to study. As a result, it is more feasible to compare and contrast cases of classrooms that use prototype versions of the innovation with those that use advanced versions, than it would be with summative evaluation, which is built on assumptions of a single entity being evaluated.
Part of this analysis is to examine how new users conceptualize the innovation. Such an examination bears some resemblance to formative evaluation. But in situated evaluation there would be no assumption that a particular setting of use was typical. Thus, the purpose would be to understand the varieties of actual use, not to identify a list of changes to the innovation.
The situated evaluation view of changes to an innovation encompasses more than changes to the technology, even defined to include the underlying theory and support system. Because the innovation does not unilaterally alter social practices but, rather, becomes incorporated into them, we need to examine the extended process of change in these practices. Thus the analysis avoids the untenable position that measures changes during one interval of an ongoing process to account for the effect of an innovation. For example, QUILL teachers sometimes added collaboration to their classrooms first by having one student read a piece for an author while he or she typed it, acting as a "computer helper." The reader would often also watch the screen to catch typing errors and would start to make comments on the piece itself. Later in the year, the teacher might make collaboration more formal by, for example, having teams of students work on articles for a class newspaper.
The experience with the QUILL teachers' network is another example of this change process (chapter 7). It represents an ongoing creation of an innovation as part of evolving social practices.
A key difference between situated evaluation and the standard frameworks is that its purpose is to learn first how the innovation is used, not how it ought to be changed or whether it has the claimed effects. Because it is concerned with actual use, it does not focus on the innovation or its effects, but rather on the social practices within the settings in which the innovation is re-created. This shift in focus has implications for the audience of the evaluation, the role of setting variability, the tools for evaluation, the time of assessment, and the presentation of results. For example, the goal of understanding the innovation-in-use leads to an emphasis on contrasts between uses rather than constancy. We can now summarize the discussion of situated evaluation by comparing it with the traditional types of evaluation:
TABLE 8.6
Comparisons Among the Three Types of Evaluation
______________________________________________________________________________________________________________________________________________
Dimension                Formative                          Summative                                         Situated
______________________________________________________________________________________________________________________________________________
Focus                    Innovation                         Effects of the innovation                         Social practices
Audience                 Developer                          User                                              User (but also developer)
Purpose                  Improve the innovation             Decide whether to adopt innovation                Learn how the innovation is used
Variability of settings  Minimized to highlight technology  Controlled by balanced design or random sampling  Needed for contrastive analysis
Measurement tools        Observation/interview/survey       Experiment                                        Observation/interview
Time of assessment       During development                 After initial development                         During and after development
Results                  List of changes to the technology  Table of measures contrasting groups              Ethnography
______________________________________________________________________________________________________________________________________________
Focus. Recall the Walberg and Haertel (1990, p. xvii) definition, that evaluation is an "examination of an educational curriculum, program, institution, organizational variable, or policy. The primary purpose . . . is to learn about the particular entity studied.... The focus is on understanding and improving the thing evaluated (formative evaluation), on summarizing, describing, or judging its ... outcomes (summative evaluation) ..." [italics added]. It is clear from the wording, and from most of the work on evaluation, that standard evaluation is concerned either with properties of the innovation alone or with its "effects." In contrast, situated evaluation focuses on the way the innovation becomes social practices.
Audience. Situated evaluation results can be used by both users and developers. Users can make decisions not only about whether to use the innovation, but how to use it in their particular context. Developers can learn how to revise the innovation taking into account the variations in use.
Purpose. For situated evaluation, the audience is broad, as are the actions. The results could lead to developers changing the innovation, to users changing their practices, to adoption of only parts of the innovation, or to deeper understanding of the process of use.
Variability of Settings. The central concern for situated evaluation is with characterizing the way an innovation comes into being in different contexts. Because the audience for the evaluation wants to know how to improve the use of the innovation, it is important to have a variety of contexts that they can compare to their own setting or to ones they might create. Thus, there is a need for a variety of contexts of use, or differences across settings. This is one reason why "situated evaluation" is not equivalent to "qualitative evaluation." Often, qualitative research is applied to emphasize common patterns and to dismiss idiosyncratic results. With situated evaluation it is important to capture the idiosyncrasies and to understand their origins.
Measurement Tools. With situated evaluation, the emphasis is on differences across contexts. This emphasis implies the use of qualitative tools, including observations and interviews that are structured to elicit information about recurring social practices in the setting and to draw out differences among realizations.
Time of Assessment. Situated evaluation can start once the innovation is developed enough to be placed in a classroom. This is in contrast to formative evaluation, which might start even earlier, in a laboratory setting. Situated evaluation can continue well after the developers have finished. It could be done before summative evaluation as a way to identify sites or issues to study, or afterwards as a way to study the process of change.
Results. Because a situated evaluation seeks to characterize alternate realizations, it requires multiple, detailed descriptions of specific uses. Changes need to be described using appropriate quantitative or qualitative representations, but more importantly, the reasons for changes need to be discussed and linked to characteristics of the settings of use. The process of change, including changes in the innovation, in the users, and in the setting, becomes paramount. For these reasons, narrative accounts of diverse uses are most useful. Thus, chapters 5, 6, and 7 are essentially stories of the QUILL experience.
The many realizations of an innovation reflect properties of the innovation-in-use, properties that emerge only in practice. These properties may seem ephemeral, based as they are on particularities of settings, but they are the only ones that matter for evaluation, for redesign of the innovation, for selecting appropriate settings of use, or for predicting future results of use. The examples in this book show the power of the social context to affect the ultimate uses of a new technology. How the features of the technology interact with human needs, expectations, beliefs, prior practices, and alternative tools far outweighs the properties of the technology itself. Thus, when we analyze the effects of an innovation, we must consider much more than an aggregate result such as the "average impact of the typical implementation."
We see situated evaluation as a new framework for understanding innovation and change. This framework has several key ingredients. It emphasizes contrastive analysis and seeks to explore differences in use. It assumes that the object of study is neither the innovation alone nor its effects, but rather the realization of the innovation, that is, the innovation-in-use. Finally, it produces hypotheses supported by detailed analyses of actual practices. These hypotheses make possible informed plans for use and change of innovations.