ESL EFL Essay Tests
Scoring criteria & Rating scale formats
The criteria used to judge an essay examination operationally define which content features and text structures constitute a “good” or at least a “competent” response. To be credible, criteria should not reflect the preferences of only a few individuals, but should represent standards endorsed by a community of professionals knowledgeable about the subject matter.
Secondly, the criteria should refer to those features of content and written expression which are amenable to instructional intervention. We cannot test what we do not teach in the classroom. For example, dimensions such as “depth”, “flavour” and “creativity” may enhance the quality of an essay, but a growing number of educators contend that it is neither logical nor fair to hold learners accountable for subject matter or writing expertise that the schools cannot demonstrate they can teach.
The criteria used to evaluate learners’ content and written expression vary along a number of dimensions. The variation may be as follows:
From qualitative value judgements to quantitative counts of information and text features;
From global reactions to analytical judgements;
From comprehensive attention to a range of concepts and text features to isolated focus on a particular piece of information or text feature;
From vague guidelines to replicable, precise definitions.
Generally, readers’ reactions to learners’ essays involve three levels of judgement:
1) Subjective, global impressions of overall quality
2) Analytic judgements about component text features
3) A holistic quality judgement combining subjective impressions with judgements about the quality of the combinations of text elements.
i. Global judgement
In general impression scoring, a rater reads an essay once and
assigns it a quality score. General impression ratings are global, heavily qualitative and are based upon
vague guidelines that may not refer to component text features or their differential weighting or
importance.
ii. Analytic judgement
The most quantitative, detailed and replicable scales are analytic rating scales, where readers assign several scores for various features of the essay. Analytic scales vary considerably in the range of content, rhetorical, structural and syntactic elements referenced and in the relative weights of these elements; that is, they differ in the importance they give to different features of a written assignment.
S. Mohanraj (1981) discusses the analytical rating scales of Caroll (1961), Alan & Campbell (1965), Cooper (1972), Davies (1977) and Pilliner. He has prepared a model of his own which includes twelve features of writing. He has further simplified it and arrived at a model suited to our situation, where teachers cannot spend much time correcting compositions. This model is quite practicable and easy to use.
A similar model is suggested by Rita M. Deyoe (1980). Her model gives more importance to grammatical aspects, whereas Mohanraj’s model attempts to concentrate on stylistic and discoursal features.
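The idea of an analytic scale, several feature scores combined according to relative weights, can be illustrated with a minimal sketch. The feature names, the weights and the 1 to 5 band range below are hypothetical and chosen only for illustration; they are not taken from Mohanraj's, Deyoe's or any other cited model.

```python
# Hypothetical analytic scale: the features, weights and 1-5 bands are
# illustrative assumptions, not any published rating scale.
WEIGHTS = {
    "content": 0.30,
    "organisation": 0.25,
    "grammar": 0.20,
    "vocabulary": 0.15,
    "mechanics": 0.10,
}


def analytic_score(ratings, weights=WEIGHTS, low=1, high=5):
    """Combine per-feature ratings into one weighted score on the same scale."""
    if set(ratings) != set(weights):
        raise ValueError("ratings must cover exactly the features in weights")
    for feature, value in ratings.items():
        if not low <= value <= high:
            raise ValueError(f"{feature} rating {value} is outside {low}-{high}")
    total_weight = sum(weights.values())
    # Weighted mean, so the result stays within the low-high band range.
    return sum(weights[f] * ratings[f] for f in weights) / total_weight


# One reader's ratings for one (invented) essay.
essay = {"content": 4, "organisation": 3, "grammar": 2,
         "vocabulary": 3, "mechanics": 5}
print(round(analytic_score(essay), 2))
```

Shifting weight from "grammar" to "organisation" in such a table is one way of expressing the difference between a grammar-oriented scale and one that concentrates on discoursal features.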
iii. Holistic judgement
Holistic scales, where readers assign a single score, often
combine characteristics of both general impression and analytic approaches. Holistic schemes vary widely in
the range of text elements contributing to each score point and the specificity with which score levels are defined
(Ingenkamp 1977, Quellmalz 1980).
Since the focus, specificity and objectivity of the criteria informing impressionistic, holistic or analytic approaches vary considerably, an examination programme should weigh carefully the nature of the criteria selected and their underlying rationale. Otherwise the programme may find that the criteria do not match well with the aims of the assessment and instructional programmes and do not provide a useful status report or diagnostic feedback. The need for explicit criteria is also apparent in scoring subject matter essay examinations. Learners commonly complain about the ambiguous, subjective criteria used for subject matter essay examinations in classroom assessments. When the results of large-scale achievement exams have serious consequences for learners, explicit, public and rational scoring keys are imperative.
d. Rating Procedures
When a large number of papers must be scored by a pool of
readers, an assessment programme must ensure that evaluation criteria are uniformly interpreted and applied.
Such standardization involves both the formulation of explicit criteria and procedures for training raters.
In the US rater training follows a fairly standard procedure. The following steps are employed to train
raters.
• There is a brief introduction to the rating scale.
• Then the raters begin to practice applying criteria to a set of papers representing the test sample.
• A trainer leads a discussion of the features of each paper that result in the classification of the paper to a particular grade.
Training time varies according to the number of separate scores recorded for each paper and according to the clarity of the criteria. The rigour of the procedures used to decide whether acceptable levels of rater agreement have been attained at the end of the training varies from a show of hands to pilot tests requiring independent scoring of essays.
In India, though essay examinations are widely used, there is no programme to train raters. Failure to conduct any structured training or to check on prior agreement levels may increase the risk of unreliable scoring.
e. Reliability
The reliability of an examination programme depends on the degree to which it eliminates measurement error. Four potential sources of error or score fluctuation identified for examinations of writing ability (but applying as well to tests of subject matter skills) are as follows:
• The writer: within-subject individual differences
• The assignment: variations in item or task content
• Between-rater fluctuations
• Within-rater instability
Within-subject error attributable to the writer can be reduced if learners are asked to write a series of essays instead of a single essay. The reliability of learners’ performance can then be determined by gathering data on a pool of homogeneous items or assignments. Since essay writing requires at least twenty or thirty minutes, it is often difficult to have learners write many essays in an examination. Moreover, studies of the consistency of learners’ performance across a series of essays often report low reliabilities for a single essay. According to Spencer (1979), analysis of the stability of learners’ writing performance across several essays is also not reliable, because of the variability introduced by differences in topic.
Some ways of overcoming the problem of reliability are as follows:
• Essay tasks should be based on specific skills of writing. This would reduce error variance due to the assignment.
• Essays should be collected on at least two parallel assignments. This would reduce error associated with individual variability.
• Scores on several essays should be combined to increase the reliability of subject matter essay examinations.
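Why combining scores on several essays raises reliability can be shown with the Spearman-Brown formula, a standard psychometric result that is not named in the discussion above and is introduced here only as an illustration. It assumes that the assignments are strictly parallel; where topics differ in difficulty, as Spencer's observation suggests they often do, the projected gain should be read as optimistic. The reliability values used are invented.

```python
def spearman_brown(single_reliability, n_essays):
    """Projected reliability of a score combined over n parallel essays.

    Assumes every essay task is a parallel measure of the same skill.
    """
    if not 0 <= single_reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    if n_essays < 1:
        raise ValueError("n_essays must be at least 1")
    r = single_reliability
    return n_essays * r / (1 + (n_essays - 1) * r)


# Invented example: a single essay with a reliability of 0.50.
for k in (1, 2, 3, 4):
    print(k, round(spearman_brown(0.50, k), 2))
```

Under these assumptions a single-essay reliability of 0.50 rises to about 0.67 with two essays and to 0.75 with three, which is the reasoning behind collecting at least two parallel assignments.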
Inter-rater agreement is the most prevalent issue concerning reliability in essay examinations. Statistical indices of agreement levels include coefficient alpha, generalisability coefficients, point-biserial correlations and simple percentages of agreement. The most effective method of reducing inter-rater variability is to provide training on clearly specified criteria. To reduce error due to within-rater score fluctuations over time (rater drift) caused by reader fatigue and/or carelessness, some form of interspersed check procedure seems helpful, according to Quellmalz (1980). Although some studies report that readers tend to become more lenient or harsher as rating progresses, few assessment programmes routinely monitor this problem.
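Two of the indices mentioned above, simple percentage of agreement and coefficient alpha, can be computed as in the following minimal sketch. The scores are invented, the function names are hypothetical, and alpha is calculated here by treating each rater as an "item", which is one common way of applying it to rating data rather than the only one.

```python
from statistics import variance


def percent_agreement(rater_a, rater_b, tolerance=0):
    """Percentage of essays on which two raters differ by at most `tolerance`."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of essays")
    hits = sum(abs(a - b) <= tolerance for a, b in zip(rater_a, rater_b))
    return 100.0 * hits / len(rater_a)


def coefficient_alpha(scores_by_rater):
    """Coefficient alpha with raters treated as items (one score list per rater)."""
    k = len(scores_by_rater)
    if k < 2:
        raise ValueError("at least two raters are required")
    # Total score each essay receives across all raters.
    totals = [sum(essay_scores) for essay_scores in zip(*scores_by_rater)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    rater_var = sum(variance(scores) for scores in scores_by_rater)
    return k / (k - 1) * (1 - rater_var / total_var)


# Invented scores from two raters on the same six essays (1-5 scale).
rater_a = [4, 3, 5, 2, 4, 3]
rater_b = [4, 2, 5, 3, 4, 3]

print(round(percent_agreement(rater_a, rater_b), 1))       # exact agreement
print(percent_agreement(rater_a, rater_b, tolerance=1))    # within one band
print(round(coefficient_alpha([rater_a, rater_b]), 2))
```

For these invented scores exact agreement is about 66.7 percent, agreement within one band is 100 percent and alpha is 0.90, which shows how differently the same pair of raters can look depending on the index chosen.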
Mike Hayhoe (1983), in his article ‘A Historical Review of Essay Marking’, discusses the problem of reliability in marking essays. According to him, this problem has persisted throughout the history of essay marking. If Rowntree was concerned about marker reliability in the 1880s, Raleigh (1980) is equally worried about the same problem. Hayhoe says that an error of twenty-five percent in grading an essay may be a conservative estimate, and it has been suggested that the problem of unreliability in marking essays exists in internal as well as external assessment.
Reliability is inextricably linked with validity. The
reliability of an essay examination depends on how valid the examination is and how valid the markers are in their
assessment. A brief consideration of the problems faced by examiners in designing valid examinations is
necessary if one wants to integrate testing and instruction.
f. Validity
The validity of an examination derives from evidence that the test accurately and dependably measures the specified skills. Evidence for the validity of an examination may take several forms.
i. One form focuses on the test content, that is, the test items or essay assignment, and gathers the judgements of subject matter experts regarding:
• The objectives or skills defined to be important and representative of subject matter competencies, and
• The way these skills are elicited in the item, problem or writing assignment.
ii. Other forms of validity focus on test performance to examine the following:
• Concurrent validity: whether the scores are comparable to scores on other tests of the same skills,
• Predictive validity: whether the score levels predict future success, and
• Construct validity: whether the performance pattern appears to measure the underlying trait.
The most common method of attempting to establish the validity of essay examinations has been the comparison of scores with ‘related’ measures. In the case of tests of writing ability, the ‘other’ measures chosen as criterion variables are often reading tests, multiple-choice tests or class grades.
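Such comparisons are usually summarised as a correlation between essay scores and the criterion measure. The sketch below computes a Pearson correlation for invented data; the variable names and scores are hypothetical, and a high coefficient would show only that the two measures rank learners similarly, not that either one measures writing ability.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list does not vary")
    return sxy / sqrt(sxx * syy)


# Invented scores for five learners: essay marks and a reading test.
essay_marks = [12, 15, 9, 18, 14]
reading_test = [30, 34, 25, 40, 33]
print(round(pearson_r(essay_marks, reading_test), 2))
```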
The heart of the validity of a test is whether it measures the underlying skill construct, that is, whether it taps the hypothetical mental store of information and strategies. According to Raleigh (1980), the validity of an examination can be described in terms of the degree to which it ‘measures well’ what it is intended to measure. According to Mike Hayhoe (1983), it is also possible to think of ‘marker validity’, that is, the degree to which a marker ‘measures well’ what the assessment system sets out to measure.
g. Factors affecting marking
The marks awarded to an essay depend on a number of things. For example, Thorndike (1986) discusses the problem of ‘uniqueness’. Uniqueness raises the issue of divergence (the individuality of the work) and convergence (notions of correctness and orderliness). How far a marker is affected by divergence and convergence will decide the marks he gives to a particular assignment. Wiseman and Wrigley (1958) identified two schools of thought as far as assessors’ value bases are concerned. One school values the ‘imponderables’ of validity, freshness and fluency. The second school of thought sees the writer as ‘a craftsman able to show his skill whatever type of materials he works in’.
Britton (1963) found some evidence to suggest that teachers may well group towards valuing one end or the other of the following pairs of poles:
• Sophisticated, conventional written based work versus work based on familiar speech
• Work based on imagination, including fantasy / the unreal, versus work based on observation of real life
A number of studies conducted in America suggest that teachers tend to cluster in favouring certain criteria (ideas, form, flavour, mechanics, wording) and that the cluster of criteria adopted by a teacher can affect grading.
Deale (1975) feels that the ‘adequacy’ of the writing, rather than its ideas, affects the marks awarded. Soloff (1973) argues that a lack of consonance between the writer’s values and those of the assessor on a topic may affect the grade awarded. The London Association for the Teaching of English shares his opinion. In its pamphlet Assessing Compositions (1965), it expresses concern about how an assessor may react to experiences and attitudes in an essay which are unfamiliar to him, and about the potential for under- or over-assessing the work.
Marshall (1960) suggests that assessment in terms of the features of pieces of work which ‘float’ to the examiner (his intuitions about the texts) is the proper activity of an alert and sensitive marker.
Markers can be affected by visual features at the expense of such aspects as organisation, fluency, and appropriateness in terms of task, audience and so on. According to Mike Hayhoe (1983), this may be because the visual features are more immediately obvious, especially when they are flawed, and because there is a greater degree of consensus about them than there is about what ‘coherence’, ‘clarity’ or other more global criteria may be.
Marshall (1967) and Scannel (1966) found assessors to be particularly adversely affected by spelling errors, with errors of grammar and punctuation coming next. Handwriting also has a great impact on assessors, and many researchers, such as Chase (1968), Briggs (1970) and Soloff (1973), have demonstrated the power of this feature in affecting marking. In his more recent work, Briggs (1980) goes further, suggesting that there may be borderline areas in grading in which this visual aspect of a piece of writing may be the major factor in deciding what it is worth.
Yates and Pidgeon (1957) found that the setting of an essay affected the markers’ response. If an ‘average’ piece of work followed several fine pieces, it was likely to be marked harshly; if it followed several poor ones, it was likely to be upgraded.
The analysis of the present situation in Gujarat also reveals that teachers are most concerned with spelling errors and punctuation. Grammatical errors come next. Though all the teachers marked a number of features in the questionnaire (appropriacy, organisation, overall writing ability etc.) as very important, all of them assign a single grade on the basis of their overall impression of the composition.
i) Drawbacks of Essay Examinations
Essay examinations are said to test learners’ ability to engage
in disciplined thought and the ability to express it in a coherent, supported discourse. But a number of
points need to be taken into account if essay examinations are used to measure writing ability.
Some of the problems involved in using essay type tests are as follows:
It is difficult for an average teacher to structure prompts for essay tests that clearly specify the aim, topic, audience, writer’s role and evaluation criteria. The problems of reliability and validity and the factors that affect marking, discussed in this section, show that it is very difficult to measure writing ability through essay examinations.
Teachers cannot spend a lot of time checking essays using analytic or holistic rating scales. The general impression score usually assigned by teachers is not a reliable method of scoring.
Training raters is expensive and time-consuming and is not practicable as far as schoolteachers are concerned.
Since it is not easy to structure, administer and score essay examinations, we need to consider other types of tests which are easy to construct and evaluate and which give a reliable and valid indication of learners’ proficiency in communicating through writing.