
The Importance of Effective Testing and Assessment in Education



Introduction

This paper details the development of a writing test and the evaluation of its validity, reliability, and practicality. It is the joint work of partners who designed a short test of writing skills relevant to the ELI context. The test has been evaluated in light of the principles of language assessment. The paper first reviews the literature on assessing writing skills and on testing techniques. The literature review then examines the principles of language assessment, including content validity, construct validity, inter-rater reliability, and practicality. It also reviews the literature on test specifications and how they relate to validity, reliability, and practicality before discussing models and formats of test specifications. Bachman and Palmer’s format was chosen as the test specification model. The writing test in the appendix is analyzed by discussing its test specifications according to that format. Finally, the paper discusses test evaluation methods with respect to content and construct validity, inter-rater reliability, and practicality. A writing test developed from test specifications is valid, reliable, and practical.


Literature Review

Assessing Writing Skill

Continuous assessment of writing skills is vital to teaching them effectively. Curriculum-based assessments can be used to assess both the process and the product of writing, and should also consider its purpose. According to Tedick (1990), writing skills can be assessed through self-observational and observational checklists. Writing products, in turn, can be evaluated on five product factors: content, fluency, conventions, vocabulary, and syntax. Douglas (2010) asserted that writing samples should also be assessed across different writing purposes to give a holistic picture of students’ writing performance across text genres and structures. These simple measures can fulfill different assessment functions, including planning instruction, identifying strengths and weaknesses, giving feedback, evaluating instructional activities, reporting progress, and monitoring performance.

The teacher’s first responsibility is to provide writing opportunities and encouragement to students who are attempting to write. The second is to promote success in students’ writing. Teachers do this by carefully monitoring student writing to assess its weaknesses and strengths, teaching specific writing strategies and skills in response to students’ needs, and giving feedback that reinforces newly learned skills and corrects recurring problems. On inspection, these responsibilities reveal that the assessment of writing skills is a vital part of good classroom instruction.

Airasian (1996) identified three types of classroom assessment for writing skills. The first, “sizing-up” assessments, are usually conducted during the first week of school to give the teacher quick information about students’ writing levels before instruction begins. The second type, instructional assessment, is applied during the daily tasks of planning classroom instruction, monitoring students’ progress, and giving feedback. The third type, which Airasian (1996) called official assessments, comprises the formal, periodic assessment functions of grading, grouping, and reporting. In summary, teachers apply assessment methods to identify students’ strengths and weaknesses, plan instruction to fit the needs diagnosed, give feedback, evaluate instructional activities, report on progress, and monitor performance.

Test Technique

According to Fulcher (2010), curriculum-based assessment of writing skills must start with an inspection of the curriculum. Many writing curricula are based on conceptual models that encompass purpose, process, and product. The conceptual model therefore forms the framework for the assessment techniques.

Fulcher and Davidson (2007) pointed out that the diagnostic uses of assessment, that is, determining students’ instructional needs and the reasons for their writing problems, are best met by examining the writing process: the strategies students use and the steps they go through as they work on their writing. Does the student have a strategy for organizing his or her ideas? How much planning does the student do before writing? What are the possible obstacles to getting thoughts onto paper? Does the student read through and revise what they have written? How does the student attempt to spell words they do not know? Does the student share or talk with others as they write? What kinds of changes does the student make to his or her first draft?

Hughes (2003) indicated that to make relevant instructional observations, the observer must use a conceptual model of how the writing process should proceed. Educators have not reached consensus on the number of steps in the writing process: Elbow (1981) proposed two steps, while Frank (1979) proposed nine. Englert et al. (1991), on the other hand, provided a five-step model of the writing process under the acronym POWER: Plan, Organize, Write, Edit, and Revise. Each step has its own strategies and sub-steps that become progressively more sophisticated as students mature as writers. Observation can also be used as a test technique to assess students’ writing process as they move through these steps.

According to Messick (1989), having students assess their own writing process as a test technique is advantageous for two reasons. First, self-assessment gives students an opportunity to observe and reflect on their approach, drawing attention to significant steps they may have overlooked. Second, self-assessment that follows a conceptual model such as POWER provides a means of internalizing an explicit strategy, allowing the student to mentally rehearse its steps.

Principles of Language Assessment

Content Validity

Validity is the extent to which the inferences drawn from assessment results are meaningful, appropriate, and useful in terms of the assessment’s purpose (Gronlund, 1998, p. 226). According to Hughes (2003) and Mousavi (2002), a test can claim content validity if it samples the actual subject matter from which conclusions are to be drawn and requires test takers to perform the behavior being measured. Nation (1993) asserted that one can usually identify content validity through observation if the achievement being measured is clearly defined. A test administered in written form that requires students to read a passage and then write their responses is therefore high in content validity for a writing class.

Brown (2010) stated that another way to understand content validity is to consider the difference between direct and indirect testing. In direct testing, the test taker actually performs the target task; in indirect testing, learners do not perform the task itself but instead perform a task related to it in some way. For instance, a test may target learners’ oral production of syllable stress, yet the task has them mark the stressed syllables with written accent marks in a list of written words. One can argue that such learners are being tested indirectly on oral production of stressed syllables. However, the most feasible rule for achieving content validity is to assess performance directly.

Construct Validity

A construct is an element of a hypothesis, model, or theory that attempts to explain phenomena observed in our perceptions of the universe. Constructs may or may not be directly or empirically measurable; their verification often requires inferential data. Virtually every issue in language learning and teaching involves theoretical constructs. Brown (2010) observed that in the assessment field, construct validity asks whether a test actually taps into the theoretical construct as it has been defined. Davidson, Hudson and Lynch (1985) stated that, in a manner of speaking, tests are operational definitions of constructs in that they operationalize the entity being measured.

For instance, suppose a teacher has created a simple written vocabulary quiz covering the content of a recently completed unit. The test asks students to correctly define a set of words that is a perfectly adequate sample of everything covered in the unit. However, the unit’s lexical objective was the communicative use of the vocabulary. Writing definitions of words therefore fails to match the construct of communicative language use.

According to Brown (2010), construct validity is a major issue in validating large-scale standardized proficiency tests. Because such tests must adhere to the principle of practicality for economic reasons, and because they must sample limited areas of language, they may not be able to contain all the content of a particular skill or field. For example, until recently TOEFL did not attempt to sample oral production, yet oral production is a significant part of academic success in a university course of study. TOEFL’s omission of oral production content, however, was ostensibly justified by research (Duran et al., 1985) indicating positive correlations between oral production and the behaviors actually sampled on the TOEFL: listening, reading, grammaticality detection, and writing. Given the need to offer financially affordable proficiency tests and the high cost of administering and scoring oral production tests, the omission of oral content from the TOEFL was justified as an economic necessity.

Inter-Rater Reliability

According to Brown (2010), a reliable test is consistent and dependable. If you give the same examination to the same student, or to matched students, on two different occasions, it should produce similar results. The question of test reliability is best addressed by considering the several factors that may contribute to a test’s unreliability. Mousavi (2002, p. 804) proposed the following possibilities: fluctuations in the student, in scoring, in test administration, and in the test itself.

Student-Related reliability

According to Mousavi (2002, p. 804), the most common learner-related reliability issue is caused by temporary illness, fatigue, anxiety, a bad day, and other physical and psychological factors that make the “observed score” deviate from the learner’s “true score.” Also included in this category are test-wiseness and efficient test-taking strategies.

Rater Reliability

Human error, subjectivity, and bias may enter into the scoring process. According to Mousavi (2002, p. 804), inter-rater unreliability occurs when two or more scorers produce inconsistent scores on the same test, possibly because of inexperience, inattention to the scoring criteria, or preconceived biases.

Rater reliability issues are not limited to contexts involving two or more scorers. Mousavi (2002, p. 804) pointed out that intra-rater unreliability is a common occurrence for classroom teachers because of unclear scoring criteria, bias toward particular “good” and “bad” students, fatigue, and even simple carelessness.

When a teacher or examiner faces many tests to grade in a short period of time, the standards applied, however subliminally, may drift: the first few tests graded will be treated differently from the last few. The examiner may be “harder” or “easier” on the first few papers, or may simply tire, so results may be inconsistent across the set. Mousavi (2002, p. 804) suggested that the best remedy for teachers’ intra-rater unreliability is to read through approximately half of the tests before assigning any final grades or scores, and then to recycle back through the entire set to reach an even-handed judgment.

Brown (1991) indicated that in tests of writing skills, rater reliability is particularly hard to achieve, since writing proficiency involves several traits that are difficult to define. However, careful specification of an analytic scoring instrument can increase rater reliability.

Test Administration Reliability

Mousavi (2002, p. 804) pointed out that unreliability may also arise from the conditions of test administration. For instance, an oral comprehension test administered through a tape recorder can be affected by street noise if the classroom is in a noisy environment, hindering the students’ ability to hear accurately. Other sources of unreliability include differences in the amount of lighting in a classroom, variations in photocopying, the condition of desks and chairs, and even variations in temperature.

Test Reliability

The nature of the test itself can sometimes cause measurement error. If a test is too long, test takers may be fatigued by the time they reach the last items and respond hastily and incorrectly. Similarly, Mousavi (2002, p. 804) asserted that timed tests may discriminate against students who do not perform at their best under time pressure. Poorly written test items, for instance items that are ambiguous or have more than one correct answer, may be a further source of test unreliability.

Practicality

According to Brown (2010), an effective test is a practical test. This implies that the test is not excessively expensive, stays within appropriate time constraints, is relatively easy to administer, and has a scoring/evaluation procedure that is specific and time-efficient.

Brown (2010) indicated that a test that is prohibitively expensive is impractical. Similarly, a language proficiency test that takes a student five hours to complete is impractical, since it consumes more time and money than necessary to accomplish its objective. Furthermore, Brown (2010) stated that a test requiring individual one-on-one administration or proctoring is impractical for a group of hundreds of test takers and only a few examiners. A test that takes a student a few minutes to complete but an examiner many hours to evaluate is also impractical in classroom situations, as is a test that can only be scored by computer when it is administered miles from the nearest computer available for scoring. The quality and value of a test sometimes hinge on such practical, nitty-gritty considerations (Brown, 2010).

Test Specifications

Test specifications are the generative blueprints for the design of a test (Davidson and Lynch, 2002). Test specifications for classroom use can be a simple, practical outline of the test, while for tests intended for large-scale distribution and use, the specifications are much more detailed and formal.

Models and formats

Davidson and Lynch Model

Davidson and Lynch (2002) stated that there is no single magic formula or best format for test specifications, since there are many ways of designing one. Their model builds on one developed earlier by Popham (2008) and has five components.

General description

This section of the test specification states the focus and object of the assessment, indicating the skill and behavior to be tested. It usually also includes the motivation or reason for testing and a statement of purpose.

Prompt Attributes

This second section of the model details what will be given to the test taker. It includes information about the test format, item selection, the actual item or form, and a detailed description of what test takers will be asked to do. It also includes the instructions and directions the test taker will read.

Response Attributes

This section details how the test taker will respond to the item or task.

Sample Item

The purpose of this section is to make the language of the general description, prompt attributes, and response attributes lively (Davidson and Lynch, 2002, p. 26).

Specification Supplements

This section is designed to allow the specification to include as much information and detail as possible without making the language of the general description, prompt attributes, and response attributes unwieldy.

Alderson, Clapham and Wall Model

Test specifications, as recommended by Alderson et al. (1995), should vary in content and format according to their audience. Alderson et al. (1995) propose separate specification documents for test users, test validators, and test writers. In a specification developed for writers, they propose including information on the following areas: a general purpose statement, test focus, test battery, sources of texts, test tasks, item types, and rubrics.

Similarly, Alderson et al. (1995) recommended that specification documents for test validators include information on the constructs being assessed and the language ability on which those constructs are based. Specifications for test users should probably be written in lower-level language and include information important to the test taker. The authors recommended that specifications for test users contain information such as the test’s statement of purpose, complete tests or sample items to review, and a description of the performance expected at key levels.

Bachman and Palmer Model

Bachman and Palmer’s (1996) specification has two parts: the structure of the test and the test task specifications. In the test structure part, Bachman and Palmer (1996) advise including information such as the number of subtests, their relative importance and order, and the number of tasks/items per part. The test task specification part has the following components: purpose, definition of the construct, setting, time allotment, instructions, characteristics of the input and the expected response, and the scoring method.
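
To make the two parts concrete, the sketch below renders a Bachman and Palmer style specification as a plain Python dictionary. The field names follow the components just listed; the values are illustrative only, drawn from the writing test specified later in this paper, and are not part of the model itself.

```python
# A minimal sketch of a Bachman and Palmer style test specification as a
# plain data structure. Field names follow the two parts described above;
# the values are illustrative, based on the writing test this paper
# specifies later.
writing_test_spec = {
    "structure": {
        "number_of_subtests": 1,          # a single writing task
        "relative_importance": "100% of the total mark",
        "items_per_part": 1,
    },
    "task": {
        "purpose": "assess paragraph-level writing ability",
        "construct": "writing ability: structure, cohesion, lexical range, "
                     "grammar and mechanics, sentence completeness",
        "setting": "classroom administration (details unspecified)",
        "time_allotment_minutes": 50,
        "instructions": "choose one job and write a paragraph of at least "
                        "40 words describing it",
        "input_and_expected_response": "job prompt -> descriptive paragraph",
        "scoring_method": "analytic rubric scored by two raters",
    },
}
```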

The Bachman and Palmer model has been chosen for several reasons. First, its two-part structure makes it more inclusive and detailed than other models. Second, the model represents a reorganization and relabeling of the Davidson and Lynch model, and hence an improvement on it. The model was therefore chosen for its advanced specification and inclusive nature.

Developing A Writing Test

Reverse engineering

A ready-made test already exists (see Appendix 1), so this section simply details its specification according to the Bachman and Palmer format. Reverse engineering, according to Miller (2011), is the creation of a specification document from a test, or set of test items, that already exists. Davidson and Lynch (2002, p. 41) stated that not all testing is specification driven; they defined spec-driven tests as those created from a specification. In spec-driven settings, much of the investment goes into the creation, maintenance, and evolution of the specs as well as of the tests themselves. In many institutions, however, tests and testing programs exist without the use of formalized test specifications. In these scenarios, developers rely on history and institutional memory, or on intuitions about what makes an effective test.

Reverse engineering can be useful when a school wants to move from non-spec-driven testing to a spec-driven situation (Davidson and Lynch, 2002, p. 43). It can also help clarify whether a particular test is spec-driven, improve specifications, create a spec for existing tests, and critique existing specifications.

Test Specifications: Bachman and Palmer’s Format

The specification of the test in Appendix 1 is given below according to Bachman and Palmer’s format. The model adopts the five components of the Popham model under different labels.

Purpose

This is a statement of how the task/test item should be used. The purpose of this test is explicitly clear: it is a writing exam.

Definition of construct

This is a detailed description of the particular aspect of language ability being tested, i.e. the construct. It also covers the inferences that can be made from the test scores, which overlaps with the test’s purpose. In this test, the construct being measured is writing ability. The aspects of writing tested are paragraph structure and length, content and cohesion, lexical range, grammar and mechanics, and sentence completeness.

Setting

This is a listing of the characteristics of the setting in which the test will be administered, such as the time of administration, the participants, and the physical location. The setting for this test is unclear: no physical location has been indicated, and the participants and the start and end times are also not specified.

Time allotment

This is the amount of time allowed for completing a specific task or set of items on the test. The time allotment for this test is 50 minutes.

Instructions

This is a listing of the language to be used in the directions to test takers for the particular task/item. The instructions for this test are to choose one of the given jobs (businessman/businesswoman or pilot) and then write a paragraph of at least 40 words describing the job and saying why they would or would not want to have it in the future.

Characteristics of the input and the response expected

This is a description of what will be presented to test takers and what they will be expected to do with it. The input in this test asks the taker to select one of the jobs and write a paragraph of at least 40 words describing it.

The expected response is a paragraph of at least 40 words giving a detailed description of the selected job and the reasons why the test taker would or would not want to have it in the future.

Scoring methods

This is a description of how the test taker’s response will be evaluated (see Appendix 2), with the results given in Appendix 3. The scoring method covers the following criteria:

Paragraph structure and length: whether the test taker’s paragraph consists of a clearly organized topic sentence, supporting sentences in the body, and a concluding sentence, and whether it is approximately 50 words long.
Content and cohesion: whether the content is relevant and focused, showing full understanding of the topic with no digression, and whether the learner properly links sentences with linking words.
Lexical range: whether the test taker uses a wide range of relevant, appropriate lexical items without confusion.
Grammar and mechanics: whether the test taker uses grammatical structures and mechanics accurately and effectively, with minimal errors in tense choice, spelling, subject-verb agreement, capitalization, and prepositions.
Sentence completeness: whether the test taker produces no incomplete or fragmented sentences.
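
Since the analytic rubric drives the scoring, it can be treated as a small configuration object. The sketch below encodes the five criteria as a Python dictionary; the equal 2-point weight per criterion, which yields the 10-point totals seen in Table 1, is an assumption made for illustration and is not stated in the original rubric.

```python
# A sketch of the analytic scoring rubric as a data structure. The five
# criteria come from the test specification above; the 2-point maximum per
# criterion (for a 10-point total) is an assumption for illustration.
RUBRIC = {
    "paragraph_structure_and_length": 2,
    "content_and_cohesion": 2,
    "lexical_range": 2,
    "grammar_and_mechanics": 2,
    "sentence_completeness": 2,
}

def total_score(criterion_scores: dict) -> int:
    """Sum per-criterion scores, capping each at its rubric maximum."""
    return sum(
        min(criterion_scores.get(name, 0), max_points)
        for name, max_points in RUBRIC.items()
    )

# Example: a script strong on structure and content but weaker elsewhere.
print(total_score({
    "paragraph_structure_and_length": 2,
    "content_and_cohesion": 2,
    "lexical_range": 1,
    "grammar_and_mechanics": 1,
    "sentence_completeness": 1,
}))  # -> 7
```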

Test Evaluation Methods

The principles of language assessment provide useful guidelines both for evaluating existing assessment procedures and for designing new ones. Tests, quizzes, and final exams can all be evaluated through the lens of these principles.

Content Validity Evaluation

Content validity is the extent to which an evaluation requires students to perform tasks that were included in the previous lessons and that directly represent the objectives of the unit on which the evaluation is based. Brown (2010) proposed two steps for evaluating the content validity of a classroom test.

Are the objectives of the classroom appropriately defined and formed?

The objectives of the unit, module, or lesson underlie every good classroom test, so identifying those objectives is the first measure of an effective test. In many instances, teachers work with poorly framed objectives, or with so little cognizance of the objectives they are seeking to fulfill, that determining whether the objectives were met becomes impossible.

Are the objectives of the lesson represented in the form of test specifications?

Basing test specifications on lesson objectives means that the test should have a structure that follows logically from the lesson or unit being tested. Many well-designed tests are divided into several sections, perhaps corresponding to the objectives under evaluation; offer students a variety of item types; and give appropriate weight to each section (Brown, 2010).

Construct Validity Evaluation

Construct validity is the extent to which an evaluation requires teachers to evaluate tasks based on the constructs underlying the unit, lesson, or module. It can be measured empirically from inferential data drawn from students’ evaluation scores. Teachers should ask themselves whether their tests and evaluations involve the defined theoretical constructs of the lessons (Brown, 2010).

Inter-Rater Reliability Evaluation

Rater reliability is commonly overlooked, since classroom tests in most instances do not involve two scorers. Brown (2010) asserted that inter-rater reliability is therefore not usually an issue, but intra-rater reliability is a constant concern: teachers need mechanisms for maintaining their stamina and concentration over the time it takes to score assessments.

Practicality Evaluation

Practicality is evaluated against the students’ and teachers’ time constraints, administrative details, and costs, and to some extent against what occurs before and after the test. The practicality evaluation checklist suggested by Brown (2010) is listed below:

  1. Are the administrative details instituted before the test?
  2. Can test takers complete the test within the reasonable time frame set?
  3. Can the test be administered smoothly, without procedural glitches?
  4. Are the equipment and materials ready?
  5. Are the test costs within the limits of the budget?
  6. Is the evaluation/scoring system feasible within the teacher’s time frame?
  7. Are the mechanisms for reporting the results laid down in advance? (Weir, 2005)

Table 1: Results of the Students

N. Student Name Rater 1 Mark Rater 2 Mark
1 Alyaa Alzahrani 7 7
2 Amjaad Alotaibi 7 7
3 Asalah Almotairi 6 6
4 Batool Alhabeeb 9 9
5 Boshraa Almalki 6 5
6 Danya Alaydaros 6 6
7 Hind Almoaalm 9 9
8 Johara Alzahrani 10 10
9 Lama Alfarran 5 5
10 Lamees Blal 7 7
11 Maha ??? 4 4
12 Manal Hazazi 7 7
13 Mashaael Alsolami 9 9
14 Mezneh Alshaikhi 4 3
15 Momainah Barood 4 4
16 Norah Khalaf 7 7
17 Rawan Alamodi 8 8
18 Rawan Alghamdi 8 7
19 Rehab Alzahrani 6 6
20 Sanaa Mohammad 9 9
21 Sondos Zaini 7 7


One-Sample T: Rater 1 Mark

Variable N Mean StDev SE Mean 95% CI

Rater 1 Mark 21 6.905 1.758 0.384 (6.105, 7.705)

Figure 1

The graph shows that the Rater 1 marks are approximately normally distributed, with a mean of 6.905 and a standard deviation of 1.758; a t statistic is used to test the results. The bell curve shows that Rater 1’s average mark is about 7, and the standard deviation of about 1.76 means that, on average, each student’s mark differs from the mean by roughly 1.76 points. From the graph, the Rater 1 marks appear valid, but the students’ performance level is below average.
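
The one-sample t interval reported above can be reproduced directly from the Rater 1 column of Table 1. A short sketch in Python, assuming the standard library plus scipy.stats is available, recovers the same mean, standard deviation, standard error, and 95% confidence interval:

```python
# Reproduce the one-sample t results for the Rater 1 marks in Table 1.
from statistics import mean, stdev
from math import sqrt
from scipy import stats

rater1 = [7, 7, 6, 9, 6, 6, 9, 10, 5, 7, 4, 7, 9, 4, 4, 7, 8, 8, 6, 9, 7]

n = len(rater1)          # 21 students
m = mean(rater1)         # 6.905
s = stdev(rater1)        # 1.758 (sample standard deviation)
se = s / sqrt(n)         # 0.384 (standard error of the mean)

# 95% confidence interval from the t distribution with n - 1 = 20 df.
lo, hi = stats.t.interval(0.95, df=n - 1, loc=m, scale=se)
print(f"mean={m:.3f} sd={s:.3f} se={se:.3f} 95% CI=({lo:.3f}, {hi:.3f})")
# -> mean=6.905 sd=1.758 se=0.384 95% CI=(6.105, 7.705)
```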

Figure 2

The graph shows the inter-rater reliability of Rater 1 and Rater 2. Alyaa Alzahrani’s mark remains the same for both raters, and she is below the expected level; Amjaad Alotaibi shows the same pattern, and indeed most of the students are below the expected level. Since there is almost no variation between the Rater 1 and Rater 2 marks, the test is reliable: the choice of rater does not affect the students’ marks.
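
Two simple numerical checks of this inter-rater consistency can be computed from Table 1: the exact-agreement rate (the share of students given identical marks by both raters) and the Pearson correlation between the two raters' marks. A sketch in Python, using statistics.correlation from Python 3.10+:

```python
# Quantify inter-rater agreement for the marks in Table 1.
from statistics import correlation  # Python 3.10+

rater1 = [7, 7, 6, 9, 6, 6, 9, 10, 5, 7, 4, 7, 9, 4, 4, 7, 8, 8, 6, 9, 7]
rater2 = [7, 7, 6, 9, 5, 6, 9, 10, 5, 7, 4, 7, 9, 3, 4, 7, 8, 7, 6, 9, 7]

# Exact agreement: share of students given identical marks by both raters.
agreement = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
print(f"exact agreement: {agreement:.1%}")   # 18 of 21 marks -> 85.7%

# Pearson correlation: near 1.0 means the raters rank students consistently.
print(f"correlation: {correlation(rater1, rater2):.3f}")
```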

The grading system used for the class is in table 2 below and the graphical representation in figure 3:

Table 2: Grading system

A 100 – 90
B 89 – 80
C 79 – 70
D 69 – 60
F 59 and below


Figure 3
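
Assuming a mark out of 10 converts directly to a percentage (7/10 = 70%, and so on), Table 2's bands can be applied to the Table 1 marks with a small helper. This conversion rule is an assumption for illustration, since the paper does not state how the 10-point marks map to the percentage scale.

```python
# Map Table 2's grading bands onto the 10-point test marks, assuming a
# mark out of 10 converts directly to a percentage (an assumption here).
def letter_grade(mark_out_of_10: int) -> str:
    percent = mark_out_of_10 * 10
    if percent >= 90:
        return "A"
    if percent >= 80:
        return "B"
    if percent >= 70:
        return "C"
    if percent >= 60:
        return "D"
    return "F"

# Example: Rater 1's marks from Table 1 translate mostly to C and below.
marks = [7, 7, 6, 9, 6, 6, 9, 10, 5, 7, 4, 7, 9, 4, 4, 7, 8, 8, 6, 9, 7]
print([letter_grade(m) for m in marks])
```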

Conclusion

In conclusion, a writing test developed from test specifications is valid, reliable, and practical, and the test developed here met all three criteria. It was content valid because it tested the writing content of the lessons, and it was construct valid because it was based on the lessons’ underlying theoretical constructs. It was also inter-rater reliable, given the minimal differences between the marks of the marker and the cross-checker. Finally, it was practical: the time allocated was sufficient, the number of students was manageable, and there were enough examiners. The test was properly constructed, organized, and marked, making it content and construct valid, reliable, and practical.

References

Airasian, P. W. (1996). Assessment in the classroom. New York: McGraw-Hill.

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge, England: Cambridge University Press.

Bachman, L., & Palmer, A. (1996). Language testing in practice. Oxford: Oxford University Press.

Brown, H. D. & Abeywickrama, P. (2010). Language assessment: Principles and classroom practices. New York: Pearson Education.

Davidson, F., & Lynch, B. K. (2002). Testcraft: A teacher’s guide to writing and using language test specifications. New Haven: Yale University Press.

Tedick, D. J. (1990). ESL writing assessment: Subject-matter knowledge and its impact on performance. English for Specific Purposes, 9(2), 123-143.

Douglas, D. (2010). Understanding language testing. UK: Hodder Education.

Elbow, P. (1981). Writing with power: Techniques for mastering the writing process. New York: Oxford University Press.

Englert, C. S., Raphael, T. E., Anderson, L. M., Anthony, H. M., & Stevens, D. D. (1991). Making strategies and self-talk visible: Writing instruction in regular and special education classrooms. American Educational Research Journal, 28(2), 337-372.

Fulcher, G. (2010). Practical language testing. UK: Hodder Education.

Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. New York: Routledge.

Goldberg, N. (1986). Writing down the Bones: Freeing the writer within. Shambhala Publications.

Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge: Cambridge University Press.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13-103).

Miller, M.M. (2011) “The Spice of Writing: Extracurricular Projects for Technical Writers”. IPCC 92 Santa Fe. Crossing Frontiers. Conference Record. pp. 384–390.

Mousavi, A. (2002). Textbook trends in teaching language testing. Language Testing, 25, 3, 327-347.

Nation, P. (1993). Vocabulary size, growth, and use. In The bilingual lexicon (pp. 115-134).

Popham, W. J. (2008). Transformative assessment. Alexandria, VA: Association for Supervision and Curriculum Development.

Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Basingstoke: Palgrave Macmillan.
