As the Result of a Sensitivity Review Items Containing __________ May Be Eliminated From a Test

Am J Pharm Educ. 2019 Sep; 83(7): 7204.

Best Practices Related to Examination Item Construction and Post-hoc Review

Michael J. Rudolph, PhD (corresponding author),a Kimberly K. Daugherty, PharmD,b Mary Elizabeth Ray, PharmD,c Veronica P. Shuford, MEd,d Lisa Lebovitz, JD,e and Margarita V. DiVall, PharmDf,g

Michael J. Rudolph

a University of Kentucky, Lexington, Kentucky

Kimberly K. Daugherty

b Sullivan University College of Pharmacy, Louisville, Kentucky

Mary Elizabeth Ray

c The University of Iowa College of Pharmacy, Iowa City, Iowa

Veronica P. Shuford

d Virginia Commonwealth University School of Pharmacy, Richmond, Virginia

Lisa Lebovitz

e University of Maryland School of Pharmacy, Baltimore, Maryland

Margarita V. DiVall

f Northeastern University School of Pharmacy, Boston, Massachusetts

g Editorial Board Member, American Journal of Pharmaceutical Education, Arlington, Virginia

Received 2018 Jun 11; Accepted 2019 Feb 18.

Abstract

Objective. To provide a practical guide to examination item writing, item statistics, and score adjustment for use by pharmacy and other health professions educators.

Findings. Each examination item type possesses advantages and disadvantages. Whereas selected-response items allow for efficient assessment of student recall and understanding of content, constructed-response items appear better suited for assessment of higher levels of Bloom's taxonomy. Although clear criteria have not been established, accepted ranges for item statistics and examination reliability have been identified. Existing literature provides guidance on when instructors should consider revising or removing items from future examinations based on item statistics and review, but limited information is available on performing score adjustments.

Summary. Instructors should select item types that align with the intended learning objectives to be measured on the examination. Ideally, an examination will consist of multiple item types to capitalize on the advantages and limit the effects of any disadvantages associated with a specific item format. Score adjustments should be performed judiciously and by considering all available item data. Colleges and schools should consider developing item-writing and score-adjustment guidelines to promote consistency.

Keywords: examination, best practice, item type, item analysis, score adjustment

INTRODUCTION

The main goal of assessment via examination is to accurately measure student achievement of desired knowledge and competencies, which are generally articulated through learning objectives.1,2 For students, locally developed examinations convey educational concepts and topics deemed important by faculty members, which allows students to interact with those concepts and receive feedback on the extent to which they have mastered the material.3,4 For faculty members, results provide valuable insight into how students are thinking about concepts, aid with identifying student misconceptions, and frequently serve as the basis for assigning course grades. Furthermore, examinations allow faculty members to evaluate student achievement of learning objectives so that they can make informed decisions regarding the future use and revision of instructional modalities.5,6

Written examinations may be effective assessment tools if designed to measure student achievement of the desired competencies in an effective way. Quality items (questions) are necessary for an examination to have reliability and to allow valid conclusions to be drawn from the resulting scores.7,8 Broadly defined, reliability refers to the extent to which an examination or other assessment leads to consistent and reproducible results, and validity pertains to whether the examination score provides an accurate measure of student achievement for the intended construct (eg, knowledge or skill domain).9,10 However, development of quality examination items, notably multiple choice, can be challenging; existing evidence suggests that a sizeable proportion of items within course-based examinations contain one or more flaws.7,11 While there are numerous published resources regarding examination and item development, most appear to be aimed toward those with considerable expertise or significant interest in the field, such as scholars in educational psychology or related disciplines.2,8-11 Our goal in authoring this manuscript was to provide an accessible primer on examination item development for pharmacy and other health professions faculty members. As such, this commentary discusses published best practices and guidelines for examination item development, including different item types and the advantages and disadvantages of each, item analysis for item improvement, and best practices for examination score adjustments. A thorough discussion of overarching concepts and principles related to examination content development, administration, and student feedback is contained in the companion commentary, "Best Practices on Examination Construction, Administration, and Feedback."12

General Considerations Before Writing Examination Items

Planning is essential to the development of a well-designed examination. Before writing examination items, faculty members should first consider the purpose of the examination (eg, formative or summative assessment) and the learning objectives to be assessed. One systematic approach is the creation of a detailed blueprint that outlines the desired content and skills to be assessed as well as the representation and intended level(s) of student cognition for each.12 This will help to determine not just the content and number of items but also the types of items that will be most appropriate.13,14 Moreover, it is important to consider the level of student experience with the desired item formats, as this can impact performance.15 Students should be able to demonstrate what they have learned, and performance should not be predicated upon their ability to understand how to complete each item.16 A student should be given formative opportunities to gain practice and experience with various item formats before encountering them on summative examinations. This will enable students to self-identify any test-taking deficiencies and could help to reduce test anxiety.17 Table 1 contains several recommendations for writing quality items and avoiding technical flaws.

Table 1.

Guidelines That Reflect Best Practices for Writing Quality Examination Items21-24

One of the most important principles when writing examination items is to focus on essential concepts. Examination items should assess the learning objectives and overarching concepts of the lesson, and test in a manner that accords with how students will ultimately use the information.18 Avoid testing on, or adding, trivial information to items, such as dates or prevalence statistics, which can cause construct-irrelevant variance in the examination scores (discussed later in this manuscript). Similarly, because students carefully read and analyze examination items, superfluous information diverts time and attention from thoughtful analysis and can cause frustration when students discover they could have answered the item without reading the additional content.4,5,19 A clear exception to this prohibition on extraneous information relates to items that are intended to assess the student's ability to parse out relevant information in order to provide or select the correct answer, as is often done with patient care scenarios. Still, faculty members should be cognizant of the amount of time it takes for students to read and answer complex items and should keep the overall amount of information on an examination manageable for reading, analysis, and completion.

Each item should test a single construct so that the knowledge or skill deficiency is identifiable if a student answers an item incorrectly. Additionally, each item should focus on an independent topic, and multiple items should not be "hinged" together.4 Hinged items are interdependent, such that student performance on the entire item set is linked to accuracy on each one. This may occur with a patient care scenario that reflects a real-life situation, such as performing a series of dosing calculations. However, this approach does not assess whether an initial mistake and subsequent errors resulted from a true lack of understanding of each step or occurred simply because a single mistake was propagated through the remaining steps. A more effective way to assess a multi-step process would be to have students work through all steps and provide a final answer (with or without showing their work) as part of a single question, or to present them with independent items that assess each step separately.

Examination Item Types

A variety of item types has been developed for use within written examinations, generally classified as selected-response formats, in which students are provided a list of possible answers, and constructed-response formats, which require students to supply the answer.4,19 Common selected-response formats include multiple choice (true/false, single best answer, multiple answer, and K-type), matching, and hot spots. Constructed-response formats consist of fill-in-the-blank, short answer, and essay/open response. Each format assesses knowledge or skills in a unique way and has distinct advantages and disadvantages, which are summarized in Table 2.20,21

Table 2.

Advantages and Disadvantages of Examination Question Types10,11,20,21,43

The most commonly used item format for written examinations is the multiple-choice question (MCQ), which includes true-false (alternative-choice), one-best-answer (ie, standard MCQ), and multiple-correct-answer items (eg, select all that apply, K-type).4,19 True-false items ask the examinee to make a judgment about a statement and are typically used to assess recall and comprehension.4 Each answer choice must be completely true or false and should test only one dimension, which can be deceptively challenging to write. Flawed true-false items can leave an examinee guessing at what the item writer intended to ask. Faculty members should not be tempted to use true-false items extensively as a means of increasing the number of examination items to cover more content or to limit the time needed for examination development. Although an examinee can answer true-false items quickly and scoring is straightforward, there is a 50% chance that an examinee can simply guess the correct answer, which leads to low item reliability and low overall test reliability. Not surprisingly, true-false questions are the most commonly discarded type of item after review of item statistics for standardized examinations.4 Though it may take additional time to grade, a way to use true-false items that requires higher-order thinking is to have the examinee identify, fix, or explain any statements deemed "false" as part of the question.22

One-best-answer items (traditional MCQs) are the most versatile of all examination item types, as they can assess the test taker's application, integration, and synthesis of knowledge as well as judgment.23 In terms of design, these items comprise a stem and a lead-in followed by a series of answer choices, only one of which is correct; the other, incorrect options serve as distractors. Sound assessment practices for one-best-answer MCQs include using a focused lead-in, making sure all choices relate to one construct, and avoiding vague terms. A simple means of determining whether a lead-in is focused is to use the "cover-the-options" rule: the examinee should be able to read the stem and lead-in, cover the options, and supply the correct answer without seeing the answer choices.4 The stem should typically be in the form of a positive or affirmative question or statement, as opposed to a negative one (eg, one that uses a word like "not," "false," or "except"). However, negative items may be appropriate in certain situations, such as when assessing whether the examinee knows what not to do (eg, what treatment is contraindicated). If used, a negative word should be emphasized using one or more of the following: italics, all capital letters, underlining, or boldface type.

In addition to the stem and correct answer, careful consideration should also be paid to writing MCQ distractors. Distractors should be grammatically consistent with the stem, similar in length, and plausible, and should not overlap.24 Use of "all of the above" and "none of the above" should be avoided, as these options decrease the reliability of the item.25 As few as two distractors are sufficient, but it is common to use three or four. Determining the appropriate number of distractors depends largely on the number of plausible choices that can be written. In fact, evidence suggests that using four or more options rather than three does not improve item performance.26 Additionally, a desirable trait of any distractor is that it should appeal to low-scoring students more than to high-scoring students, because the goal of the examination is to differentiate students according to their level of achievement (or preparation) and not their test-taking abilities.27

Multiple-answer, multiple-response, or "select all that apply" items are composed of groups of true-false statements nested under a single stem and require the test taker to make a judgment on each answer option; they may be graded using partial credit or an "all or nothing" requirement.4 A similar approach, known as K-type, provides the individual answer choices in addition to various combinations (eg, A and B; A and D; B, C, and E). Notably, K-type items tend to have lower reliability than "select all that apply" items because of the greater likelihood that an examinee can guess the correct answer through a process of elimination; therefore, use of K-type items is generally not recommended.4,24 Should a faculty member decide to use K-type items, we recommend that they include at least one correct answer and one incorrect answer. Otherwise, examinees are apt to believe it is a "trick" question, as they may find it unlikely that all choices are either correct or incorrect. Faculty members should also be careful not to hinge the answer choices within a multiple-answer item; the examinee should be required to evaluate each choice independently.
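
To make the grading choice concrete, the following is a minimal Python sketch (not taken from the article; the answer key and responses are invented) that scores a hypothetical "select all that apply" item under both an all-or-nothing rule and a simple partial-credit rule.

```python
# Hypothetical scoring of a "select all that apply" item.
# The answer key and the student responses are invented for illustration.

KEY = {"A", "C"}                 # correct options for this item
OPTIONS = {"A", "B", "C", "D"}   # all options shown to the examinee

def all_or_nothing(selected):
    """Full credit only when the selection matches the key exactly."""
    return 1.0 if set(selected) == KEY else 0.0

def partial_credit(selected):
    """One point per option judged correctly (selected if in the key,
    left blank if not), scaled to a 0-1 item score."""
    selected = set(selected)
    judged_correctly = sum((opt in KEY) == (opt in selected) for opt in OPTIONS)
    return judged_correctly / len(OPTIONS)

if __name__ == "__main__":
    for response in [{"A", "C"}, {"A"}, {"A", "B", "C"}, {"B", "D"}]:
        print(sorted(response),
              "all-or-nothing:", all_or_nothing(response),
              "partial:", round(partial_credit(response), 2))
```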

Matching and hot-spot items are two additional forms of selected-response items, and although they are used less frequently, their complexity may offer a convenient way to assess an examinee's grasp of key concepts.28 Matching items can assess knowledge and some comprehension if constructed appropriately. In these items, the stem is in one column and the correct response is in a second column. Responses may be used once or multiple times depending on item design. One advantage of matching items is that a large amount of knowledge may be assessed in a minimum amount of space. Moreover, instructor preparation time is lower compared with the other item types presented above. These aspects may be particularly important when the desired content coverage is substantial or the material contains many facts that students must commit to memory. Brevity and simplicity are best practices when writing matching items. Each item stem should be short, and the list of items should be brief (ie, no more than 10-15 items). Matching items should also contain items that share the same foundation or context and are arranged in a systematic order, and clear directions should be provided as to whether answers may be used more than once.

Hot-spot items are technology-enhanced versions of multiple-choice items. These items allow students to click areas on an image (eg, identify an anatomical structure or a component of a complex process) and select one or more answers. The advantages and disadvantages of hot spots are similar to those of multiple-choice items; however, there are minimal data currently available to guide best practices for hot-spot item development. Additionally, they are only available through certain types of testing platforms, which means not all faculty members may have access to this technology-assisted item type.29

Some educators suggest that performance on MCQs and other types of selected-response items is artificially inflated, as examinees may rely on recognition of the information provided by the answer choices.11,30 Constructed-response items such as fill-in-the-blank (or completion), short answer, and essay may provide a more accurate assessment of knowledge because the examinee must construct or synthesize their own answers rather than selecting them from a list.5 Fill-in-the-blank (FIB) items differ from short-answer and essay items in that they typically require only one- or two-word responses. These items may be more effective at minimizing guessing compared with selected-response items. Nonetheless, compared with short-answer and essay items, developing FIB items that assess higher levels of learning can be challenging because of the limited number of words needed to answer the item.28 Fill-in-the-blank items may require some degree of manual grading, as accounting for the exact answers students provide or for such nuances as capitalization, spacing, spelling, or decimal places may be difficult when using automated grading tools.
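
The grading nuances noted above (capitalization, spacing, and rounding) can be partially handled by normalizing responses before comparison. The sketch below is an illustrative approach only, not a feature of any particular testing platform; the accepted answers and numeric tolerance are hypothetical.

```python
# Illustrative normalization for auto-grading fill-in-the-blank responses.
# Accepted answers and the tolerance are hypothetical examples.

def normalize(text):
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(text.strip().lower().split())

def grade_text(response, accepted):
    """Credit if the normalized response matches any accepted form."""
    return normalize(response) in {normalize(a) for a in accepted}

def grade_numeric(response, correct, tol=0.01):
    """Credit numeric answers within a small tolerance (eg, rounding)."""
    try:
        return abs(float(response) - correct) <= tol
    except ValueError:
        return False

if __name__ == "__main__":
    print(grade_text("  Warfarin ", ["warfarin"]))   # True
    print(grade_numeric("12.50", 12.5))              # True
    print(grade_numeric("12.6", 12.5))               # False
```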

Short-answer items have the potential to effectively assess a combination of correct and incorrect ideas of a concept and to measure a student's ability to solve problems, apply principles, and synthesize information.10 Short-answer items are also straightforward to write and can reduce student cheating because they are more difficult for other students to view and copy.31 However, results from short-answer items may have limited validity, as the examinee may not provide enough information to allow the instructor to fully discern the extent to which the student knows or comprehends the information.4,5 For example, a student may misinterpret the prompt and merely provide an answer that tangentially relates to the concept tested, or, because of a lack of confidence, a student may not write about an area he or she is uncertain about.5 Grading must be accomplished manually in most cases, which can often be a deterrent to using this item type, and may also be inconsistent from rater to rater without a detailed key or rubric.10,28

Essay items provide the opportunity for faculty members to assess, and for students to demonstrate, greater knowledge and comprehension of course material beyond that of other item formats.10 There are two main types of essay item formats: extended response and restricted response. Extended-response items allow the examinee complete freedom to construct their answer, which may be useful for testing at the synthesis and evaluation levels of Bloom's taxonomy. Restricted-response items provide parameters or guides for the response, which allows for more consistent scoring. Essay items are also relatively easy for faculty members to develop and often necessitate that students demonstrate critical thinking as well as originality. Disadvantages include being able to assess only a limited amount of material because of the time needed for examinees to complete the essay, decreased validity of examination score interpretations if essay items are used exclusively, and the substantial time required to score the essays. Moreover, as with short answer, there is the potential for a high degree of subjectivity and inconsistency in scoring.9,11

Important best practices in constructing an essay item are to state a defined task for the examinee in the instructions, such as to compare ideas, and to limit the length of the response. The latter is especially important on an examination with multiple essay items intended to assess a wide array of concepts. Another recommendation is for faculty members to have a clear idea of the specific abilities they wish students to demonstrate before writing an item. A final recommendation is for faculty members to develop a prompt that creates "novelty" for students so that they must apply knowledge to a new situation.10 One of two methods is commonly employed in evaluating essay responses: an analytic scoring model, in which the instructor prepares an ideal answer with the major components identified and points assigned, or a holistic approach, in which the instructor reads a student's entire essay and grades it relative to other students' responses.28,30 Analytic scoring is the preferred method because it can reduce subjectivity and thereby lead to greater score reliability.
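
As a concrete illustration of the analytic scoring model described above, the sketch below scores an essay against a hypothetical component-based key with points assigned per component; the components, point values, and marked components are invented for the example.

```python
# Minimal sketch of analytic essay scoring: an ideal answer is broken into
# major components, each worth a set number of points. The rubric and the
# grader's markings are hypothetical.

RUBRIC = {
    "identifies the drug interaction": 3,
    "explains the mechanism": 3,
    "recommends appropriate monitoring": 2,
    "communicates a patient-specific plan": 2,
}

def analytic_score(components_present):
    """Sum the points for each rubric component the grader marked present."""
    return sum(points for component, points in RUBRIC.items()
               if component in components_present)

if __name__ == "__main__":
    graded = {"identifies the drug interaction", "recommends appropriate monitoring"}
    print(analytic_score(graded), "out of", sum(RUBRIC.values()))  # 5 out of 10
```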

Literature on Item Types and Student Outcomes

There are limited data in the literature comparing student outcomes by item type or number of distractors. Hubbard and colleagues conducted a cross-over study to identify differences in multiple true-false and free-response examination items.5 The study found that while correct response rates correlated across the two formats, a higher percentage of students provided correct responses to the multiple true-false items than to the free-response questions. Results also indicated that a higher prevalence of students exhibited mixed (correct and incorrect) conceptions on the multiple true-false items vs the free-response items, whereas a higher prevalence of students had partial (correct and unclear) conceptions on free-response items. This study suggests that multiple true-false items may direct students to specific concepts but obscure their critical thinking. Conversely, free-response items may provide more critical-thinking assessment while at the same time offering limited information on incorrect conceptions. The limitations of both item types may be overcome by alternating between the two within the same examination.5

In 1999, Martinez suggested that multiple-choice and constructed-response (free-response) items differed in cognitive demand as well as in the range of cognitive levels they were able to elicit.32 Martinez notes the inherent difficulty in comparing the two item types because each may come in a variety of forms and cover a range of different cognitive levels. However, he was able to identify several consistent patterns throughout the literature. First, both types may be used to assess information recall, understanding, evaluating, and problem solving, but constructed-response items are better suited to assess at the level of synthesis. Second, although they may be used to assess at higher levels, most multiple-choice items tend to assess knowledge and understanding, in part because of the expertise involved in writing valid multiple-choice items at higher levels. Third, both types of items are sensitive to examinees' personal characteristics that are unrelated to the topic being assessed, and these characteristics can lead to unwanted variance in scores. One such characteristic that tends to present problems for multiple-choice items more so than for constructed-response items is known as "testwiseness," or the skill of choosing the right answers without having greater knowledge of the material than another, comparable student. Another student characteristic that affects performance is test anxiety, which is often of greater concern with constructed-response items than with multiple-choice items. Finally, Martinez concludes that student learning is affected by the types of items used on examinations. In other words, students study and learn material differently depending on whether the examination will consist predominantly of multiple-choice items, constructed-response items, or a combination of the two.

In summary, the number of empirical studies examining the properties, such as reliability or level of knowledge, and student outcomes of written examinations based upon the use of one item type compared with another is currently limited. The few available studies and existing theory support the use of different item types to assess distinct levels of student cognition. In addition to considering the intended level(s) of knowledge to be assessed, each item type has distinct advantages and disadvantages regarding the amount of faculty preparation and grading time involved, the expertise required to write quality items, reliability and validity, and the student time required to answer. Consequently, a mixed approach that makes use of multiple types of items may be most appropriate for many course-based examinations. Faculty members could, for example, include a series of multiple-choice items, several fill-in-the-blank and short-answer items, and perhaps several essay items. In this manner, the instructor can take advantage of each item type while avoiding one or a few perpetual disadvantages associated with a single type.

Technical Flaws in Item Writing

There are common technical flaws that may occur when examination items of any type do not follow published best practices and guidelines such as those shown in Table 1. Item flaws introduce systematic errors that reduce validity and can negatively impact the performance of some test takers more so than others.7 There are two categories of technical flaws: irrelevant difficulty and "test-wiseness."4 Irrelevant difficulty occurs when there is an artificial increase in the difficulty of an item because of flaws such as options that are too long or complicated, numeric data that are not presented consistently, use of "none of the above" as an option, and stems that are unnecessarily complicated or negatively phrased.2,12 These and other flaws can add construct-irrelevant variance to the final examination scores because the item is challenging for reasons unrelated to the intended construct (knowledge, skills, or abilities) to be measured.33 Certain groups of students, for example, those who speak English as a second language or have lower reading comprehension ability, may be particularly impacted by technical flaws, leading to irrelevant difficulty. This "contaminating influence" serves to undermine the validity of interpretations drawn from examination scores.

Test-wise examinees are more perceptive and confident in their test-taking abilities compared with other examinees and are able to identify cues in the item or answer choices that "give away" the answer.4 Such flaws reward superior test-taking skills rather than knowledge of the material. Test-wise flaws include the presence of grammatical cues (eg, distractors having different grammar than the stem), grouped options, absolute terms, correct options that are longer than the others, word repetition between the stem and options, and convergence (eg, the correct answer includes the most elements in common with the other options).4 Because of the potential for these and other flaws, the authors strongly encourage faculty members to review Table 1 or the list of item-writing recommendations developed by Haladyna and colleagues when preparing examination items.24 Faculty members should consider asking a colleague to review their items prior to administering the examination as an additional means of identifying and correcting flaws and providing some assurance of content-related validity, which aims to determine whether the examination content covers a representative sample of the knowledge or behaviors to be assessed.34 For standardized or high-stakes examinations, a much more rigorous process of gathering multiple types of validity evidence should be undertaken; however, this is neither required nor practical for the majority of course-based examinations.15 Conducting an item analysis after students have completed the examination is important, as this may identify flaws that were not apparent at the time the examination was developed.

Overview of Item Analysis

An important opportunity for faculty learning, improvement, and self-assessment is a thorough post-examination review in which an item analysis is conducted. Electronic testing platforms that present item and examination statistics are widely available, and faculty members should have a general understanding of how to interpret and appropriately use this information.35 Item analysis is a powerful tool that, if misunderstood, can lead to inappropriate adjustments following administration and initial scoring of the examination. Unnecessarily removing or score-adjusting items on an examination may produce a range of undesirable problems, including poor content representation, student entitlement, grade inflation, and failure to hold students accountable for learning challenging material.

One of the most widely used and simplest item statistics is the item difficulty index (p), which is expressed as the percentage of students who correctly answered the item.10 For instance, if 80% of students answered an item correctly, p would be 0.80. Theoretically, p can range from 0 (if all students answered the item incorrectly) to 1 (if all students answered correctly). However, Haladyna and Downing note that, because of students guessing, the practical lower bound of p is 0.25 rather than zero for a four-option item, 0.33 for a three-option item, and so forth.27 Item difficulty and overall examination difficulty should reflect the purpose of the assessment. A competency-based examination, or one designed to ensure that students have a basic understanding of specific content, should contain items that most students answer correctly (high p value). For course-based examinations, where the purpose is usually to differentiate between students at various levels of achievement, the items should range in difficulty so that a large distribution of student total scores is attained. In other words, little information is obtained about student comprehension of the content if most items are extremely difficult (eg, p<.30) or easy (eg, p>.90). For quality improvement, it is just as important to evaluate items that nearly every student answers correctly as those with a low p. In reviewing p values, one should also consider the expectations for the intended outcome of each item and topic, which can be anticipated through careful planning and examination blueprinting as noted earlier. For example, some fundamental concepts that the instructor emphasizes many times or that require simple recall may lead to most students answering correctly (high p), which may be acceptable or even desirable.
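
For readers who want to reproduce this statistic outside a testing platform, the following minimal Python sketch computes the difficulty index p for each item from a 0/1 scored response matrix (rows are students, columns are items); the data are invented.

```python
# Item difficulty index p: proportion of students answering each item correctly.
# The 0/1 score matrix below (rows = students, columns = items) is invented.

scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]

def difficulty(score_matrix):
    n_students = len(score_matrix)
    n_items = len(score_matrix[0])
    return [sum(row[j] for row in score_matrix) / n_students
            for j in range(n_items)]

if __name__ == "__main__":
    for j, p in enumerate(difficulty(scores), start=1):
        print(f"item {j}: p = {p:.2f}")
```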

A second common measure of item performance is the item discrimination index (d), which measures how well an item differentiates between low- and high-performing students.36 There are several different methods that can be used to calculate d, although it has been shown that most produce comparable results.34 One approach for calculating d when scoring is dichotomous (correct or incorrect) is to subtract the percentage of low-performing students who answered a given item correctly from the percentage of high-performing students who answered correctly. Accordingly, d ranges from -1 to +1, where a value of +1 represents the extreme case of all high scorers answering the item correctly and all low scorers incorrectly, and -1 represents the case of all high scorers answering incorrectly and all low scorers correctly.

How students are identified as either "high performing" or "low performing" is somewhat arbitrary, but the most widely used cutoff is the top 27% and bottom 27% of students based upon total examination score. This practice stems from the need to identify extreme groups while having a sufficient number of cases in each group; 27% represents the location on the normal curve where these two criteria are approximately balanced.34 However, for very small class sizes (about 50 or fewer), defining the upper and lower groups using the 27% rule may still lead to unreliable estimates of item discrimination.38 One option for addressing this issue is to increase the size of the high- and low-scoring groups to the upper and lower 33%. In practice, this may not be feasible, as faculty members may be limited by the automated output of an examination platform, and we suspect most faculty members will not have the time to routinely perform such calculations by hand or using another platform. Alternatively, one can calculate (or refer to the examination output, if available) a phi (ϕ) or point-biserial (PBS) correlation coefficient between each student's response on an item and overall performance on the examination.34 Regardless of which of these calculation methods is used, the interpretation of d is the same. For all items on a commercial, standardized examination, d should be at least 0.30; however, for course-based assessments it should at least exceed 0.15.36,37 A summary of the definitions and use of different item statistics, including difficulty and discrimination, as well as examination reliability measures is found in Table 3.
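
The sketch below, using the same kind of invented 0/1 score matrix as above, shows one way to compute d from upper and lower groups (here the top and bottom 27% by total score) and, as an alternative, a point-biserial correlation between each item and the total score. The group fraction, data, and rounding choices are illustrative only.

```python
# Item discrimination: upper/lower group method and point-biserial correlation.
# The 0/1 score matrix (rows = students, columns = items) is invented.
from math import sqrt

scores = [
    [1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [0, 1, 0, 1],
    [1, 1, 0, 0], [0, 0, 0, 1], [1, 1, 1, 0], [0, 0, 1, 1],
]

def discrimination_upper_lower(score_matrix, item, fraction=0.27):
    """d = proportion correct in the top group minus the bottom group,
    with groups defined by total examination score."""
    ranked = sorted(score_matrix, key=sum)          # low to high total score
    k = max(1, round(fraction * len(ranked)))
    low, high = ranked[:k], ranked[-k:]
    return (sum(r[item] for r in high) / k) - (sum(r[item] for r in low) / k)

def point_biserial(score_matrix, item):
    """Pearson correlation between the 0/1 item score and the total score."""
    x = [row[item] for row in score_matrix]
    y = [sum(row) for row in score_matrix]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy) if sx and sy else 0.0

if __name__ == "__main__":
    for j in range(len(scores[0])):
        print(f"item {j + 1}: d = {discrimination_upper_lower(scores, j):+.2f}, "
              f"r_pbs = {point_biserial(scores, j):+.2f}")
```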

Table 3.

Definitions of Item Statistics and Examination Reliability Measures to Be Used to Ensure Best Practices in Examination Item Construction18-25,41

Another key factor used in diagnosing item performance, specifically on multiple-choice items, is the number of students who selected each possible answer. Answer choices that few or no students selected do not add value and need revision or removal from future iterations of the examination.38 Additionally, an incorrect answer choice that was selected as often as (or more often than) the correct answer could indicate an issue with item wording, the possibility of more than one correct answer selection, or even miscoding of the correct answer choice.
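
A simple response-frequency tabulation such as the sketch below (responses and key are invented) is often enough to spot a non-functioning distractor or a miskeyed item.

```python
# Distractor analysis: count how often each answer choice was selected.
# The responses and answer key below are invented for illustration.
from collections import Counter

responses = ["B", "B", "C", "B", "A", "B", "D", "B", "C", "B"]
KEY = "B"

counts = Counter(responses)
n = len(responses)
for choice in sorted(set(responses) | {KEY}):
    flag = "  <- keyed answer" if choice == KEY else ""
    print(f"{choice}: {counts[choice]:2d} ({counts[choice] / n:.0%}){flag}")
# Choices selected by few or no students are candidates for revision or
# removal; a distractor chosen more often than the keyed answer may signal
# a flawed or miskeyed item.
```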

Examination Reliability

Implications of the quality of each individual examination item extend beyond whether it provides a valid measure of student achievement for a given content area. Collectively, the quality of items affects the reliability and validity of the overall examination scores. For this reason, and because many existing electronic testing platforms provide examination reliability statistics, the authors have determined that a brief discussion of this topic is warranted. There are several classic approaches in the literature for estimating the reliability of an examination, including test-retest, parallel forms, and subdivided test.37 Within courses, the first two are rarely used, as they require multiple administrations of the same examination to the same individuals. Instead, one or more variants of subdivided test reliability are used, most notably split-half, Kuder-Richardson, or Cronbach alpha. As the name implies, split-half reliability involves dividing the examination items into equivalent halves and calculating the correlation of student scores between the two parts.38 The purpose is to provide an estimate of the accuracy with which an examinee's knowledge, skills, or traits are measured by the examination. Several formulas exist for split-half reliability, but the most common involves the calculation of a Pearson bivariate correlation (r).37 Several limitations exist for split-half reliability, notably the use of a single instrument and administration as well as sensitivity to speed (timed examinations), both of which can lead to inflated reliability estimates.
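
A minimal sketch of split-half reliability follows, splitting an invented 0/1 score matrix into odd- and even-numbered items, correlating the half-test scores, and applying the Spearman-Brown correction (a standard companion step not detailed in the text) to estimate full-length reliability.

```python
# Split-half reliability sketch: correlate half-test scores from odd vs even
# items, then apply the Spearman-Brown correction to estimate full-length
# reliability. The 0/1 score matrix is invented.
from math import sqrt

scores = [
    [1, 1, 0, 1, 1, 0], [1, 0, 0, 1, 1, 1], [1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 1], [1, 1, 0, 1, 1, 1], [0, 0, 1, 0, 0, 0],
]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

odd_half = [sum(row[0::2]) for row in scores]    # items 1, 3, 5, ...
even_half = [sum(row[1::2]) for row in scores]   # items 2, 4, 6, ...

r_half = pearson(odd_half, even_half)
spearman_brown = 2 * r_half / (1 + r_half)       # full-length estimate
print(f"half-test r = {r_half:.2f}, Spearman-Brown corrected = {spearman_brown:.2f}")
```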

The Kuder-Richardson formula, or KR20, was developed as a measure of the internal consistency of the items on a scale or examination. It is an appropriate measure of reliability when item answers are dichotomous and the examination content is homogeneous.38 When examination items are ordinal or continuous, Cronbach alpha should be used instead. The KR20 and alpha can range from 0 to 1, with 0 representing no internal consistency and values approaching 1 indicating a high degree of reliability. In general, a KR20 or alpha of at least 0.50 is desired, and most course-based examinations should range between 0.60 and 0.80.39,40 The KR20 and alpha are both dependent upon the total number of items, the standard deviation of total examination scores, and the discrimination of items.9 The dependence of these reliability coefficients on multiple factors means there is not a set minimum number of items needed to achieve the desired reliability. Nevertheless, the inclusion of additional items that are similar in quality and content to existing items on an examination will generally improve examination reliability.
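
The following sketch computes KR20 and Cronbach alpha from an invented 0/1 score matrix; for dichotomous items scored this way the two coincide, and the formulas follow the standard textbook definitions rather than any particular platform's output.

```python
# KR20 (dichotomous items) and Cronbach alpha sketches.
# The 0/1 score matrix (rows = students, columns = items) is invented.

scores = [
    [1, 1, 0, 1, 1], [1, 0, 0, 1, 1], [1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0], [1, 1, 0, 1, 1], [0, 0, 1, 0, 0],
]

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def cronbach_alpha(matrix):
    """alpha = k/(k-1) * (1 - sum of item variances / total-score variance)."""
    k = len(matrix[0])
    item_vars = [variance([row[j] for row in matrix]) for j in range(k)]
    total_var = variance([sum(row) for row in matrix])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def kr20(matrix):
    """KR20 = k/(k-1) * (1 - sum(p*q) / total-score variance), 0/1 items only."""
    k = len(matrix[0])
    n = len(matrix)
    p_values = [sum(row[j] for row in matrix) / n for j in range(k)]
    pq_sum = sum(p * (1 - p) for p in p_values)
    total_var = variance([sum(row) for row in matrix])
    return (k / (k - 1)) * (1 - pq_sum / total_var)

if __name__ == "__main__":
    print(f"KR20 = {kr20(scores):.2f}, alpha = {cronbach_alpha(scores):.2f}")
```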

As noted above, KR20 and alpha are sensitive to examination homogeneity, meaning the extent to which the examination measures the same trait throughout. An examination that contains somewhat disparate disciplines or content may produce a low KR20 coefficient despite having a sufficient number of well-discriminating items. For instance, an examination containing 10 items each for biochemistry, pharmacy ethics, and patient assessment may exhibit poor internal consistency because a student's ability to perform at a high level in one of these areas is not necessarily correlated with the student's ability to perform well in the other two. One solution to this issue is to divide such an examination into multiple, single-trait assessments, or simply to calculate the KR20 separately for the items measuring each trait.36 Because of the limitations of KR20 and other subdivided measures of reliability, these coefficients should be interpreted in context and in conjunction with item analysis data as a means of improving future administrations of an examination.

Another means of examining the reliability of an examination is the standard error of measurement (SEM) of the scores it produces.34 From classical test theory, it is understood that no assessment can perfectly measure the desired construct or trait in an individual because of various sources of measurement error. Conceptually, if the same assessment were administered to the same student 100 times, for example, numerous different scores would be obtained.41 The mean of these 100 scores is assumed to represent the student's true score, and the standard deviation of the assessment scores would be mathematically equivalent to the standard error. Thus, a lower SEM is desirable (0.0 is the ideal standard), as it leads to greater confidence in the precision of the measured or observed score.

In practice, the SEM is calculated for each individual student's score using the standard deviation of examination scores and the reliability coefficient, such as the KR20 or Cronbach alpha. Assuming the distribution of examination scores is approximately normal, there is a 68% probability that a student's true score is within ±1 SEM of the observed score, and a 95% probability that it is within ±2 SEM of the observed score.9 For example, if a student has a measured score of 80 on an examination and the SEM is 5, there is a 95% probability that the student's true score is between 70 and 90. Although the SEM provides a useful measure of the precision of the scores an examination produces, a reliability coefficient (eg, KR20) should be used for the purpose of comparing one examination to another.10
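
A minimal sketch of this calculation is shown below; the score standard deviation and reliability coefficient are assumed values chosen so that the SEM works out to 5, matching the worked example above.

```python
# Standard error of measurement: SEM = SD(total scores) * sqrt(1 - reliability).
# The standard deviation and reliability below are assumed example values.
from math import sqrt

sd_scores = 10.0      # standard deviation of examination scores (hypothetical)
reliability = 0.75    # eg, KR20 or Cronbach alpha (hypothetical)
observed = 80.0       # one student's observed score

sem = sd_scores * sqrt(1 - reliability)
print(f"SEM = {sem:.1f}")
print(f"~68% band: {observed - sem:.1f} to {observed + sem:.1f}")
print(f"~95% band: {observed - 2 * sem:.1f} to {observed + 2 * sem:.1f}")
```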

Post-examination Item Review and Score Adjustment

Faculty members should review the item statistics and examination reliability information as soon as it is available and, ideally, prior to releasing scores to students. Review of this information may serve both to identify flawed items that warrant immediate attention, including any that have been miskeyed, and to flag those that should be refined or removed prior to future administrations of the same examination. When interpreting p and d, the instructor should follow published guidelines but avoid setting hard "cutoff" values for removing or score-adjusting items.39 Another important consideration in the interpretation of item statistics is the length of the examination. For an assessment with a small number of items, the item statistics should not be used because students' total examination scores will not be very reliable.42 Moreover, interpretation and use of item statistics should be performed judiciously, considering all available information before making changes. For example, an item with a difficulty of p=.3, which indicates only 30% of students answered correctly, may appear to be a strong candidate for removal or adjustment. This may be the case if it also discriminated poorly (eg, if d=-0.3). In this case, few students answered the item correctly and low scorers were more likely than high scorers to do so, which suggests a potential flaw with the item. It could indicate incorrect coding of answer choices or that the item was confusing to students and those who answered correctly did so by guessing. Alternatively, if this same item had a d=0.5, the instructor might not remove or adjust the item because it differentiated well between high and low scorers, and the low p may simply indicate that many students found the item or content challenging or that less instruction was provided on that topic. Item difficulty and discrimination ranges are provided in Table 4 along with their interpretation and general guidelines for item removal or revision.

Table 4.

Recommended Interpretations and Actions Using Item Difficulty and Discrimination Indices to Ensure Best Practices in Examination Item Construction36

In general, instructors should routinely review all items with a p<.5-.6.38,43 In cases where the answer choices have been miscoded (eg, one or more correct responses coded as incorrect), the instructor should simply recode the answer key to award credit appropriately. Such coding errors can generally be identified through examination of both the item statistics and the frequency of student responses for each answer selection. Again, this type of adjustment does not present any ethical dilemmas if performed before students' scores are released. In other cases, score adjustment may appear less straightforward, and the instructor has several options available (Table 4). A poorly performing item, identified as one having both a low p (<.60) and a low d (<.15), is a possible candidate for removal because the item statistics suggest that the students who answered correctly most likely did so by guessing.38 This approach, however, has drawbacks because it decreases the denominator of points possible and at least slightly increases the value of the remaining items. A similar adjustment is that the instructor could award full credit for the item to all students, regardless of their specific response. Alternatively, the instructor could retain the poorly performing item and award partial credit for some answer choices or treat it as a bonus. Depending upon the type and severity of the issue(s) with the item, either applying partial credit or awarding bonus points may be more desirable than removing the item from counting toward students' total scores, because these solutions do not take away points from those who answered correctly. However, these types of adjustments should only be made when the item itself is not highly flawed but rather more challenging or advanced than intended.38 For example, treating an item as a bonus might be appropriate when p≤.3 and d≥0.15.
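
The adjustment options above can be expressed as simple operations on the answer key and score matrix. The sketch below is illustrative only; the responses, key, and adjustment decisions are invented, and any real adjustment should follow the item-level review described in the text and local policy.

```python
# Illustrative post-hoc adjustments: rekeying a miscoded item, dropping an
# item from the denominator, and treating an item as a bonus.
# Responses, key, and the adjustment decisions are invented.

responses = [                      # one row per student, one letter per item
    ["A", "C", "B", "D"],
    ["A", "B", "B", "D"],
    ["B", "C", "A", "D"],
]
key = ["A", "C", "B", "C"]         # suppose item 4 was miscoded; intended answer is "D"

def percent_scores(responses, key, dropped=(), bonus=()):
    """Dropped items leave the denominator entirely; bonus items add to the
    numerator only (so they can only help a student's score)."""
    counted = [j for j in range(len(key)) if j not in dropped and j not in bonus]
    out = []
    for row in responses:
        earned = sum(row[j] == key[j] for j in counted)
        earned += sum(row[j] == key[j] for j in bonus)
        out.append(round(100 * earned / len(counted), 1))
    return out

key[3] = "D"                                         # 1) rekey the miscoded item
print(percent_scores(responses, key))                # rekey only
print(percent_scores(responses, key, dropped=(2,)))  # 2) also drop flawed item 3
print(percent_scores(responses, key, bonus=(2,)))    # 3) or count item 3 as bonus
# On this tiny example the bonus option can push a score above 100%; on a real
# examination with many items it adds at most a few percentage points.
```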

As a concluding comment on score adjustment, faculty members should note that course-based examinations are likely to contain quite a few flawed items. A study of basic science examinations in a Doctor of Medicine program determined that between 35% and 65% of items contained at least one flaw.7 This suggests that faculty members will need to find a healthy balance between providing score adjustments on examinations out of fairness to their students and maintaining the integrity of the examination by not removing all flawed items. Thus, we suggest that examination score adjustments be made sparingly.

Regarding revision of items for future use, the same guidelines discussed above and presented in Table 4 hold true. Item statistics are an important means of identifying and thereby correcting item flaws. The frequency with which answer options were selected should also be reviewed to determine which, if any, distractors did not perform adequately. Haladyna and Downing noted that when less than 5% of examinees select a given distractor, the distractor probably only attracted random guessers.26 Such distractors should be revised, replaced, or removed altogether. As noted previously, including more options rarely leads to better item performance, and the presence of two or three distractors is sufficient. Examination reliability statistics (the KR20 or alpha) do not offer sufficient information to target item-level revisions but may be helpful in identifying the extent to which item flaws may be reducing the overall examination reliability. Additionally, the reliability statistics can point toward the presence of multiple constructs (eg, different types of content, skills, or abilities), which may not have been the intention of the instructor.

In summary, instructors should carefully review all available item information before determining whether to remove items or adjust scoring immediately following an examination, and they should consider the implications for students and other instructors. Each school may wish to consider developing a common set of standards or best practices to assist their faculty members with these decisions. Examination and item statistics may also be used by faculty members to improve their examinations from year to year.

Conclusion

Assessment of student learning through examination is both a science and an art. It requires the ability to organize objectives and plan in advance, the technical skill of writing examination items, a conceptual understanding of item analysis and examination reliability, and the resolve to continually improve one's role as a professional educator.

REFERENCES

1. Ahmad RG, Hamed OAE. Impact of adopting a newly developed blueprinting method and relating it to item analysis on students' performance. Med Teach. 2014;36(Suppl 1):55–62. [PubMed] [Google Scholar]

2. Kubiszyn T, Borich GD. Educational Testing and Measurement. Hoboken, NJ: Wiley; 2016. [Google Scholar]

3. Brown S, Knight P. Assessing Learners in Higher Education. New York, NY: Routledge Falmer; 1994. [Google Scholar]

5. Hubbard JK, Potts MA, Couch BA. How question types reveal student thinking: an experimental comparison of multiple-true-false and free-response formats. CBE Life Sci Educ. 2017;16(2):1–13. [PMC free article] [PubMed] [Google Scholar]

6. Suskie L. Assessing Student Learning: A Common Sense Guide. 2nd ed. San Francisco, CA: Jossey-Bass; 2009. [Google Scholar]

7. Downing SM. The effects of violating standard item writing principles on tests and students: the consequences of using flawed test items on achievement examinations in medical education. Adv Health Sci Educ. 2005;10:133–143. doi: 10.1007/s10459-004-4019-5. [PubMed] [CrossRef] [Google Scholar]

8. Downing SM, Haladyna TM. Test item development: validity evidence from quality assurance procedures. Appl Meas Educ. 1997;10(1):61–82. [Google Scholar]

9. Cohen RJ, Swerdlik. Psychological Testing and Assessment: An Introduction to Tests and Measurement. 4th ed. Mountain View, CA: Mayfield; 1999. [Google Scholar]

10. Thorndike RL, Hagen EP. Measurement and Evaluation in Psychology and Education. 8th ed. New York, NY: Pearson; 2008. [Google Scholar]

11. Downing SM. Selected-response item formats in test development. In: Downing SM, Haladyna T, editors. Handbook of Test Development. Mahwah, NJ: Lawrence Erlbaum; 2006. pp. 287–302. [Google Scholar]

12. Ray ME, Daugherty KK, Lebovitz L, Rudolph MJ, Shuford VP, DiVall MV. Best practices on examination construction, administration, and feedback. Am J Pharm Educ. 2018;82(10):Article 7066. doi: 10.5688/ajpe7066. [PMC free article] [PubMed] [CrossRef] [Google Scholar]

13. Roid G, Haladyna T. The emergence of an item-writing technology. Rev Educ Res. 1980;50(2):293–314. [Google Scholar]

14. Wendler CLW, Walker ME. Practical issues in designing and maintaining multiple test forms for large-scale programs. In: Downing SM, Haladyna T, editors. Handbook of Test Development. Mahwah, NJ: Lawrence Erlbaum; 2006. pp. 445–468. [Google Scholar]

15. Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ. 2003;37:830–837. [PubMed] [Google Scholar]

16. Airasian P. Assessment in the Classroom. New York, NY: McGraw-Hill; 1996. [Google Scholar]

17. Zohar D. An additive model of test anxiety: role of exam-specific expectations. J Educ Psychol. 1998;90(2):330–340. [Google Scholar]

18. Bridge PD, Musial J, Frank R, Thomas R, Sawilowsky S. Measurement practices: methods for developing content-valid student examinations. Med Teach. 2003;25(4):414–421. [PubMed] [Google Scholar]

19. Al-Rukban MO. Guidelines for the construction of multiple choice questions tests. J Family Community Med. 2006;13(3):125–133. [PMC free article] [PubMed] [Google Scholar]

24. Haladyna TM, Downing SM, Rodriguez MC. A review of multiple-choice item-writing guidelines for classroom assessment. Appl Meas Educ. 2002;15(3):309–334. [Google Scholar]

25. Hansen JD, Dexter L. Quality multiple-choice test items: item-writing guidelines and an analysis of auditing test banks. J Educ Bus. 1997;72(2):94–97. [Google Scholar]

26. Haladyna TM, Downing SM. How many options is enough for a multiple-choice test item? Educ Psychol Meas. 1993;53(4):999–1010. [Google Scholar]

27. Haladyna TM, Downing SM. Developing and Validating Multiple-Choice Test Items. Mahwah, NJ: Lawrence Erlbaum; 1999. [Google Scholar]

30. Funk SC, Dickson KL. Multiple-choice and short-answer exam performance in a college classroom. Teach Psychol. 2011;38(4):273–277. [Google Scholar]

31. Impara JC, Foster D. Item and test development strategies to minimize test fraud. In: Downing SM, Haladyna T, editors. Handbook of Test Development. Mahwah, NJ: Lawrence Erlbaum; 2006. pp. 91–114. [Google Scholar]

32. Martinez ME. Cognition and the question of test item format. Educ Psychol. 1999;34(4):207–218. [Google Scholar]

33. Haladyna TM, Downing SM. Construct-irrelevant variance in high-stakes testing. Educ Meas Issues Pract. 2004;23(1):17–27. [Google Scholar]

34. Anastasi A, Urbina S. Psychological Testing. 7th ed. New York, NY: Pearson; 1997. [Google Scholar]

35. Rudolph MJ, Lee KC, Assemi M, et al. Surveying the current landscape of assessment structures and resources in US schools and colleges of pharmacy. Curr Pharm Teach Learn. In press. [PubMed] [Google Scholar]

36. Lane S, Raymond MR, Haladyna TM. Handbook of Test Development (Educational Psychology Handbook). 2nd ed. New York, NY: Routledge; 2015. [Google Scholar]

37. Secolsky C, Denison DB. Handbook on Measurement, Assessment, and Evaluation in Higher Education. 2nd ed. New York, NY: Routledge; 2018. [Google Scholar]

38. McDonald ME. The Nurse Educator's Guide to Assessing Student Learning Outcomes. 4th ed. Burlington, MA: Jones & Bartlett; 2017. [Google Scholar]

39. Frey BB. Sage Encyclopedia of Educational Research, Measurement, and Evaluation. Thousand Oaks, CA: Sage; 2018. [Google Scholar]

40. Van Blerkom ML. Measurement and Statistics for Teachers. New York, NY: Routledge; 2017. [Google Scholar]

41. Kline P. Handbook of Psychological Testing. 2nd ed. New York, NY: Routledge.

42. Livingston SA. Item analysis. In: Downing SM, Haladyna TM, editors. Handbook of Test Development. Mahwah, NJ: Lawrence Erlbaum; 2006. pp. 421–444. [Google Scholar]

43. Billings DM, Halstead JA. Teaching in Nursing: A Guide for Faculty. 4th ed. St. Louis, MO: Elsevier Saunders; 2013. [Google Scholar]


Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6788158/
