Chart of raw scores on 3 related engineering assessment reports, Fall 2010

Re: chart of raw scores on 3 related engineering assessment reports, Fall 2010

Thanks for making this graph; it's a helpful visualization of the raters’ differences. You, Jayme and I have confirmed that the three reports are substantially identical. The other observation you made the other day was that Raters 1 & 2 were paired on one program, which led to that program receiving a different rating from the other two programs.

I re-made your chart, adding a solid black line for the average of the 6 ratings (as if all raters read one report, which in effect they did). I added two dashed black lines at 1/2 point above and below the average. Points between the dashed lines are ratings that agree with the average to within 1/2 point.

I see 3 blue, 1 red and 1 orange rating that are outside the 1/2 point agreement band.
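The band check described above can be sketched in a few lines of Python. This is only an illustration: the rater labels and scores below are invented, not the Fall 2010 data, and the function names are hypothetical. A rating exactly 1/2 point from the average is treated here as inside the band, which is an assumption.

```python
from statistics import mean

# Hypothetical ratings for one rubric dimension: rater label -> score.
# These numbers are invented for illustration; they are not the real data.
ratings = {
    "Rater 1": 3.5,
    "Rater 2": 2.5,
    "Rater 3": 2.0,
    "Rater 4": 2.5,
    "Rater 5": 2.0,
    "Rater 6": 2.5,
}

def agreement_band(scores, tolerance=0.5):
    """Return (average, lower bound, upper bound) for a collection of scores."""
    avg = mean(scores)
    return avg, avg - tolerance, avg + tolerance

def outside_band(ratings, tolerance=0.5):
    """Return the raters whose score is more than `tolerance` from the average."""
    avg, low, high = agreement_band(ratings.values(), tolerance)
    return [rater for rater, score in ratings.items() if score < low or score > high]

print(agreement_band(ratings.values()))  # the solid line and the two dashed lines
print(outside_band(ratings))             # ratings that fall outside the band
```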

One conversation could be about the quality of our internal norming (this was an inadvertent test of that).

More pressing now perhaps is how we represent this data back to the college and programs involved.

Gary & I chatted about my first graph, so I removed Rater 1, the most significant outlier, and re-made the chart. In this 2nd chart, only one of the 20 ratings is now outside the 1/2 point tolerance. I conclude we should update Process Actions with the average scores for all 3 programs. The resulting averages of 5 raters are:

Dim 1: 2.40
Dim 2: 2.70
Dim 3: 2.00
Dim 4: 2.10
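For anyone who wants to repeat the recalculation, here is a rough sketch of dropping one rater and re-averaging each dimension. The scores are made up for illustration (they will not reproduce the averages above), and `dimension_averages` is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical scores: rater -> [Dim 1, Dim 2, Dim 3, Dim 4].
# Invented values, not the actual ratings from the three reports.
scores = {
    "Rater 1": [4.0, 4.0, 3.5, 3.5],
    "Rater 2": [2.5, 3.0, 2.0, 2.0],
    "Rater 3": [2.0, 2.5, 2.0, 2.5],
    "Rater 4": [2.5, 2.5, 2.0, 2.0],
    "Rater 5": [3.0, 3.0, 2.0, 2.5],
}

def dimension_averages(scores, exclude=()):
    """Average each dimension over all raters not listed in `exclude`."""
    kept = [row for rater, row in scores.items() if rater not in exclude]
    # zip(*kept) regroups the rows so each tuple holds one dimension's scores.
    return [round(mean(column), 2) for column in zip(*kept)]

print(dimension_averages(scores))                       # all raters
print(dimension_averages(scores, exclude={"Rater 1"}))  # outlier removed
```

Comparing the two printed lists shows how much a single outlying rater moves each dimension's average.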

On 11/8/10 2:18 PM, “Kimberly Green” wrote:



Kimberly Green
Educational Designer / Assessment Specialist
Office of Assessment and Innovation
Washington State University
Pullman, Washington
(509) 335-5675


Examining the quality of our assessment system

With most of the 59 programs rated for this round, we are beginning an analysis of our system of assessment.

To recap, we drafted a rubric about a year ago and tested it with the Honors College self-study and a made-up Dept of Rocket Science self-study. We revised the rubric through discussions among staff and with some external input. In December we used the rubric to rate reports on 3 of the 4 dimensions (leaving off the action plan in the first round). Based on observations in the December round, the rubric was revised in mid-spring 2010.

We tested the new rubric at a state-wide assessment conference workshop in late April, using a program’s report from December. The group’s ratings agreed pretty well with our staff’s (data previously blogged).

The May-August versions of the rubric are nearly identical, with only some nuance changes based on the May experiences.

The figure below is a study of the ratings by OAI staff on each of the 4 rubric dimensions. It reports the absolute value of the difference between the ratings for each pair of raters, a measure of inter-rater agreement. We conclude that our ratings are in high agreement: 54% of pairs (85/156) agree within 0.5 point, and 83% agree within 1.0 point. We also observe that the character of the distribution of agreement is similar across all four of the rubric dimensions.
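The pairwise measure can be reproduced with a short Python sketch. The four ratings below are invented for illustration, and the helper names are hypothetical; only the 85/156 figure comes from our data.

```python
from itertools import combinations

# Hypothetical ratings by four raters on one dimension of one report.
ratings = [2.0, 2.5, 2.5, 3.5]

def pairwise_differences(ratings):
    """Absolute difference between the ratings of every pair of raters."""
    return [abs(a - b) for a, b in combinations(ratings, 2)]

def share_within(differences, threshold):
    """Fraction of rater pairs whose ratings differ by `threshold` or less."""
    close = sum(1 for d in differences if d <= threshold)
    return close / len(differences)

diffs = pairwise_differences(ratings)
print(diffs)                      # one entry per pair of raters
print(share_within(diffs, 0.5))   # share agreeing within 0.5 point
print(share_within(diffs, 1.0))   # share agreeing within 1.0 point

# Check of the percentage quoted in the text: 85 of 156 pairs.
print(round(100 * 85 / 156))      # 54
```

Four raters yield 6 pairs, so the number of pairs grows quickly as raters are added to a report.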

Sanity Checking the May 17 scores

We already have a process for reviewing the language and scores on individual responses. A couple of weeks ago, Ashley led an activity to compare scores (in each dimension) across 4 programs. That turned up a couple of anomalies.

This is an experiment to compare across the 22 programs that provided a report May 17. (Actually, a few have not yet finished the scoring process.)

The activity uses a spreadsheet listing the OAI contact, the reviewer, the program and its scores. Each OAI contact filters the list to the programs for which they are the contact, then sorts those programs by score on one dimension at a time and checks whether the programs fall in a reasonable rank order.

The activity continues for each reviewer, who filters on the programs they reviewed, then sorts those programs by score on one dimension at a time and checks whether the programs fall in a reasonable rank order.

The result is a variety of transects across the data, each with the opportunity to spot something that is out of order.
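For those who prefer a script to spreadsheet filters, the same filter-then-sort step might look like the sketch below. The column names, people, programs and scores are all hypothetical stand-ins, not the May 17 data.

```python
# Hypothetical rows standing in for the spreadsheet; every value is invented.
rows = [
    {"contact": "Contact A", "reviewer": "Reviewer X", "program": "Program 1", "dim1": 3.0, "dim2": 2.5},
    {"contact": "Contact A", "reviewer": "Reviewer Y", "program": "Program 2", "dim1": 2.0, "dim2": 3.5},
    {"contact": "Contact B", "reviewer": "Reviewer X", "program": "Program 3", "dim1": 4.0, "dim2": 2.0},
    {"contact": "Contact A", "reviewer": "Reviewer X", "program": "Program 4", "dim1": 2.5, "dim2": 3.0},
]

def transect(rows, role, person, dimension):
    """Keep the rows where `role` (contact or reviewer) is `person`,
    then sort them from lowest to highest score on `dimension`."""
    selected = [row for row in rows if row[role] == person]
    return sorted(selected, key=lambda row: row[dimension])

# One transect: Contact A's programs ranked on dimension 1.
for row in transect(rows, "contact", "Contact A", "dim1"):
    print(row["program"], row["dim1"])

# Another transect: Reviewer X's programs ranked on dimension 2.
for row in transect(rows, "reviewer", "Reviewer X", "dim2"):
    print(row["program"], row["dim2"])
```

The judgment about whether the resulting order is reasonable still has to come from the person who knows the programs; the script only produces the ordered list.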

Different ideas were offered about what to look for:

  1. programs whose scores were more than 1/2 point out of whack
  2. programs that were not in the right “bin” relative to others in that bin

The spreadsheet looks like this

Norming on the Guide to Assessment Rubric

Results of Norming on the Guide to Assessment rubric

This image captures the data from OAI’s norming work on the first 2 dimensions of the May 17 Guide to Assessment rubric. The sample being assessed is the Sociology self-study from December 2009.

The red data are from Gary & Nils’ presentation (“Compared to what?”) at the Assessment, Teaching and Learning conference at the end of April in Vancouver. ATL is a statewide conference hosted by the State Board of Community and Technical Colleges. We shared the rubric and the Sociology self-study, divided the audience of 40 into 4 groups and had each group rate the self-study (un-normed) using the rubric. We captured their data and displayed it in real time as a radar plot. From that effort we recruited several audience members to serve as external raters of the May 17 self-studies.