COM-FSM Founding Day Judging Statistical Charts

College of Micronesia-FSM Founding Day judging involved the use of four rubrics and eleven judges.

Four judges rated the floats on one rubric and the team parade in motion on another rubric. These scores were added to generate the float/parade scores. The two rubrics could have generated up to 40 points per judge, or 160 points for all four judges. No team scored more than 141 points. The total scores and rank positions are given below.

Group              Sum   Rank
Chuuk              139   2
Kosrae             118   6
Nukap              114   7
PingMok            141   1
Pohnpei national    99   8
Pohnpei state      135   5
Sapwafik           138   4
Yap                139   2

The results as announced awarded third place to Sapwafik, with Chuuk and Yap tied for second place.
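The rank column in the table follows what is usually called standard competition ranking: tied teams share a place and the next place is skipped. A minimal Python sketch of that convention, using the totals from the table above (the function name `competition_rank` is only illustrative):

```python
# Float/parade totals copied from the table above.
float_parade = {
    "Chuuk": 139, "Kosrae": 118, "Nukap": 114, "PingMok": 141,
    "Pohnpei national": 99, "Pohnpei state": 135, "Sapwafik": 138, "Yap": 139,
}

def competition_rank(totals):
    """Rank highest total first; tied totals share a rank and the next rank is skipped."""
    ordered = sorted(totals.values(), reverse=True)
    # index() returns the first position of a total, so tied totals get the same rank.
    return {group: ordered.index(total) + 1 for group, total in totals.items()}

ranks = competition_rank(float_parade)
for group, rank in sorted(ranks.items(), key=lambda item: item[1]):
    print(f"{group:18} {float_parade[group]:4d} {rank:3d}")
```

With this convention the two teams tied on 139 share second place and no team is ranked third by score.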

Each group had four scores, one from each judge. Four scores are not enough for a box plot, so the following chart simply displays the minimum to maximum range of the four scores for each group. A longer bar is a larger range between the smallest and the largest score awarded and represents more variation in the scoring for that group. A shorter bar is a smaller range and represents less variation in the scoring for that group.
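The range behind each bar is just the largest score minus the smallest. A small sketch in Python; the scores below are invented placeholders for illustration, not the actual judges' scores, and `score_range` is a hypothetical helper name:

```python
# Hypothetical per-judge scores (out of 40) for two made-up groups.
# These are NOT the real Founding Day scores.
scores = {
    "Group A": [30, 34, 36, 38],
    "Group B": [33, 34, 35, 35],
}

def score_range(values):
    """Return (minimum, maximum, spread) for a list of scores."""
    low, high = min(values), max(values)
    return low, high, high - low

for group, values in scores.items():
    low, high, spread = score_range(values)
    print(f"{group}: {low} to {high}, range {spread}")
```

In this made-up example Group A would get the longer bar, since its four judges disagreed by eight points while Group B's disagreed by two.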

One of the questions I had was whether the rubric would generate a common sense of standards, that is, whether the judges' scores would generally concur in terms of their ranges of scoring. All four judges were faculty members who had taught, among other classes at the college, an introductory art course.

The box plots indicate that scores for judges one and three were distributed across the scoring range in a similar fashion. This does not necessarily mean that their scores coincided for a given float, only that their distributions of scores were similar. Judge two scored low compared to one and three. Judge four scored high compared to one and three. The sample sizes are too small to determine significance.

The rubric does not appear to produce inter-judge consistent distributions of scores.

The dance/performance section of the program was judged by seven judges selected by the participating groups. The intent was to answer concerns expressed in the past that judges might be biased. In the past foreigners were often tapped to be judges on the grounds that they would not be perceived to be biased towards (nor against) any particular group. The complication is that even foreigners have connections to the community. For example my selection as a judge raised concerns that I might be biased towards the group that includes my wife. There are very few, possibly no, truly neutral observers. Another complication was that foreigners do not know the custom and culture and thus had difficulty judging the dances and performances.

My solution was to build in the bias - count on bias occurring - and design for it. Instead of trying to find perfectly neutral judges, ask each group - each team - to nominate a judge from their own community. No other restrictions were placed, although I suggested that someone with knowledge of custom, culture, choreography, dance, and performance would be a good choice. I also suggested that this might be an opportunity to involve a respected culturally knowledgeable elder from their community, a suggestion a number of teams took up eagerly.

All eight groups made nominations. Each group was responsible for ensuring their judge was in place for the start of performances at 10:00. Six judges were in place by 10:00 and a seventh arrived in the nick of time to judge the first group. The eighth judge did not arrive. In my design, however, the responsibility for ensuring a judge was in place did not fall to the judges' coordinator but rather to the team. The loss of a judge would, according to the model, be to their own disfavor.

The groups had to choose a dance or performance rubric. Miscommunication early in the spring term led some groups to believe they had to do a dance and a performance, while others thought the choice was "or" not "and". Thus some teams were preparing only one, others both. This did not come to light until far too late in the preparation process for the teams with only one to practice the other. So the decision was taken that teams had to declare a rubric choice in advance.

The dance rubric produced up to 28 points per judge. The performance option also generated up to 28 points per judge. Seven times 28 is 196, thus the total possible points any one team could achieve was 196 points.

Group              Sum   Rank
Chuuk              158   6
Kosrae             156   7
Nukap              145   8
PingMok            181   3
Pohnpei national   182   2
Pohnpei state      181   3
Sapwafik           175   5
Yap                190   1

The eventual winner, Yap, captured 190 of the 196 possible points.

Pohnpei national, the only group to use the performance rubric, captured second place. Their choice to take the road less traveled by may have made all the difference for Pohnpei national in their capture of second place. All other teams used the dance rubric.

Both the float/parade and the dance/performance sections generated ties in the top three. I had hoped that judging differentials would work against a tie occurring, but ties occurred for both. In fact, in the dance/performance section, second and third place were separated by only a single point. This suggests that the rubrics do not sufficiently distinguish among performances. One solution might be more metrics or a wider scale, but tallying the existing set of metrics is already a time-consuming and taxing endeavor.
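How close the top places were can be read directly off the totals. A short Python sketch using the totals from the two results tables (the helper name `top_margins` is an assumption for illustration):

```python
# Totals copied from the two results tables.
float_parade = {
    "Chuuk": 139, "Kosrae": 118, "Nukap": 114, "PingMok": 141,
    "Pohnpei national": 99, "Pohnpei state": 135, "Sapwafik": 138, "Yap": 139,
}
dance_performance = {
    "Chuuk": 158, "Kosrae": 156, "Nukap": 145, "PingMok": 181,
    "Pohnpei national": 182, "Pohnpei state": 181, "Sapwafik": 175, "Yap": 190,
}

def top_margins(totals, places=4):
    """Point gaps between consecutive totals among the top `places` scores."""
    top = sorted(totals.values(), reverse=True)[:places]
    return [higher - lower for higher, lower in zip(top, top[1:])]

print(top_margins(float_parade))       # [2, 0, 1]
print(top_margins(dance_performance))  # [8, 1, 0]
```

A gap of 0 is a tie. Apart from Yap's eight-point lead in the dance/performance section, every gap among the top four totals is two points or less.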

Given that the judges were from diverse backgrounds and almost all, if not all, were new to judging dance and cultural presentations, the distribution of scores for each judge was of interest. Did the judges produce similar scoring distributions?

The answer is a fairly emphatic no: based on the box plots above, the judges did not produce similar scoring distributions. One judge awarded only scores of 27 and 28, while other judges generated wider spreads of scores. Outside of judge four, judges did appear to try to use the rubric to make distinctions among the teams. These differences in distributions, however, were both expected and unavoidable given the judge selection process.

When the scores are sliced by group, the distributions are less dissimilar. Three groups saw a wide range of scores; essentially, the judges disagreed about these three groups. Nukap had the lowest median score at 19, with Chuuk and Kosrae only slightly higher at 21. The five other groups scored in a narrower and higher range, with median scores of 25 and higher (Pohnpei State campus had a median of 25, the same as their first quartile value).

Yap, which captured 190 of 196 points, saw a narrow range of very high scores, with a median of 27 and both the third quartile and the maximum at 28.

Sapwafik had a single low outlier at 20, the only outlier for any of the groups. The general lack of outliers is a good sign. Had the chart been filled with high outliers, one might suspect that judges were low-scoring all teams but their nominating team. The absence of high outliers suggests that the judges did not do this.

Were the judges biased towards their own nominating teams? This is a difficult question to answer statistically as there is only a sample of one to examine for each judge. One way to look at the data is to ask whether a judge scored their own nominating team above or below the median score for that particular judge.

The orange bars indicate judges who awarded their own nominating team a higher score than their median. For the orange bars, the bottom of the bar is the median and the top is the score they awarded to the team that nominated them to the judges panel. Note that the differences between the judge's median score and the score they gave their home island team are not large: four points or fewer. These four judges may have favored their own nominating team, or felt that the team performed above median, but the differences were still small. In other words, no judge was deliberately down scoring other teams while up scoring their nominating team.
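The quantity each bar shows is the score a judge gave their nominating team minus that judge's own median score. A minimal Python sketch of the calculation; the judges, teams, and scores below are invented for illustration and are not the actual data, and `home_difference` is a hypothetical helper name:

```python
from statistics import median

# Hypothetical data for illustration only; these are NOT the actual scores.
# Each judge maps to (nominating team, scores that judge awarded to every team).
judges = {
    "Judge A": ("Team 1", {"Team 1": 27, "Team 2": 24, "Team 3": 25, "Team 4": 22}),
    "Judge B": ("Team 2", {"Team 1": 26, "Team 2": 26, "Team 3": 26, "Team 4": 28}),
    "Judge C": ("Team 3", {"Team 1": 25, "Team 2": 23, "Team 3": 21, "Team 4": 27}),
}

def home_difference(home_team, awarded):
    """Score given to the nominating team minus the judge's median score."""
    return awarded[home_team] - median(awarded.values())

for judge, (home_team, awarded) in judges.items():
    diff = home_difference(home_team, awarded)
    if diff > 0:
        label = "above median"
    elif diff < 0:
        label = "below median"
    else:
        label = "at median"
    print(f"{judge}: {diff:+} ({label})")
```

A positive difference corresponds to a judge who scored their nominating team above their own median, zero to a bar with no vertical extent, and a negative difference to a judge who scored their nominating team below their median.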

Judge four awarded their nominating team the same score as their own median score, hence the bar has no vertical extent.

The two blue bars are judges who rated their own nominating team lower than their median score. For these bars the median score is the top and the bottom is the score the judge awarded to the team that nominated them.

One possible interpretation of blue bars would be that the judge was displeased with the performance of their home islanders relative to the other performances. Possibly the judge was a discerning judge who keenly knew the culture, custom, and dances, and had higher expectations of their own home islanders.

From an outsider's perspective, the system of designing in bias appears to have worked no less well than prior attempts to find truly neutral observers, and may have worked better. No judge expressed an inability to judge what they were watching, something past "neutral" judges had complained about. The theoretically "biased" judges performed much as I actually thought they might: they made honest assessments of how each group, each team, performed. They may even have been less biased than "neutral" judges.
