Large language models and independent samples t-tests in statistics

Large language models are particularly ill-suited to solving mathematical and statistical problems. The underlying neural networks, which predict the next most likely word or phrase based on training data, make no actual calculations. Under the hood, some AI models now divert mathematical questions to rules-based models and mathematical engines.

In a data exploration exercise, students were given data based on research into soil organic carbon sequestration under two types of tree: nitrogen-fixing trees in the Fabaceae and trees without nitrogen-fixing capabilities in the Myrtaceae. The data on soil organic carbon stored per square meter per year were synthesized from the original research report.

The students were given the data and then instructed to:

  1. Calculate the mean for the Falcataria trees.
  2. Calculate the mean for the Eucalyptus trees.
  3. Make a labelled column chart of the two means.
  4. Use the function =TTEST(A2:A21,B2:B21,2,3) to calculate the p-value.
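
In the TTEST function, the third argument (2) requests a two-tailed test and the fourth argument (3) requests a two-sample test with unequal variances, that is, Welch's t-test. For readers working outside a spreadsheet, here is a minimal sketch of the same calculation in Python; the two short lists are placeholder numbers for illustration only, not the exercise data, which appears in the prompt below.

# Illustrative sketch only: placeholder numbers, not the exercise data.
# In Google Sheets, =TTEST(A2:A21,B2:B21,2,3) means:
#   tails = 2 -> two-tailed test
#   type  = 3 -> two-sample test with unequal variances (Welch's t-test)
# scipy expresses the same test with equal_var=False.
from scipy import stats

group_a = [12.0, 15.5, 9.8, 14.1, 11.3]   # hypothetical column A values
group_b = [18.2, 21.0, 17.4, 19.9, 22.5]  # hypothetical column B values

t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.4f}")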

The students were then told to copy and paste the following prompt into an AI of their own choice. Of the 26 students, only one was unaware of AI; the other 25 were already aware of AI and had used it. The students are still at a stage where they are shy about openly admitting that they use AI.

[Prompt]

Data table:

Eucalyptus trees,Falcataria trees

-27,144
-23,119
123,-47
-141,210
93,-33
72,5
-77,187
-28,162
-1,98
-56,181
154,-101
136,-60
-113,205
128,-53
1,93
50,36
-99,204
-26,136
42,62
52,24

The data table provides data on the kilograms of carbon sequestered in the soil per square meter per year for two types of trees in two different locations. The samples are independent samples. The trees were not paired in any way. Which type of tree sequesters more carbon? Is the difference significant? Run a hypothesis test for this data. Include a chart of the mean carbon sequestered by each tree type.

[End prompt]
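
For readers who want to check the analysis the prompt asks for without a spreadsheet, the sketch below runs the same calculation in Python on the data from the prompt: both means, a two-tailed Welch's t-test, and a labelled column chart of the means. Python, scipy, and matplotlib are assumptions of this sketch; the students themselves worked in Google Sheets.

# Minimal sketch of the analysis the prompt requests, using the data above.
# (Python/scipy/matplotlib are assumed here; the students used Google Sheets.)
from scipy import stats
import matplotlib.pyplot as plt

eucalyptus = [-27, -23, 123, -141, 93, 72, -77, -28, -1, -56,
              154, 136, -113, 128, 1, 50, -99, -26, 42, 52]
falcataria = [144, 119, -47, 210, -33, 5, 187, 162, 98, 181,
              -101, -60, 205, -53, 93, 36, 204, 136, 62, 24]

mean_euc = sum(eucalyptus) / len(eucalyptus)
mean_fal = sum(falcataria) / len(falcataria)
t_stat, p_value = stats.ttest_ind(eucalyptus, falcataria, equal_var=False)

print(f"Eucalyptus mean: {mean_euc:.2f}")
print(f"Falcataria mean: {mean_fal:.2f}")
print(f"Two-tailed Welch's t-test p-value: {p_value:.4f}")

# Labelled column chart of the two means, as the prompt requests.
plt.bar(["Eucalyptus", "Falcataria"], [mean_euc, mean_fal])
plt.ylabel("Mean carbon sequestered (kg per square meter per year)")
plt.title("Mean soil carbon sequestered by tree type")
plt.show()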

The assignment instructions continued with the following directions:

Make a slide presentation that compares your calculations and results with those of the AI. Report on where they agree and disagree. Include charts from both your manual work and the AI, if the AI gave you a chart. Some AI programs might not generate a chart. If you get no chart from the AI, report that as a difference. In the discussion and conclusion of your presentation, explain what the AI got right and what the AI got wrong.

If the AI does not report a specific p-value, note that in the presentation. If the AI does report a p-value, include that value in the presentation.

In the presentation, include the address of the AI website used and the name of the AI model. If the model reports who produced it, include that information too. To determine the AI model, use the prompt, "What AI model is this?"

In the above slide a student has used a screenshot of their calculations in Google Sheets. The means and p-value are the correct values for this exercise.


The student also made a chart of the mean values.


Perplexity AI returned incorrect mean values, did not produce a chart, and refused to give a specific p-value. Perplexity did assert that the p-value was less than 0.05, but without citing the actual value.


OpenAI ChatGPT 5 returned the above results for another student. The test was to be a two-tailed, unequal-variance test.


For this student, ChatGPT 5 returned the above chart.

For another student, ChatGPT 5.1 returned different incorrect results.


The student noted that ChatGPT 5.1 provided a description of the chart, but no chart.


A student who reported using ChatGPT 5 Mini received the above results. The values are correct and the chart is appropriate.

Two other students reported using ChatGPT, but did not specify the model, and also obtained correct results. The unpredictability of the same model on the same data is interesting, perhaps a byproduct of the randomness the LLM introduces when sampling its output.



Microsoft Copilot returned the correct means. 


Copilot also produced a chart. The black lines appear to be plus and minus one standard error of the mean. 
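
The standard error of the mean is the sample standard deviation divided by the square root of the sample size. The sketch below is a hedged illustration of how such error bars can be added to the column chart of the means, using the same data as above; it is not a reconstruction of what Copilot actually did.

# Hedged illustration of error bars at plus and minus one standard error of
# the mean (SEM), which is what the black lines on Copilot's chart appear to
# show. This is not a reconstruction of Copilot's actual steps.
import numpy as np
import matplotlib.pyplot as plt

eucalyptus = np.array([-27, -23, 123, -141, 93, 72, -77, -28, -1, -56,
                       154, 136, -113, 128, 1, 50, -99, -26, 42, 52])
falcataria = np.array([144, 119, -47, 210, -33, 5, 187, 162, 98, 181,
                       -101, -60, 205, -53, 93, 36, 204, 136, 62, 24])

means = [eucalyptus.mean(), falcataria.mean()]
# SEM = sample standard deviation / sqrt(n); ddof=1 gives the sample standard deviation.
sems = [eucalyptus.std(ddof=1) / np.sqrt(len(eucalyptus)),
        falcataria.std(ddof=1) / np.sqrt(len(falcataria))]

plt.bar(["Eucalyptus", "Falcataria"], means, yerr=sems, capsize=6)
plt.ylabel("Mean carbon sequestered (kg per square meter per year)")
plt.title("Means with error bars at plus and minus one SEM")
plt.show()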

The above assignment is the second iteration of this exercise; a previous run took place in July 2025. In the July exercise the large language model produced a different result than an earlier version of the model had in May. In May the LLM performed the appropriate analysis and obtained correct values; in July a newer version of the same LLM obtained incorrect values.

The split in results seen above for the ChatGPT models, some correct and some not, echoes that split between the May and July results. LLM AI models are inconsistently inconsistent in the results they produce from a data set, as one ought to expect them to be.

And at other times ChatGPT is almost spookily correct, as in here and here.

Future development of this assignment might include activities such as a classwide look at the results. 
