Large language models and independent samples t-tests in statistics
Large language models are particularly ill-suited to solving mathematical and statistical problems. The underlying neural networks predict the next most likely word or phrase based on training data and make no actual calculations. Under the hood, some AI models divert mathematical questions to rule-based models and mathematical engines.
In a data exploration exercise, students were given data based on research into soil organic carbon sequestration under two types of tree: nitrogen-fixing trees in Fabaceae and trees without nitrogen-fixing capabilities in Myrtaceae. The data on soil organic carbon stored per square meter per year was synthesized based on the original research report.
The students were given data and then instructed:
- Calculate the mean for the Falcataria trees.
- Calculate the mean for the Eucalyptus trees.
- Make a labelled column chart of the two means.
- Use the function =TTEST(A2:A21,B2:B21,2,3) to calculate the p-value.
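For readers who want to check the spreadsheet steps outside of a spreadsheet, the following is a minimal sketch in Python using only the standard library. It assumes that column A holds the Eucalyptus values and column B the Falcataria values, in the order of the data table in the prompt, and that the arguments 2 and 3 in TTEST request a two-tailed test with unequal variances (Welch's t-test). The function names `t_pdf` and `welch_t_test` are illustrative, and the p-value is approximated by numerical integration rather than by whatever method the spreadsheet uses internally.

```python
import math
import statistics

# Data copied from the table in the prompt (assumed column order: A, B).
eucalyptus = [-27, -23, 123, -141, 93, 72, -77, -28, -1, -56,
              154, 136, -113, 128, 1, 50, -99, -26, 42, 52]
falcataria = [144, 119, -47, 210, -33, 5, 187, 162, 98, 181,
              -101, -60, 205, -53, 93, 36, 204, 136, 62, 24]


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def welch_t_test(a, b, steps=2000):
    """Two-tailed, unequal-variance t-test. Returns (t, df, p)."""
    va = statistics.variance(a) / len(a)  # squared standard error of mean a
    vb = statistics.variance(b) / len(b)  # squared standard error of mean b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    # Simpson's rule for the area under the density between 0 and |t|
    # (steps must be even).
    h = abs(t) / steps
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    p = 1 - 2 * central_area  # both tails together
    return t, df, p


mean_eucalyptus = statistics.mean(eucalyptus)
mean_falcataria = statistics.mean(falcataria)
t_stat, dof, p_value = welch_t_test(eucalyptus, falcataria)

print(f"Eucalyptus mean: {mean_eucalyptus:.1f}")
print(f"Falcataria mean: {mean_falcataria:.1f}")
print(f"t = {t_stat:.3f}, df = {dof:.1f}, p = {p_value:.4f}")
```

For the table in the prompt the two means work out to 13.0 for Eucalyptus and 78.6 for Falcataria. The printed p-value can be compared against the TTEST result from the spreadsheet.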
Then the students were told to copy and paste the following prompt into the AI of their own choice. Of 26 students, only one was unaware of AI. The other 25 were already aware of AI and had used it. The students are still at a stage where they are reluctant to admit openly that they use AI.
[Prompt]
Data table:
Eucalyptus trees,Falcataria trees
-27,144
-23,119
123,-47
-141,210
93,-33
72,5
-77,187
-28,162
-1,98
-56,181
154,-101
136,-60
-113,205
128,-53
1,93
50,36
-99,204
-26,136
42,62
52,24
The data table provides data on the kilograms of carbon sequestered in the soil per square meter per year for two types of trees in two different locations. The samples are independent samples. The trees were not paired in any way. Which type of tree sequesters more carbon? Is the difference significant? Run a hypothesis test for this data. Include a chart of the mean carbon sequestered by each tree type.
[End prompt]
The assignment instructions continued with the following directions:
Make a slides presentation that compares your calculations and results with those of the AI. Report on where they agree and disagree. Include charts from both your manual work and the AI, if the AI gave you a chart. Some AI programs might not generate a chart. If you get no chart from the AI, report that as a difference. In the discussion and conclusion for your presentation, explain what the AI got right and what the AI got wrong.
If the AI does not report a specific p-value, report that in the presentation. If the AI does report a p-value, report that in the presentation.
In the presentation include the address of the AI website used and the name of the AI model. If the model includes who produced the model, include that information too. To determine the AI model, use the prompt, "What AI model is this?"
In the above slide a student has used a screenshot of their calculations in Google Sheets. The means and p-value are the correct values for this exercise.
The student also made a chart of the mean values.
Perplexity AI returned incorrect mean values, did not produce a chart, and refused to give a p-value. Perplexity did assert that the p-value was less than 0.05, but without citing the specific p-value.
OpenAI ChatGPT 5 returned the above results for another student. The test was to be a two-tailed, unequal-variance test.
For this student, ChatGPT 5 returned the above chart.
For another student ChatGPT 5.1 returned different incorrect results.








