Azure AI Foundry: Evaluating the determinism of GPT models for effective unstructured data processing
Motivation
I have done some writing and am doing some sessions about processing unstructured data with AI/ML. In these writings and sessions, I address the considerations regarding non-determinism when using AI/ML to transform unstructured data.
At my last session, while demonstrating how to use OpenAI GPT-3.5-Turbo to extract data from transcribed files, I was asked how to ensure determinism. My immediate answer was to store the model in a repository and keep track of its version, and in the case of OpenAI GPT, to set the temperature to 0.
It was mentioned that it could still return a different answer.
So I have done some testing to examine this, and also to get wiser about what I can expect from the temperature setting.
First, why do we need determinism?
It depends on the context, but when managing a data platform, we should always be able to perform the same query and receive the same answers. Often, management decisions are based on reports, and it can be devastating if these reports change because a table is reloaded with data different from previous loads.
The nondeterminism risk of GPT models
LLMs, in general, are probability machines that predict the next most appropriate word to add to a given text.
Probability can also introduce randomness, which can break determinism, and there is a concept of temperature to somewhat control this.
When an LLM is predicting the next word, it has multiple words to select from. Each proposed word is ranked by probability. A temperature of zero should always select the word with the highest probability, while a temperature of one samples from the proposed words according to their probabilities.
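To illustrate the idea (this is a toy sketch, not how the GPT models are actually implemented), temperature can be thought of as a scaling applied to the candidate scores before sampling:

```python
import math
import random

def sample_next_token(scores: dict[str, float], temperature: float) -> str:
    """Toy illustration of temperature sampling, not OpenAI's actual implementation."""
    if temperature == 0:
        # Greedy selection: always the highest-scoring token, so the output is deterministic
        return max(scores, key=scores.get)
    # Softmax with temperature: a lower temperature sharpens the distribution
    exps = {tok: math.exp(score / temperature) for tok, score in scores.items()}
    total = sum(exps.values())
    weights = [exps[tok] / total for tok in exps]
    # Sample one token according to the temperature-scaled probabilities
    return random.choices(list(exps), weights=weights, k=1)[0]

candidates = {"happy": 2.0, "sad": 1.0, "missing": 0.5}
print(sample_next_token(candidates, temperature=0))    # always "happy"
print(sample_next_token(candidates, temperature=1.0))  # usually "happy", sometimes not
```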
Test setup
I’m using Azure AI Foundry to easily manage my models, and it is not that expensive. I am testing OpenAI GPT-3.5-Turbo (version 1106) and OpenAI GPT-4o (version 2024-08-06), calling each with the same system message and user message 100 times.
A user message is the user’s input to the model. System messages are used to instruct the model on how to behave when responding to the user message.
Unfortunately, I don’t have access to OpenAI o1, but the results from the tests seem very conclusive without it.
I instructed the model with the following system message:
SYSTEM_MESSAGE = """You are text reason extractor, you need to find sentiment and return it as
positive, negative or mixed. Breaches are bad sentiment.
You also need to return absolute number of missing cows. It needs
to be return in correctly formatted JSON, one object with the 2 properties:
the sentiment property holds the sentiment only,
and the number_of_missing_cows holds the number of cows only"""
The user message:
USER_MESSAGE = """The field looked nice, food seems OK, cows looks happy. We have a breach. 5 cows are missing."""
The expected result:
EXPECTED_RESULT = """{
"sentiment": "negative",
"number_of_missing_cows": 5
}"""
See testing script after the conclusion.
Testing
I ran 4 tests: 2 models, each with temperature 0.0 and temperature 1.0.
OpenAI GPT-3.5-Turbo with temperature 0.0
Running 100 samples against OpenAI GPT-3.5-Turbo:
GPT-3.5-turbo: Expected results 100, unexpected results 0
OpenAI GPT-4o with temperature 0.0
Running 100 samples against GPT-4o actually gave unexpected results; in fact, all 100 of them were unexpected. This was due to GPT-4o marking up the JSON in Markdown style with backticks.
GPT-4o: Expected results 0, unexpected results 100
This doesn’t mean that the model was non-deterministic, but it supports my point that the model used to process unstructured data should be bookkept: new models can produce different results.
I updated the expected result:
# For GPT4o
EXPECTED_RESULT = """```json
{
"sentiment": "negative",
"number_of_missing_cows": 5
}
```"""
Then running 100 samples against GPT-4o:
GPT-4o: Expected results 100, unexpected results 0
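If you do not want the Markdown fences to decide whether a result counts as expected, one option is to normalize the response before comparing it. A small sketch of that idea (not part of my test script):

```python
def strip_markdown_fences(text: str) -> str:
    """Remove a surrounding ```json ... ``` block if present, otherwise return the text unchanged."""
    stripped = text.strip()
    if stripped.startswith("```"):
        lines = stripped.splitlines()
        # Drop the opening fence (e.g. ```json); drop the closing fence if there is one
        lines = lines[1:-1] if lines[-1].strip() == "```" else lines[1:]
        stripped = "\n".join(lines).strip()
    return stripped

# With this, GPT-3.5-Turbo and GPT-4o responses can be checked against the same expected result
assert strip_markdown_fences('```json\n{"sentiment": "negative"}\n```') == '{"sentiment": "negative"}'
```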
OpenAI GPT-3.5-Turbo with temperature 1.0
Running 100 calls:
GPT-3.5-turbo: Expected results 76, unexpected results 24
The model is non-deterministic. The interesting part is that the unexpected results are the same as the deterministic result from the OpenAI GPT-4o model with a temperature of 0:
```json
{
"sentiment": "negative",
"number_of_missing_cows": 5
}
```
OpenAI GPT-4o with temperature 1.0
Running 100 calls:
GPT-4o: Expected results 90, unexpected results 10
This also turned out to be non-deterministic. Something quite interesting here as well: the model switched between using 2 and 4 spaces for indentation. Nothing that should break code, so it is not severe. There were also times when it responded like in the OpenAI GPT-3.5-Turbo test, without the surrounding backticks, which would have been severe.
{
"sentiment": "negative",
"number_of_missing_cows": 5
}
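Since only the whitespace varies, an alternative to comparing raw strings would be to parse both the response and the expected result as JSON and compare the parsed objects, so the indentation no longer matters. A minimal sketch:

```python
import json

EXPECTED = {"sentiment": "negative", "number_of_missing_cows": 5}

def matches_expected(content: str) -> bool:
    """True if the response parses as JSON equal to the expected object, regardless of whitespace or key order."""
    try:
        return json.loads(content) == EXPECTED
    except json.JSONDecodeError:
        return False

# Both indentation variants count as the same result
print(matches_expected('{\n  "sentiment": "negative",\n  "number_of_missing_cows": 5\n}'))    # True
print(matches_expected('{\n    "sentiment": "negative",\n    "number_of_missing_cows": 5\n}'))  # True
```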
Conclusion
Within a given OpenAI GPT model, it is possible to achieve determinism by setting the temperature to 0. The system message was fairly strict, yet the two models behaved differently from each other, and the system message was not strict enough to give deterministic output with temperature 1.0.
I ran the test multiple times, and across close to 1,000 samples I never saw the models act differently with temperature 0.0.
Let me end by repeating myself: if you use AI or machine learning to process data in an analytics platform, do bookkeeping of the model used for each batch of data, so you have the opportunity to recreate a table on reload.
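What that bookkeeping looks like depends on the platform, but even appending the model name, version and temperature to a log for each batch gives you something to reproduce against. A minimal sketch, with a made-up file name and fields:

```python
import datetime
import json

def record_batch_model(batch_id: str, model: str, model_version: str, temperature: float) -> None:
    """Append the model details used for a batch to a JSON-lines log (illustrative only)."""
    entry = {
        "batch_id": batch_id,
        "model": model,
        "model_version": model_version,
        "temperature": temperature,
        "processed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    with open("model_bookkeeping.jsonl", "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")

record_batch_model("transcripts_2024_10_01", "gpt-35-turbo", "1106", 0.0)
```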
Script for tests
import json
import time
import requests
API_KEY = "" # Azure AI Foundry/OpenAI API Key
URL = "" # Azure AI Foundry/OpenAI API URL
SYSTEM_MESSAGE = """You are text reason extractor, you need to find sentiment and return it as
positive, negative or mixed. Breaches are bad sentiment.
You also need to return absolute number of missing cows. It needs
to be return in correctly formatted JSON, one object with the 2 properties:
the sentiment property holds the sentiment only,
and the number_of_missing_cows holds the number of cows only"""
USER_MESSAGE = """The field looked nice, food seems OK, cows looks happy. We have a breach. 5 cows are missing."""
#For GPT3.5-turbo
#EXPECTED_RESULT = """{
# "sentiment": "negative",
# "number_of_missing_cows": 5
#}"""
# For GPT4o
EXPECTED_RESULT = """```json
{
"sentiment": "negative",
"number_of_missing_cows": 5
}
```"""
headers = {
    "Content-Type": "application/json",
    "api-key": API_KEY
}
# Chat completions request body; temperature is the setting under test (0.0 or 1.0)
payload = {
    "messages": [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": USER_MESSAGE}
    ],
    "max_tokens": 200,
    "temperature": 1.0
}
TOTAL_REQUESTS = 100
SLEEP_TIME = 5
expected_results = 0
unexpected_results = 0
# Call the model TOTAL_REQUESTS times and count exact matches against EXPECTED_RESULT
for i in range(0, TOTAL_REQUESTS):
    try:
        response = requests.post(URL, headers=headers, data=json.dumps(payload))
        if response.status_code == 200:
            completion = response.json()
            content = completion['choices'][0]['message']['content']
            if content == EXPECTED_RESULT:
                expected_results += 1
            else:
                unexpected_results += 1
                print(content)
        else:
            print(f"Request {i} encountered an error: {response.status_code}")
        response.close()
    except Exception as e:
        print(f"Request {i} encountered an error: {e}")
    time.sleep(SLEEP_TIME)

print(f'GPT-4o: Expected results {expected_results}, unexpected results {unexpected_results}')