How your choice of words influences the quality of answers in ChatGPT
What goes around comes around: this also seems to apply to generative AI. American researchers have investigated how the choice of words matters when dealing with ChatGPT.
Do you start your ChatGPT prompts with a friendly greeting? Have you asked for the output in a specific format? Or do you even promise a tip for a particularly good response? Users interact with large language models (LLMs) like ChatGPT in a variety of ways, including to label their data for machine learning tasks. There are few answers to how small changes to a prompt can affect the accuracy of these labelings.
How do variants of prompts change the output quality?
Abel Salinas, a researcher at the University of Southern California (USC), says: "We rely on these models for so many things and require outputs in certain formats, and we wonder in the back of our minds what the actual effects of variations in prompts or output formats are. That's what we wanted to find out." Salinas and Fred Morstatter, research assistant professor of computer science at USC's Viterbi School of Engineering and leader of the research team at the USC Information Sciences Institute (ISI), asked themselves the question: how reliable are LLMs' responses to variations in prompts? Their results, published on the preprint server arXiv, show that subtle variations in prompts can have a significant impact on LLMs' predictions.
"Hello, give me a list and I'll tip you a thousand dollars"
The researchers examined four categories of prompt variations. First, they examined the effects of prompting responses in specific output formats commonly used in data processing (lists, CSV, etc.). Second, they examined minor changes to the prompt itself, such as adding extra spaces at the beginning or end of the prompt or inserting polite phrases such as "Thank you" or "Hello!". Third, they explored the use of "jailbreaks", i.e. techniques to bypass content filters on sensitive topics such as hate speech detection, for example by asking the LLM to respond as if it were evil. Finally, inspired by the popular notion that an LLM will respond better if given the prospect of a reward, they offered different sizes of 'tips' for a 'perfect answer'.
The researchers then tested the prompt variations against 11 benchmark text classification tasks - standardized data sets or problems used in natural language processing (NLP) research to evaluate model performance. These tasks typically involve categorizing or labeling text data based on its content or meaning.
The researchers examined tasks such as toxicity classification, grammar evaluation, humor and sarcasm recognition, math skills and more. For each variation of the prompt, they measured how often the LLM changed its response and what effect this had on the accuracy of the LLM.
Does saying "Hello!" influence the answers? Yes!
The results of the study brought to light a remarkable phenomenon: slight changes in the structure and presentation of the prompt can significantly affect the predictions of the LLM. Whether it is the addition or omission of spaces, punctuation, or specific data output formats, each variation plays a crucial role in shaping model performance. In addition, certain prompt strategies, such as incentives or specific greetings, showed marginal improvements in accuracy, highlighting the nuanced relationship between prompt design and model behavior.
The following results were remarkable:
- Just by adding a specific output format, the researchers found that at least 10 % of the predictions changed.
- Minor disturbances to the prompt have a smaller impact than the output format, but still result in a significant number of altered predictions. For example, inserting a space at the beginning or end of a prompt resulted in more than 500 (out of 11,000) prediction changes. Similar effects were observed when common greetings were added or ended with "thank you".
- The use of jailbreaks in the tasks resulted in a much larger proportion of changes, but was highly dependent on which jailbreak was used.
Tips for ChatGPT? Hardly any influence on performance...
For 11 tasks, the researchers found different levels of accuracy for each prompting variant. They found that no single formatting or perturbation method was suitable for all tasks. Remarkably, the "no specified format" variant achieved the highest overall accuracy, outperforming the other variants by a full percentage point.
Salinas: "We have found that there are some formats or variations that lead to poorer accuracy. For certain applications, very high accuracy is crucial, so this could be helpful. For example, if you format in an older format called XML, that leads to a few percentage points lower accuracy."
As for tipping, only minimal changes in performance were observed. The researchers found that adding "I don't tip, by the way" or "I tip 1,000 dollars for a perfect answer!" (or anything in between) had no significant effect on the accuracy of responses. However, experimentation with jailbreaks showed that even seemingly harmless jailbreaks can lead to a significant loss of accuracy.
Possible explanations for the behavior of LLMs
Why LLMs behave differently is unclear, but the researchers have some ideas. They hypothesize that the instances that change the most are the things that are most "confusing" to the LLM. To measure confusion, they looked at a particular subset of tasks on which the human annotators disagreed (i.e., the human annotators may have found the task confusing, so perhaps the model did too). The researchers found a correlation suggesting that instance confusion has some explanatory power for why the prediction changes, but it is not strong enough on its own. There may be other factors at play, the researchers hypothesized.
Salinas suspects that one factor could be the relationship between the inputs used to train the LLM and the subsequent behavior. "In some online forums, it makes sense for someone to add a greeting, such as on Quora, an American knowledge-sharing platform. There it is common to start with 'hello' or add a 'thank you'." These conversational elements could influence the learning process of the models. If greetings are frequently associated with information on platforms such as Quora, a model might learn to favor such sources and potentially bias its responses based on Quora's information about that particular task. This observation points to the complexity with which the model ingests and interprets information from different online sources.
Practical tip: Keep it simple for best accuracy
An important next step for the research community as a whole would be to create LLMs that can withstand these changes and provide consistent responses in the face of formatting changes, glitches and jailbreaks. To achieve this goal, a better understanding of why responses change is needed in the future.
Salinas gives the following tip for prompting in Chat GPT: "Our simplest observation is that the simplest possible prompts seem to deliver the best results overall."
Source: Techexplore.com