Can we make LLM-based AI assistants more truthful in practical applications by aligning them so that they recognize what they do not know and express that ignorance in words?
Recently, AI assistants based on large language models (LLMs) have demonstrated impressive performance across a
variety of tasks, including casual conversation with users, solving mathematical problems, writing code, and
utilizing external tools.
Despite possessing a wide range of world knowledge, large language models are still highly susceptible to model
hallucinations, such as producing factual errors or imitative falsehoods.
The main findings of this blog are as follows:
An AI assistant's understanding of its own knowledge can be divided into four categories via the knowledge quadrant: questions it knows and knows it knows (IK-IK), questions it does not know and knows it does not know (IK-IDK), questions it knows but does not know it knows (IDK-IK), and questions it does not know and does not know it does not know (IDK-IDK).
To let the AI assistant know what it knows and does not know, we align it with a model-specific "I don't know" (Idk) dataset.
The Idk dataset records which questions a specific AI assistant knows the answers to and which it does not.
We build the Idk dataset on top of TriviaQA, a popular knowledge-intensive open-domain question-answering dataset.
As shown in the upper part of Figure 2, we randomly sample ten responses for each question and judge whether each response is correct, which yields an accuracy for each question. For questions whose accuracy reaches the given Ik threshold, we consider that the model knows the answer and randomly select one of its correct responses as the annotated reply. Otherwise, we consider that the model does not know the answer and use a refusal template as the annotated reply, represented in the figure by "I don't know."

The lower half of Figure 2 shows how we construct preference data. We first use half of the Idk data for SFT training, then use the SFT-trained model to collect responses on the other half of the Idk data and assemble preference pairs. Each preference pair consists of a question, a chosen response, and a rejected response. For each question, we sample ten responses from the SFT model. For questions the model knows, shown in the light green box in the figure, we use all of the model's correct responses as chosen responses and the refusal template as the rejected response. For questions the model does not know, shown in the light blue box, we use all of the model's incorrect responses as rejected responses and the refusal template as the chosen response.

For simplicity, we set the Ik threshold to 1.0; that is, we consider that the model knows the answer to a question only when all ten of its responses are correct. We discuss how different Ik threshold values affect the model's behavior later.
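To make the construction concrete, here is a minimal Python sketch of the two steps above. The function names, the `generate`/`is_correct` helpers, the refusal string, and the field names are hypothetical placeholders rather than the authors' actual implementation.

```python
import random

IDK_REPLY = "I don't know."  # hypothetical refusal template

def build_idk_dataset(questions, generate, is_correct, ik_threshold=1.0, n_samples=10):
    """Label each question by whether the model 'knows' it (sampled accuracy reaches the Ik threshold)."""
    dataset = []
    for q in questions:
        responses = [generate(q) for _ in range(n_samples)]
        correct = [r for r in responses if is_correct(q, r)]
        accuracy = len(correct) / n_samples
        if correct and accuracy >= ik_threshold:
            reply = random.choice(correct)   # model knows: keep one of its correct answers
        else:
            reply = IDK_REPLY                # model does not know: annotate with a refusal
        dataset.append({"question": q, "reply": reply, "accuracy": accuracy})
    return dataset

def build_preference_pairs(questions, sft_generate, is_correct, knows, n_samples=10):
    """Build (question, chosen, rejected) pairs from the SFT model's sampled responses."""
    pairs = []
    for q in questions:
        responses = [sft_generate(q) for _ in range(n_samples)]
        if knows[q]:
            # Known question: every correct response is a chosen response, refusal is rejected.
            for r in responses:
                if is_correct(q, r):
                    pairs.append({"question": q, "chosen": r, "rejected": IDK_REPLY})
        else:
            # Unknown question: refusal is chosen, every incorrect response is rejected.
            for r in responses:
                if not is_correct(q, r):
                    pairs.append({"question": q, "chosen": IDK_REPLY, "rejected": r})
    return pairs
```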
After acquiring the Idk dataset, we can try to teach the AI assistant to perceive its own knowledge boundaries through alignment, so that it politely declines to answer when a user's question falls outside what it knows.
For the original AI assistant, such as a Llama-2-7b-chat model, we find no obvious tendency to refuse to answer questions it does not know.
As shown in Figure 3, directly instructing the model through prompts to refuse questions it does not know is somewhat effective, but a significant number of "IDK-IK" and "IDK-IDK" questions remain. After SFT with the Idk dataset, the number of "IDK-IK" and "IDK-IDK" questions drops markedly, indicating that the model's ability to perceive its own knowledge boundaries has been enhanced. However, SFT introduces an unexpected side effect: it makes the model more conservative, which reduces the number of "IK-IK" questions. We further find that, compared with SFT alone, preference-aware optimization (such as DPO) mitigates this phenomenon, encouraging the model to answer more often and reducing the cases where it erroneously refuses questions it actually knows.
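For reference, below is a minimal sketch of the standard DPO objective used in this kind of preference-aware optimization; the function signature, tensor shapes, and beta value are illustrative assumptions, not the exact training setup used here.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: make the policy prefer the chosen response over the
    rejected one by a wider margin than a frozen reference model does."""
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()

# Each argument is a tensor of per-pair sequence log-probabilities, e.g. of shape
# (batch_size,), computed over (question, chosen) and (question, rejected) pairs.
```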
In Figure 4, we present more detailed experimental results, including the outcomes of various alignment methods such as SFT, DPO, PPO, BoN, and HIR.
The number of IK-IK and IK-IDK questions contained in the test set can be seen as an approximate upper bound on achievable performance. TRUTHFUL is the sum of the IK-IK and IK-IDK proportions: both kinds of responses are truthful and introduce no additional false information, so TRUTHFUL represents the model's truthfulness.

Simply using an Idk prompt to make the model refuse questions it does not know is somewhat effective, but the TRUTHFUL rate stays at only 66.93%, with a significant number of IDK-IDK questions. Idk-SFT raises the TRUTHFUL rate to 74.75%, but it lowers the IK-IK rate; this side effect of SFT can be regarded as a kind of "alignment tax." We further find that preference-aware optimization encourages models to answer more often and thereby mitigates this side effect: algorithms such as DPO, PPO, and BoN reduce the decline in IK-IK while maintaining a relatively high IK-IDK rate. Idk-BoN achieves the highest TRUTHFUL rate. Idk-HIR improves the IK-IDK rate but helps less with the IK-IK rate; on the other hand, it offers a way to switch the Ik threshold without retraining the model.

In summary, by aligning the AI assistant with the Idk dataset, which encodes its knowledge boundaries, we can turn IDK-IK and IDK-IDK questions into IK-IK and IK-IDK questions. The AI assistant can clearly perceive whether it knows the answer to most questions in the test set, and it does so far more accurately than before alignment.
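As a concrete illustration of how the quadrant counts and the TRUTHFUL rate can be tallied, here is a minimal sketch. The `is_refusal`/`is_correct` helpers and the field names are assumptions, and the handling of edge cases (such as a wrong answer to a known question) follows one plausible convention rather than the authors' exact evaluation script.

```python
def knowledge_quadrant(examples, is_refusal, is_correct):
    """Tally IK-IK / IK-IDK / IDK-IK / IDK-IDK counts and the TRUTHFUL rate.

    Each example carries the question, the model's response, and a `knows` flag
    taken from the Idk dataset labels (True if the model is judged to know it).
    """
    counts = {"IK-IK": 0, "IK-IDK": 0, "IDK-IK": 0, "IDK-IDK": 0}
    for ex in examples:
        refused = is_refusal(ex["response"])
        if ex["knows"]:
            # Known question: a correct answer is IK-IK; refusing (or answering
            # wrongly) means the model failed to realize it knows.
            if not refused and is_correct(ex["question"], ex["response"]):
                counts["IK-IK"] += 1
            else:
                counts["IDK-IK"] += 1
        else:
            # Unknown question: refusing is IK-IDK; answering is IDK-IDK.
            if refused:
                counts["IK-IDK"] += 1
            else:
                counts["IDK-IDK"] += 1
    truthful_rate = (counts["IK-IK"] + counts["IK-IDK"]) / len(examples)
    return counts, truthful_rate
```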
Although fine-tuning on the TriviaQA training set cannot by itself improve performance on the TriviaQA test set (the training set does not introduce the knowledge the test set requires), we also introduce two different datasets as out-of-distribution (OOD) tests, and models trained on TriviaQA demonstrate good generalization on these OOD test sets. We construct an Idk dataset (test-set portion only) from Natural Questions using the same method, with the Ik threshold again set to 1.0. The results on Natural Questions are similar to those on TriviaQA: compared with simply using prompts, the trained model achieves a higher TRUTHFUL rate. On the ALCUNA test set, where every question should be refused, the model is likewise able to refuse to answer most questions.
The capabilities of large language models generally scale with their parameter count, with larger models often exhibiting stronger abilities. To explore how model size affects an AI assistant's recognition of its own knowledge limitations, we conduct Idk-SFT on three model sizes: Llama-2-7b-chat, Llama-2-13b-chat, and Llama-2-70b-chat. Note that the label distribution of the Idk dataset varies across models (the larger the model, the more IK-IK questions), so the IK-IK and IK-IDK rates cannot be directly compared across models; we therefore focus primarily on the TRUTHFUL rate. The results in Figure 5 show that the 13B model has a slightly higher TRUTHFUL rate than the 7B model, while the 70B model's TRUTHFUL rate is significantly higher than both. This demonstrates that larger models are indeed better at distinguishing between what they know and what they don't know.
Different pretrained models possess distinct knowledge due to their different pretraining processes. We construct a model-specific Idk (I don't know) dataset for each pretrained model because we want the model to judge whether it knows the answer to a question based on its internal knowledge, rather than learning to recognize questions with certain surface patterns; a model-specific Idk dataset ties the model's internal knowledge to the dataset's labels. To explore the impact of training with non-model-specific Idk datasets, we construct two Idk datasets using Mistral-7B-Instruct-v0.1 and Baichuan2-7B-chat, named "Idk-Mistral" and "Idk-Baichuan," respectively. Experimental results in Figure 5 show that using a non-model-specific Idk dataset such as Idk-Mistral or Idk-Baichuan indeed lowers the model's TRUTHFUL rate. Because the Idk-Mistral and Idk-Baichuan datasets contain a large number of Idk questions, the trained models tend to refuse more often, which drives the number of IK-IK questions far below their proportion in the test set. This indicates that constructing model-specific Idk datasets is necessary for models to perceive what they know and what they do not.
Here, we discuss the impact of different Ik thresholds on model behavior. We focus on the effect of the Ik threshold on Idk-SFT and conduct experiments with Llama-2-7b-chat. The most direct impact of the Ik threshold is on the label distribution of the Idk dataset: the higher the threshold, the more questions are marked as "I don't know." As shown in the left graph of Figure 6, the proportion of Idk questions grows with the threshold, because at a high Ik threshold only questions the model is very confident about are labeled as known. As shown in the right graph of Figure 6, increasing the Ik threshold decreases the IK-IK rate, increases the IK-IDK rate, and steadily raises the TRUTHFUL rate. In other words, a higher Ik threshold helps the model better distinguish what it knows from what it does not, making it more truthful overall, whereas a lower Ik threshold makes the model more helpful, since the number of IK-IK questions increases. We also find that as the proportion of Idk questions in the dataset grows, the model tends to refuse to answer more frequently.
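As a toy illustration of how the threshold reshapes the labels, the short sketch below reuses the per-question sampling accuracies from the construction step; the accuracy values are invented toy numbers, not measured data.

```python
def idk_fraction(question_accuracies, ik_threshold):
    """Fraction of questions labeled 'I don't know': a question counts as known
    only if its sampled accuracy reaches the Ik threshold."""
    n_idk = sum(1 for acc in question_accuracies if acc < ik_threshold)
    return n_idk / len(question_accuracies)

# Toy accuracies from sampling ten responses per question (multiples of 0.1).
accuracies = [1.0, 1.0, 0.9, 0.6, 0.3, 0.0]
for threshold in (0.1, 0.5, 1.0):
    print(f"Ik threshold {threshold}: {idk_fraction(accuracies, threshold):.0%} labeled Idk")
```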
In Figure 7, we show the knowledge-quadrant distributions after performing supervised fine-tuning (SFT) on Llama-2-7b-chat with Idk datasets built at different Ik thresholds. As the Ik threshold increases, the number of IK-IK questions decreases, the number of IK-IDK questions increases, and the overall TRUTHFUL rate rises.
In this work, we explore the question "Can AI assistants know what they don't know?" We find that by aligning an AI assistant such as Llama-2-7b-chat with a model-specific Idk ("I don't know") dataset that records what it knows and does not know, the assistant can largely identify the questions it does not know. In open-domain question-answering tests, Llama-2-7b-chat accurately determines whether it knows the answer for 78.96% of the questions and refuses to answer the questions it does not know. To achieve this, we explore various alignment strategies with the Idk dataset, including supervised fine-tuning and preference-aware optimization. Our analysis shows that the Ik threshold, which determines whether the model is judged to know the answer to a question, affects the model's tendency to refuse; that using a non-model-specific Idk dataset tends to lower performance; and that models with more parameters, such as Llama-2-70b-chat, achieve a higher TRUTHFUL rate. The ability to refuse questions beyond its knowledge effectively reduces an AI assistant's factual errors and other hallucinations, and we believe it is an important capability for a truthful AI assistant.
I especially thank Tianxiang Sun, Xiangyang Liu, Wenwei Zhang, and the other co-authors for their guidance and help. I really enjoyed the teamwork with them.
Thanks to my advisor, Prof. Xipeng Qiu, for his guidance and support, and for helping me persevere and complete this work.
I am also grateful to Xinyang Pu for her support. I know we'll both make it through.