Have you ever struggled with something that’s supposed to be easy to use? For me, it’s plastic wrappers: they’re designed to be simple to open, yet they always seem to outsmart me. Just yesterday, I spent at least 10 minutes wrestling with the plastic seal on a vanilla extract bottle before finally resorting to scissors.
Using Generative AI (GenAI) can sometimes feel the same way. It appears straightforward: just ask a question or have a conversation. But the outputs can include mistakes or hallucinations that are hard to detect.
In the first article of this series, we explored the different types of hallucinations and why they matter. Now, let’s dive into how to identify and mitigate them effectively.
Strategies for Mitigating Hallucinations
While no GenAI model is currently immune to hallucinations, several strategies can significantly reduce their occurrence and impact.
- Prompt Engineering: Much like turning to scissors to handle “easy-to-open” plastic wrappers, better prompting can significantly improve the accuracy and effectiveness of AI responses.
- Few-Shot Learning: Few-shot learning is an in-context technique where the prompt provides structured examples to guide the model towards more precise and relevant responses. This method can significantly enhance accuracy by setting expectations for the format and logic of the response.
Let’s begin with a prompt that does not use the technique:
Prompt:
I want something to wash my cashmere sweater, which product should I use? These are the ones available: Blender Pro 2024 edition, Best Wash 3.0, and Rice Cooker Ultra. Output none if the task does not match.
Response:
None of the products you listed are appropriate for washing a cashmere sweater. […]
While this response seems acceptable at first glance, it overlooks that Best Wash 3.0 could potentially be the correct choice for washing clothes, including a cashmere sweater.
Now, let’s apply few-shot learning by adding structured context to the prompt:
Prompt:
I want something to wash my cashmere sweater, which product should I use? These are the ones available: Blender Pro 2024 edition, Best Wash 3.0, and Rice Cooker Ultra. Output none if the task does not match.
Q: What is helpful to make a smoothie? R: Blender Pro 2024 edition ###
Q: Which appliance can I use to cook rice? R: Rice Cooker Ultra ###
Q: What appliance is best to wash clothes? R: Best Wash 3.0.
Response:
Q: What is helpful to wash my cashmere sweater? R: Best Wash 3.0
This time, the model provided the correct answer, recognizing Best Wash 3.0 as the appropriate choice. It also adhered to the structured question-and-answer format provided in the prompt, producing a concise and clear response.
Now, let’s test this technique with a scenario where no appropriate option exists:
Prompt:
I want to buy an appliance from a store, which one should I buy? These are the ones available: Blender Pro 2024 edition, Best Wash 3.0, and Rice Cooker Ultra. Output none if the task does not match.
Q: What is helpful to make a smoothie? R: Blender Pro 2024 edition ###
Q: Which appliance can I use to cook rice? R: Rice Cooker Ultra ###
Q: What appliance is best to wash clothes? R: Best Wash 3.0. ###
Q: Which appliance should I use to bake a cake? R:
Response:
None
The model correctly concluded that none of the listed appliances are suitable for baking a cake. By including explicit examples in the prompt, few-shot learning not only improved the model’s accuracy but also ensured its adherence to the specified response format.
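For readers who want to reproduce this, here is a minimal sketch of how such a few-shot prompt could be assembled and sent programmatically. It assumes an OpenAI-style chat API and an illustrative model name; the product catalogue and examples mirror the ones above.

```python
# Minimal sketch of few-shot prompting, assuming the OpenAI Python SDK and an
# illustrative model name. The examples mirror the product catalogue above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT_EXAMPLES = [
    ("What is helpful to make a smoothie?", "Blender Pro 2024 edition"),
    ("Which appliance can I use to cook rice?", "Rice Cooker Ultra"),
    ("What appliance is best to wash clothes?", "Best Wash 3.0"),
]

def build_few_shot_prompt(question: str) -> str:
    """Prepend structured Q/R examples so the model mirrors their format."""
    header = (
        "Available products: Blender Pro 2024 edition, Best Wash 3.0, "
        "Rice Cooker Ultra. Output None if the task does not match.\n"
    )
    examples = " ###\n".join(f"Q: {q} R: {r}" for q, r in FEW_SHOT_EXAMPLES)
    return f"{header}{examples} ###\nQ: {question} R:"

prompt = build_few_shot_prompt("What is helpful to wash my cashmere sweater?")
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; use whichever model you have access to
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```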
Chain of Thought Reasoning
This approach, introduced in a 2022 paper by Google researchers, involves prompting the model to break down its “reasoning” into sequential steps. Encouraging this structured thinking can significantly reduce errors in reasoning-based tasks.
Let’s illustrate this with an example that doesn’t use the technique:
Prompt:
Where should I forward this email to? The choices are marketing@mycompany.com, ceo@mycompany.com, smith@mycompany.com, spam@mycompany.com.
Subject: Urgent
Body: Helo, it’s Smith. I lost my phone. Can you put me in touch with the CEO?
Response:
The best address to forward this email to is ceo@mycompany.com. […]
Unfortunately, this response is incorrect. The email is clearly a scam; note the misspelling of “Helo,” a common red flag.
Now, let’s apply the chain-of-thought reasoning technique by adding “think step by step” to the prompt:
Prompt:
Where should I forward this email to, think step by step? The choices are marketing@mycompany.com, ceo@mycompany.com, smith@mycompany.com, spam@mycompany.com.
Subject: Urgent
Body: Helo, it’s Smith. I lost my phone. Can you put me in touch with the CEO?
Response:
Let me analyze this step by step:
1) First, check the sender’s claim: […]
2) Look for red flags: […]
3) Consider the risk: […]
Looking at the available forwarding options: […]
Conclusion: This email shows classic signs of a potential scam and should be forwarded to spam@mycompany.com for proper handling and investigation.
By simply adding “think step by step” to the prompt, the model produced a correct and thorough response. While this approach results in longer outputs, the improvement in accuracy is worth the extra detail.
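As a small sketch, the same idea can be wrapped in a helper that appends the trigger phrase and asks for a clearly marked conclusion; the “Conclusion:” convention is an assumption on my part that makes the final answer easy to extract.

```python
# Sketch of zero-shot chain-of-thought prompting: append a "think step by step"
# instruction and request a marked conclusion so the final answer is easy to parse.
def with_chain_of_thought(task: str) -> str:
    return (
        f"{task}\n"
        "Think step by step, then state your final answer on a line "
        "starting with 'Conclusion:'."
    )

def extract_conclusion(model_output: str) -> str:
    """Return the text after the last 'Conclusion:' marker, if present."""
    marker = "Conclusion:"
    if marker not in model_output:
        return model_output.strip()
    return model_output.rsplit(marker, 1)[-1].strip()

prompt = with_chain_of_thought(
    "Where should I forward this email to? The choices are marketing@mycompany.com, "
    "ceo@mycompany.com, smith@mycompany.com, spam@mycompany.com."
)
```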
Structured Prompting
This approach involves providing a clear template or outline for information formatting, guiding the model to extract and organize relevant data effectively. This technique is especially useful when dealing with lengthy or unstructured texts.
Prompt:
Extract the following information from the following text, Discovery information and Properties. Follow the format mentioned. Properties should be internal characteristics.
<copy the full text from wikipedia into this prompt, in this case: https://en.wikipedia.org/wiki/Blender>
Response:
Discovery Information:
The blender, also known as a mixer or liquidiser in British English, is a kitchen and laboratory appliance.
It was first designed by the Polish-American chemist Stephen Poplawski in 1919 and patented in 1922. […]
Properties (Internal Characteristics):
Stationary Blender: Consists of a blender container with a rotating metal or plastic blade at the bottom. […]
Immersion Blender: Hand-held device with a mixing head and rotating blades that can be immersed in any container. […]
General Properties: Features: large visible measurement marks, ease of use, low noise, power usage (300–1000 watts), ease of cleaning, pulse blending option. Applications: […]
This response demonstrates how the model accurately extracts and organizes relevant details using the provided template. The information is categorized under “Discovery Information” and “Properties,” ensuring clarity and adherence to the instructions.
By explicitly defining the output format, the model efficiently navigates complex or lengthy text, delivering concise and structured results.
This makes the technique particularly powerful for extracting specific insights or organizing data from expansive content sources.
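A minimal sketch of how such a template could be defined in code is shown below; asking for JSON output is an extra assumption that simplifies parsing, and the field names follow the example above.

```python
import json

# Sketch of structured prompting: the template fixes the output layout (here as
# JSON, an assumption that simplifies parsing) before the source text is appended.
EXTRACTION_TEMPLATE = """Extract the following information from the text below.
Respond with JSON only, using exactly these keys:
{{
  "discovery_information": "<who invented it, when, and how it developed>",
  "properties": ["<internal characteristic 1>", "<internal characteristic 2>", "..."]
}}

Text:
{source_text}
"""

def build_extraction_prompt(source_text: str) -> str:
    return EXTRACTION_TEMPLATE.format(source_text=source_text)

def parse_extraction(model_output: str) -> dict:
    """Fails loudly if the model drifted from the requested format."""
    return json.loads(model_output)
```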
Model Improvements
Model Updates
Regularly updating models to the latest version gives access to improved performance, as LLMs are continually refined. However, these updates often come with little clear communication about what has changed.
So, when should a model be updated?
- When the model is being deprecated: To avoid disruptions in functionality.
- When the knowledge cutoff is outdated: If the model’s training data does not meet the requirements of current use cases.
- When benchmarks for relevant tasks improve: Updated models often show better performance in specific applications.
Retrieval Augmented Generation (RAG)
Enhancing GenAI’s accuracy by integrating domain-specific data is a game-changer. One effective method is Retrieval-Augmented Generation (RAG), which involves two key phases:
- Ingestion: domain documents are split into chunks, converted into embeddings, and stored in a vector database.
- Question answering: the question is embedded and sent to the vector database to find the closest relevant chunk(s); those chunks are inserted into a prompt template, and the resulting prompt is submitted to the LLM.
The factual hallucination example of part 1 of this series was most likely caused by a GenAI model lacking up-to-date knowledge about Jupiter’s moons. To fix this, I can upload the relevant Wikipedia page to fill this gap for the model to accurately answer, “How many moons does Jupiter have?”
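A minimal, self-contained sketch of both phases is shown below. The “embedding” is a toy word-count vector and the “vector database” is a plain Python list; a real system would use an embedding model and a dedicated vector store, but the flow is the same.

```python
# Toy sketch of RAG: ingestion (chunk + embed + store) and question answering
# (embed the question, retrieve the closest chunks, build the prompt).
# The word-count "embedding" and list-based "store" are stand-ins only.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (replace with a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Phase 1: ingestion - split the document into chunks and store (chunk, embedding) pairs.
document = "<paste the text of the relevant Wikipedia page here>"
chunks = [document[i:i + 500] for i in range(0, len(document), 500)]
vector_store = [(chunk, embed(chunk)) for chunk in chunks]

# Phase 2: question answering - retrieve the closest chunks and build the prompt.
question = "How many moons does Jupiter have?"
query_embedding = embed(question)
top_chunks = sorted(vector_store, key=lambda item: cosine(query_embedding, item[1]), reverse=True)[:2]
prompt = (
    "Answer the question using only the context below.\n\nContext:\n"
    + "\n---\n".join(chunk for chunk, _ in top_chunks)
    + f"\n\nQuestion: {question}"
)
# `prompt` is then submitted to the LLM of your choice.
```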
Fine-Tuning
General-purpose models, trained on broad datasets, may falter on specialized tasks. Fine-tuning involves further training the model on targeted training and validation datasets so that its output aligns with specific requirements.
For example, fine-tuning ChatGPT requires preparing custom datasets, which can be implemented through OpenAI’s fine-tuning guide.
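As a sketch, fine-tuning data for a chat model is typically prepared as a JSONL file of example conversations, along the lines below; the records are illustrative, and the exact upload and job-creation steps should be taken from OpenAI’s fine-tuning guide rather than from this snippet.

```python
# Sketch of preparing a fine-tuning dataset in the chat-style JSONL format
# described in OpenAI's fine-tuning guide. The example records are illustrative.
import json

training_examples = [
    {"messages": [
        {"role": "system", "content": "You recommend products from our catalogue only."},
        {"role": "user", "content": "What should I use to wash a cashmere sweater?"},
        {"role": "assistant", "content": "Best Wash 3.0"},
    ]},
    {"messages": [
        {"role": "system", "content": "You recommend products from our catalogue only."},
        {"role": "user", "content": "Which appliance should I use to bake a cake?"},
        {"role": "assistant", "content": "None"},
    ]},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")
# The file is then uploaded and a fine-tuning job created, following the guide.
```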
When to consider fine-tuning:
- If prompting fails to yield consistent results.
- If the problem is well-defined and supported by stable data and schema.
Model Routing
Some models excel in specific tasks more than others. Model routing leverages a system, such as an LLM-based router, to select the most appropriate model for a given prompt.
When the router is incorporated into a workflow, it typically works as follows (a sketch of this pattern follows the list):
- Analyze the prompt to determine the user’s intent.
- Route the request to the most suitable model for that intent (whether a GenAI model or not).
- If no match is found, trigger a default behavior (akin to the default option of a “switch statement” in programming).
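Here is a minimal sketch of that pattern; the keyword-based intent detection stands in for an LLM-based router, and the handler names are hypothetical.

```python
# Sketch of model routing: detect intent, dispatch to the most suitable handler,
# and fall back to a default (the "switch statement" default case).
# The keyword matching stands in for an LLM-based router; handlers are hypothetical.
from typing import Callable

def detect_intent(prompt: str) -> str:
    text = prompt.lower()
    if "translate" in text:
        return "translation"
    if any(word in text for word in ("average", "total", "sum of")):
        return "analytics"
    return "unknown"

ROUTES: dict[str, Callable[[str], str]] = {
    "translation": lambda p: f"[translation model handles: {p}]",
    "analytics": lambda p: f"[non-GenAI analytics pipeline handles: {p}]",
}

def route(prompt: str) -> str:
    handler = ROUTES.get(detect_intent(prompt))
    if handler is None:
        return "No suitable model was found for this request."  # default behavior
    return handler(prompt)

print(route("Translate this sentence into French."))
```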
Reinforcement Learning with Human Feedback (RLHF)
RLHF is a feedback mechanism that integrates human assessments and prioritizes the alignment of GenAI outputs with user expectations.
How it works:
- Humans interact with models and provide feedback on how satisfactory or well-aligned the responses are.
- This feedback is used to improve future iterations of the GenAI model.
When using Google’s Gemini AI, users can rate responses with options like “Good suggestion” or “Bad suggestion.” This feedback loop helps Google continuously refine the AI’s performance based on user satisfaction.
By combining these techniques (model updates, RAG, fine-tuning, model routing, and RLHF), organizations can minimize inaccuracies and inefficiencies.
Verification and Oversight
Mistakes made by GenAI can be so monumental as to undermine the productivity gains from its use, especially when deployed autonomously in companies. Moreover, these errors can be challenging to detect, particularly at scale.
Here are key techniques for ensuring GenAI outputs are reliable and accurate:
Human-in-the-Loop (HITL)
Human-in-the-loop (HITL) is a straightforward concept. It involves assigning a person to review and verify the outputs of a GenAI system.
Most GenAI errors can be identified and corrected by a human reviewer.
Acting as a HITL does not require technical expertise in AI, just the ability to evaluate the accuracy and relevance of the AI’s output. As AI technology continues to evolve and its reliability improves, the need for HITLs may diminish over time.
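One lightweight way to wire a reviewer into a pipeline is a simple review queue that holds GenAI outputs until a person approves them; the sketch below is a generic pattern, not any specific product’s workflow.

```python
# Sketch of a human-in-the-loop gate: generated outputs wait in a review queue
# and are only released once a human reviewer approves them.
from dataclasses import dataclass

@dataclass
class ReviewItem:
    prompt: str
    output: str
    approved: bool | None = None  # None means still pending review
    reviewer_note: str = ""

review_queue: list[ReviewItem] = []

def submit_for_review(prompt: str, output: str) -> ReviewItem:
    item = ReviewItem(prompt=prompt, output=output)
    review_queue.append(item)
    return item

def record_decision(item: ReviewItem, approved: bool, note: str = "") -> None:
    item.approved = approved
    item.reviewer_note = note

def released_outputs() -> list[str]:
    return [item.output for item in review_queue if item.approved]
```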
Evaluation Datasets
As mentioned in the fine-tuning section, GenAI outputs can be fact-checked in real time using algorithms or external databases. The most common approach is to compare outputs against an evaluation dataset.
This approach works best when the task is highly specific, as broader applications may lack a clear dataset for comparison.
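A small sketch of this comparison is shown below; the exact-match scoring after light normalization is an assumption that fits narrow tasks, and real evaluations often need fuzzier matching.

```python
# Sketch of scoring model outputs against an evaluation dataset using exact
# match after light normalization. The dataset and scoring rule are illustrative.
from typing import Callable

EVAL_SET = [
    {"question": "What appliance is best to wash clothes?", "expected": "Best Wash 3.0"},
    {"question": "Which appliance should I use to bake a cake?", "expected": "None"},
]

def normalize(text: str) -> str:
    return text.strip().lower().rstrip(".")

def evaluate_model(generate: Callable[[str], str]) -> float:
    correct = sum(
        normalize(generate(row["question"])) == normalize(row["expected"])
        for row in EVAL_SET
    )
    return correct / len(EVAL_SET)

# Usage: accuracy = evaluate_model(my_llm_call)
```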
LLM Self-Assessment
This method uses two LLMs to enhance oversight:
- The generator: Produces the answers.
- The critic: Reviews and critiques the generator’s output.
This division of roles mirrors chain-of-thought reasoning but separates tasks to enable independent and unbiased assessment.
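A minimal sketch of the generator/critic split, again assuming an OpenAI-style chat API; the role instructions and model name are illustrative.

```python
# Sketch of LLM self-assessment: one call generates an answer, a second call
# critiques it. The model name and role instructions are illustrative.
from openai import OpenAI

client = OpenAI()

def ask(role_instruction: str, content: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": role_instruction},
            {"role": "user", "content": content},
        ],
    )
    return response.choices[0].message.content

question = "How many moons does Jupiter have?"
answer = ask("You are a careful assistant.", question)
critique = ask(
    "You are a strict reviewer. Point out factual errors or unsupported claims.",
    f"Question: {question}\nProposed answer: {answer}",
)
```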
Specialized Metrics
Even before the rise of GenAI, metrics like BLEU and ROUGE were widely used to evaluate the quality of generated text in Natural Language Processing (NLP). These methods are faster and more cost-effective than using an LLM to verify outputs, and they provide an additional signal for assessing validity.
Having multiple evaluation options is always beneficial.
The traditional metrics BLEU and ROUGE primarily focus on surface-level word overlap, measuring n-gram overlap between the generated text and a reference (precision-oriented for BLEU, recall-oriented for ROUGE). However, they don’t account for the underlying meaning of the text.
BERTScore, introduced in 2019, leverages a deep learning model (BERT) to better capture semantic meaning and similarity between texts. It compares a reference text with a candidate text, providing a more nuanced evaluation of semantic understanding.
Similarly, entailment models assess whether one sentence logically implies another, offering a sophisticated metric for evaluating outputs, especially when semantic accuracy is critical.
Combining multiple metrics ensures a more robust assessment of the AI’s outputs, especially when reference material is available.
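As a sketch, these scores can be computed with the Hugging Face evaluate package (assuming it and the underlying metric dependencies are installed); the candidate and reference strings are illustrative.

```python
# Sketch of scoring a candidate against a reference with ROUGE, BLEU, and
# BERTScore via the Hugging Face `evaluate` package (assumed installed along
# with its metric dependencies). Strings are illustrative only.
import evaluate

candidate = ["Jupiter has dozens of confirmed moons."]
reference = ["Jupiter has many confirmed moons orbiting it."]

rouge_scores = evaluate.load("rouge").compute(predictions=candidate, references=reference)
bleu_scores = evaluate.load("bleu").compute(predictions=candidate, references=[reference])
bert_scores = evaluate.load("bertscore").compute(
    predictions=candidate, references=reference, lang="en"
)

print(rouge_scores["rougeL"], bleu_scores["bleu"], bert_scores["f1"][0])
```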
Disclaimer
Transparency is essential for building trust in GenAI outputs.
- Educate users: Provide clear disclaimers about the limitations of GenAI.
- Visual cues: Use markers (e.g., “AI-Generated”) to indicate when outputs lack thorough validation.
- Corporate practices: Encourage cautious interpretation of GenAI-generated content in decision-making.
By incorporating these techniques (HITL, evaluation datasets, self-assessment, specialized metrics, and clear disclaimers), organizations can detect hallucinations caused by GenAI. These practices ensure higher reliability, trust, and efficiency in leveraging AI systems.
The Way Forward: Responsible AI to Build Trust in GenAI
GenAI hallucinations are not merely technological hiccups. They underscore a fundamental truth: GenAI systems are tools, not autonomous experts.
Addressing hallucinations requires adopting responsible AI practices, emphasizing transparency, accuracy, and accountability. By combining robust prompt engineering, regular updates, and human oversight, we can transform GenAI from a risky innovation into a trusted partner.
As practitioners, our mission is to build AI systems that amplify human potential while managing their limitations responsibly.
This approach ensures not only the success of AI initiatives but also their long-term adoption and trustworthiness.
This article is the second in a series. Navigate to the other editions in this series below.
I. Hallucinating Machines: Understanding GenAI Errors
II. Hallucinating Machines: Strategies for Detecting and Mitigating GenAI Mistakes