Have you ever struggled with something that’s supposed to be easy to use? For me, it’s plastic wrappers: they’re designed to be simple to open, yet they always seem to outsmart me. Just yesterday, I spent at least 10 minutes wrestling with the plastic seal on a vanilla extract bottle before finally resorting to scissors.
Using Generative AI (GenAI) can sometimes feel the same way. It appears straightforward: just ask a question or have a conversation. But the outputs can include mistakes or hallucinations that are hard to detect.
In the first article of this series, we explored the different types of hallucinations and why they matter. Now, let’s dive into how to identify and mitigate them effectively.
Strategies for Mitigating Hallucinations
While no GenAI model is currently immune to hallucinations, several strategies can significantly reduce their occurrence and impact.
- Prompt Engineering: Much like turning to scissors to handle “easy-to-open” plastic wrappers, better prompting can significantly improve the accuracy and effectiveness of AI responses.
- Few-Shot Learning: Few-shot learning is an in-context technique where the prompt provides structured examples to guide the model towards more precise and relevant responses. This method can significantly enhance accuracy by setting expectations for the format and logic of the response.
Let’s begin with a prompt that does not use the technique:
Prompt:
I want something to wash my cashmere sweater, which product should I use? These are the ones available: Blender Pro 2024 edition, Best Wash 3.0, and Rice Cooker Ultra. Output none if the task does not match.
Response:
None of the products you listed are appropriate for washing a cashmere sweater. […]
While this response seems acceptable at first glance, it overlooks that Best Wash 3.0 could potentially be the correct choice for washing clothes, including a cashmere sweater.
Now, let’s apply few-shot learning by adding structured context to the prompt:
Prompt:
I want something to wash my cashmere sweater, which product should I use? These are the ones available: Blender Pro 2024 edition, Best Wash 3.0, and Rice Cooker Ultra. Output none if the task does not match.
Q: What is helpful to make a smoothie? R: Blender Pro 2024 edition ###
Q: Which appliance can I use to cook rice? R: Rice Cooker Ultra ###
Q: What appliance is best to wash clothes? R: Best Wash 3.0.
Response:
Q: What is helpful to wash my cashmere sweater? R: Best Wash 3.0
This time, the model provided the correct answer, recognizing Best Wash 3.0 as the appropriate choice. It also adhered to the structured question-and-answer format provided in the prompt, producing a concise and clear response.
Now, let’s test this technique with a scenario where no appropriate option exists:
Prompt:
I want to buy an appliance from a store, which one should I buy? These are the ones available: Blender Pro 2024 edition, Best Wash 3.0, and Rice Cooker Ultra. Output none if the task does not match.
Q: What is helpful to make a smoothie? R: Blender Pro 2024 edition ###
Q: Which appliance can I use to cook rice? R: Rice Cooker Ultra ###
Q: What appliance is best to wash clothes? R: Best Wash 3.0. ###
Q: Which appliance should I use to bake a cake? R:
Response:
None
The model correctly concluded that none of the listed appliances are suitable for baking a cake. By including explicit examples in the prompt, few-shot learning not only improved the model’s accuracy but also ensured its adherence to the specified response format.
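For readers who want to reproduce this, here is a minimal sketch of how such a few-shot prompt could be assembled and sent programmatically. It assumes an OpenAI-style chat API and an illustrative model name; the product catalogue and examples mirror the ones above.

```python
# Minimal sketch of few-shot prompting, assuming the OpenAI Python SDK and an
# illustrative model name. The examples mirror the product catalogue above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT_EXAMPLES = [
    ("What is helpful to make a smoothie?", "Blender Pro 2024 edition"),
    ("Which appliance can I use to cook rice?", "Rice Cooker Ultra"),
    ("What appliance is best to wash clothes?", "Best Wash 3.0"),
]

def build_few_shot_prompt(question: str) -> str:
    """Prepend structured Q/R examples so the model mirrors their format."""
    header = (
        "Available products: Blender Pro 2024 edition, Best Wash 3.0, "
        "Rice Cooker Ultra. Output None if the task does not match.\n"
    )
    examples = " ###\n".join(f"Q: {q} R: {r}" for q, r in FEW_SHOT_EXAMPLES)
    return f"{header}{examples} ###\nQ: {question} R:"

prompt = build_few_shot_prompt("What is helpful to wash my cashmere sweater?")
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; use whichever model you have access to
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```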
Chain of Thought Reasoning
This approach, introduced in a 2022 paper by Google researchers, involves prompting the model to break down its “reasoning” into sequential steps. Encouraging this structured thinking can significantly reduce errors in reasoning-based tasks.
Let’s illustrate this with an example that doesn’t use the technique:
Prompt:
Where should I forward this email to? The choices are marketing@mycompany.com, ceo@mycompany.com, smith@mycompany.com, spam@mycompany.com.
Subject: Urgent
Body: Helo, it’s Smith. I lost my phone. Can you put me in touch with the CEO?
Response:
The best address to forward this email to is ceo@mycompany.com. […]
Unfortunately, this response is incorrect. The email is clearly a scam; note the misspelling of “Helo,” a common red flag.
Now, let’s apply the chain-of-thought reasoning technique by adding “think step by step” to the prompt:
Prompt:
Where should I forward this email to, think step by step? The choices are marketing@mycompany.com, ceo@mycompany.com, smith@mycompany.com, spam@mycompany.com.
Subject: Urgent
Body: Helo, it’s Smith. I lost my phone. Can you put me in touch with the CEO?
Response:
Let me analyze this step by step:
1) First, check the sender’s claim: […]
2) Look for red flags: […]
3) Consider the risk: […]
Looking at the available forwarding options: […]
Conclusion: This email shows classic signs of a potential scam and should be forwarded to spam@mycompany.com for proper handling and investigation.
By simply adding “think step by step” to the prompt, the model produced a correct and thorough response. While this approach results in longer outputs, the improvement in accuracy is worth the extra detail.
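As a small sketch, the same idea can be wrapped in a helper that appends the trigger phrase and asks for a clearly marked conclusion; the “Conclusion:” convention is an assumption on my part that makes the final answer easy to extract.

```python
# Sketch of zero-shot chain-of-thought prompting: append a "think step by step"
# instruction and request a marked conclusion so the final answer is easy to parse.
def with_chain_of_thought(task: str) -> str:
    return (
        f"{task}\n"
        "Think step by step, then state your final answer on a line "
        "starting with 'Conclusion:'."
    )

def extract_conclusion(model_output: str) -> str:
    """Return the text after the last 'Conclusion:' marker, if present."""
    marker = "Conclusion:"
    if marker not in model_output:
        return model_output.strip()
    return model_output.rsplit(marker, 1)[-1].strip()

prompt = with_chain_of_thought(
    "Where should I forward this email to? The choices are marketing@mycompany.com, "
    "ceo@mycompany.com, smith@mycompany.com, spam@mycompany.com."
)
```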
Structured Prompting
This approach involves providing a clear template or outline for information formatting, guiding the model to extract and organize relevant data effectively. This technique is especially useful when dealing with lengthy or unstructured texts.
Prompt:
Extract the following information from the following text, Discovery information and Properties. Follow the format mentioned. Properties should be internal characteristics.
<copy the full text from wikipedia into this prompt, in this case: https://en.wikipedia.org/wiki/Blender>
Response:
Discovery Information:
The blender, also known as a mixer or liquidiser in British English, is a kitchen and laboratory appliance.
It was first designed by the Polish-American chemist Stephen Poplawski in 1919 and patented in 1922. […]
Properties (Internal Characteristics):
Stationary Blender: Consists of a blender container with a rotating metal or plastic blade at the bottom. […]
Immersion Blender: Hand-held device with a mixing head and rotating blades that can be immersed in any container. […]
General Properties: Features: large visible measurement marks, ease of use, low noise, power usage (300–1000 watts), ease of cleaning, pulse blending option. Applications: […]
This response demonstrates how the model accurately extracts and organizes relevant details using the provided template. The information is categorized under “Discovery Information” and “Properties,” ensuring clarity and adherence to the instructions.
By explicitly defining the output format, the model efficiently navigates complex or lengthy text, delivering concise and structured results.
This makes the technique particularly powerful for extracting specific insights or organizing data from expansive content sources.
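A minimal sketch of how such a template could be defined in code is shown below; asking for JSON output is an extra assumption that simplifies parsing, and the field names follow the example above.

```python
import json

# Sketch of structured prompting: the template fixes the output layout (here as
# JSON, an assumption that simplifies parsing) before the source text is appended.
EXTRACTION_TEMPLATE = """Extract the following information from the text below.
Respond with JSON only, using exactly these keys:
{{
  "discovery_information": "<who invented it, when, and how it developed>",
  "properties": ["<internal characteristic 1>", "<internal characteristic 2>", "..."]
}}

Text:
{source_text}
"""

def build_extraction_prompt(source_text: str) -> str:
    return EXTRACTION_TEMPLATE.format(source_text=source_text)

def parse_extraction(model_output: str) -> dict:
    """Fails loudly if the model drifted from the requested format."""
    return json.loads(model_output)
```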
Model Improvements
Model Updates
Regularly updating models to the latest version gives access to improved performance, as LLMs are continually refined. However, these updates often come with little clear communication about what has changed.
So, when should a model be updated?
- When the model is being deprecated: To avoid disruptions in functionality.
- When the knowledge cutoff is outdated: If the model’s training data does not meet the requirements of current use cases.
- When benchmarks for relevant tasks improve: Updated models often show better performance in specific applications.
Retrieval Augmented Generation (RAG)
Enhancing GenAI’s accuracy by integrating domain-specific data is a game-changer. One effective method is Retrieval-Augmented Generation (RAG), which involves two key phases:
- Ingestion: domain documents are split into chunks, converted into embeddings, and stored in a vector database.
- Question answering: the question is embedded and sent to the vector database to find the closest relevant chunk(s); those chunks are inserted into a prompt template, and the resulting prompt is submitted to the LLM.
The factual hallucination example of part 1 of this series was most likely caused by a GenAI model lacking up-to-date knowledge about Jupiter’s moons. To fix this, I can upload the relevant Wikipedia page to fill this gap for the model to accurately answer, “How many moons does Jupiter have?”
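A minimal, self-contained sketch of both phases is shown below. The “embedding” is a toy word-count vector and the “vector database” is a plain Python list; a real system would use an embedding model and a dedicated vector store, but the flow is the same.

```python
# Toy sketch of RAG: ingestion (chunk + embed + store) and question answering
# (embed the question, retrieve the closest chunks, build the prompt).
# The word-count "embedding" and list-based "store" are stand-ins only.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (replace with a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Phase 1: ingestion - split the document into chunks and store (chunk, embedding) pairs.
document = "<paste the text of the relevant Wikipedia page here>"
chunks = [document[i:i + 500] for i in range(0, len(document), 500)]
vector_store = [(chunk, embed(chunk)) for chunk in chunks]

# Phase 2: question answering - retrieve the closest chunks and build the prompt.
question = "How many moons does Jupiter have?"
query_embedding = embed(question)
top_chunks = sorted(vector_store, key=lambda item: cosine(query_embedding, item[1]), reverse=True)[:2]
prompt = (
    "Answer the question using only the context below.\n\nContext:\n"
    + "\n---\n".join(chunk for chunk, _ in top_chunks)
    + f"\n\nQuestion: {question}"
)
# `prompt` is then submitted to the LLM of your choice.
```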
Fine-Tuning
General-purpose models, trained on broad datasets, may falter on specialized tasks. Fine-tuning involves further training the model on targeted training and validation datasets so that its output aligns with specific requirements.
For example, fine-tuning ChatGPT requires preparing custom datasets, which can be implemented through OpenAI’s fine-tuning guide.
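As a sketch, fine-tuning data for a chat model is typically prepared as a JSONL file of example conversations, along the lines below; the records are illustrative, and the exact upload and job-creation steps should be taken from OpenAI’s fine-tuning guide rather than from this snippet.

```python
# Sketch of preparing a fine-tuning dataset in the chat-style JSONL format
# described in OpenAI's fine-tuning guide. The example records are illustrative.
import json

training_examples = [
    {"messages": [
        {"role": "system", "content": "You recommend products from our catalogue only."},
        {"role": "user", "content": "What should I use to wash a cashmere sweater?"},
        {"role": "assistant", "content": "Best Wash 3.0"},
    ]},
    {"messages": [
        {"role": "system", "content": "You recommend products from our catalogue only."},
        {"role": "user", "content": "Which appliance should I use to bake a cake?"},
        {"role": "assistant", "content": "None"},
    ]},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")
# The file is then uploaded and a fine-tuning job created, following the guide.
```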
When to consider fine-tuning:
- If prompting fails to yield consistent results.
- If the problem is well-defined and supported by stable data and schema.
Model Routing
Some models excel in specific tasks more than others. Model routing leverages a system, such as an LLM-based router, to select the most appropriate model for a given prompt.
When the router is incorporated into a workflow, it typically works as follows (a sketch of this pattern follows the list):
- Analyze the prompt to determine the user’s intent.
- Route the request to the most suitable model for that intent (whether a GenAI model or not).
- If no match is found, trigger a default behavior (akin to the default option of a “switch statement” in programming).
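Here is a minimal sketch of that pattern; the keyword-based intent detection stands in for an LLM-based router, and the handler names are hypothetical.

```python
# Sketch of model routing: detect intent, dispatch to the most suitable handler,
# and fall back to a default (the "switch statement" default case).
# The keyword matching stands in for an LLM-based router; handlers are hypothetical.
from typing import Callable

def detect_intent(prompt: str) -> str:
    text = prompt.lower()
    if "translate" in text:
        return "translation"
    if any(word in text for word in ("average", "total", "sum of")):
        return "analytics"
    return "unknown"

ROUTES: dict[str, Callable[[str], str]] = {
    "translation": lambda p: f"[translation model handles: {p}]",
    "analytics": lambda p: f"[non-GenAI analytics pipeline handles: {p}]",
}

def route(prompt: str) -> str:
    handler = ROUTES.get(detect_intent(prompt))
    if handler is None:
        return "No suitable model was found for this request."  # default behavior
    return handler(prompt)

print(route("Translate this sentence into French."))
```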
Reinforcement Learning with Human Feedback (RLHF)
RLHF is a feedback mechanism that integrates human assessments and prioritizes the alignment of GenAI outputs with user expectations.
How it works:
- Humans interact with models and provide feedback on how satisfactory or well-aligned the responses are.
- This feedback is used to improve future iterations of the GenAI model.
When using Google’s Gemini AI, users can rate responses with options like “Good suggestion” or “Bad suggestion.” This feedback loop helps Google continuously refine the AI’s performance based on user satisfaction.
By combining these techniques (model updates, RAG, fine-tuning, model routing, and RLHF), organizations can minimize inaccuracies and inefficiencies.
Verification and Oversight
Mistakes made by GenAI can be so monumental as to undermine the productivity gains from its use, especially when deployed autonomously in companies. Moreover, these errors can be challenging to detect, particularly at scale.
Here are key techniques for ensuring GenAI outputs are reliable and accurate:
Human-in-the-Loop (HITL)
Human-in-the-loop (HITL) is a straightforward concept. It involves assigning a person to review and verify the outputs of a GenAI system.
Most GenAI errors can be identified and corrected by a human reviewer.
Acting as a HITL does not require technical expertise in AI, just the ability to evaluate the accuracy and relevance of the AI’s output. As AI technology continues to evolve and its reliability improves, the need for HITLs may diminish over time.
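One lightweight way to wire a reviewer into a pipeline is a simple review queue that holds GenAI outputs until a person approves them; the sketch below is a generic pattern, not any specific product’s workflow.

```python
# Sketch of a human-in-the-loop gate: generated outputs wait in a review queue
# and are only released once a human reviewer approves them.
from dataclasses import dataclass

@dataclass
class ReviewItem:
    prompt: str
    output: str
    approved: bool | None = None  # None means still pending review
    reviewer_note: str = ""

review_queue: list[ReviewItem] = []

def submit_for_review(prompt: str, output: str) -> ReviewItem:
    item = ReviewItem(prompt=prompt, output=output)
    review_queue.append(item)
    return item

def record_decision(item: ReviewItem, approved: bool, note: str = "") -> None:
    item.approved = approved
    item.reviewer_note = note

def released_outputs() -> list[str]:
    return [item.output for item in review_queue if item.approved]
```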
Evaluation Datasets
As mentioned in the fine-tuning section, GenAI outputs can be fact-checked in real time using algorithms or external databases. The most common approach is to compare outputs against an evaluation dataset.
This approach works best when the task is highly specific, as broader applications may lack a clear dataset for comparison.
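A small sketch of this comparison is shown below; the exact-match scoring after light normalization is an assumption that fits narrow tasks, and real evaluations often need fuzzier matching.

```python
# Sketch of scoring model outputs against an evaluation dataset using exact
# match after light normalization. The dataset and scoring rule are illustrative.
from typing import Callable

EVAL_SET = [
    {"question": "What appliance is best to wash clothes?", "expected": "Best Wash 3.0"},
    {"question": "Which appliance should I use to bake a cake?", "expected": "None"},
]

def normalize(text: str) -> str:
    return text.strip().lower().rstrip(".")

def evaluate_model(generate: Callable[[str], str]) -> float:
    correct = sum(
        normalize(generate(row["question"])) == normalize(row["expected"])
        for row in EVAL_SET
    )
    return correct / len(EVAL_SET)

# Usage: accuracy = evaluate_model(my_llm_call)
```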
LLM Self-Assessment
This method uses two LLMs to enhance oversight:
- The generator: Produces the answers.
- The critic: Reviews and critiques the generator’s output.
This division of roles mirrors chain-of-thought reasoning but separates tasks to enable independent and unbiased assessment.
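A minimal sketch of the generator/critic split, again assuming an OpenAI-style chat API; the role instructions and model name are illustrative.

```python
# Sketch of LLM self-assessment: one call generates an answer, a second call
# critiques it. The model name and role instructions are illustrative.
from openai import OpenAI

client = OpenAI()

def ask(role_instruction: str, content: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": role_instruction},
            {"role": "user", "content": content},
        ],
    )
    return response.choices[0].message.content

question = "How many moons does Jupiter have?"
answer = ask("You are a careful assistant.", question)
critique = ask(
    "You are a strict reviewer. Point out factual errors or unsupported claims.",
    f"Question: {question}\nProposed answer: {answer}",
)
```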
Specialized Metrics
Even before the rise of GenAI, metrics like BLEU and ROUGE were widely used to evaluate the quality of generated text in Natural Language Processing (NLP). These methods are faster and more cost-effective than using an LLM to verify outputs, and they provide an additional signal for assessing validity.
Having multiple evaluation options is always beneficial.
The traditional metrics BLEU and ROUGE primarily focus on surface-level word overlap, measuring n-gram overlap between the generated text and a reference (precision-oriented for BLEU, recall-oriented for ROUGE). However, they don’t account for the underlying meaning of the text.
BERTScore, introduced in 2019, leverages a deep learning model (BERT) to better capture semantic meaning and similarity between texts. It compares a reference text with a candidate text, providing a more nuanced evaluation of semantic understanding.
Similarly, entailment models assess whether one sentence logically implies another, offering a sophisticated metric for evaluating outputs, especially when semantic accuracy is critical.
Combining multiple metrics ensures a more robust assessment of the AI’s outputs, especially when reference material is available.
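As a sketch, these scores can be computed with the Hugging Face evaluate package (assuming it and the underlying metric dependencies are installed); the candidate and reference strings are illustrative.

```python
# Sketch of scoring a candidate against a reference with ROUGE, BLEU, and
# BERTScore via the Hugging Face `evaluate` package (assumed installed along
# with its metric dependencies). Strings are illustrative only.
import evaluate

candidate = ["Jupiter has dozens of confirmed moons."]
reference = ["Jupiter has many confirmed moons orbiting it."]

rouge_scores = evaluate.load("rouge").compute(predictions=candidate, references=reference)
bleu_scores = evaluate.load("bleu").compute(predictions=candidate, references=[reference])
bert_scores = evaluate.load("bertscore").compute(
    predictions=candidate, references=reference, lang="en"
)

print(rouge_scores["rougeL"], bleu_scores["bleu"], bert_scores["f1"][0])
```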
Disclaimer
Transparency is essential for building trust in GenAI outputs.
- Educate users: Provide clear disclaimers about the limitations of GenAI.
- Visual cues: Use markers (e.g., “AI-Generated”) to indicate when outputs lack thorough validation.
- Corporate practices: Encourage cautious interpretation of GenAI-generated content in decision-making.
By incorporating these techniques (HITL, evaluation datasets, self-assessment, specialized metrics, and clear disclaimers), organizations can detect hallucinations caused by GenAI. These practices ensure higher reliability, trust, and efficiency in leveraging AI systems.
The Way Forward: Responsible AI to Build Trust in GenAI
GenAI hallucinations are not merely technological hiccups. They underscore a fundamental truth: GenAI systems are tools, not autonomous experts.
Addressing hallucinations requires adopting responsible AI practices, emphasizing transparency, accuracy, and accountability. By combining robust prompt engineering, regular updates, and human oversight, we can transform GenAI from a risky innovation into a trusted partner.
As practitioners, our mission is to build AI systems that amplify human potential while managing their limitations responsibly.
This approach ensures not only the success of AI initiatives but also their long-term adoption and trustworthiness.
This article is the second in a series. Navigate to the other editions in this series below.
I. Hallucinating Machines: Understanding GenAI Errors
II. Hallucinating Machines: Strategies for Detecting and Mitigating GenAI Mistakes