How to test AI | How to test LLMs?

4 min readFeb 15, 2025

Testing AI systems is not about finding defects, but more about reducing biasness and crafting well regarded responses. More specifically preparing it for real-world usage!

AI has taken tech industry through a storm of trying new possibilities. More precisely LLM (Large Language Models) have lots of ongoing concerns about the quality. Hence testing is a critical phase before releasing a AI product.

Some of the top issues I have identified during evaluations of some of the top AI models like DeepSeek, Gemini, GPT 4o, Claude-Sonnet:

Biasness: example- cultural biasness, data drift biasness
Authenticity: example- information provided may not be correct
Non-Deterministic: example- output changes even when input prompt is same
Incorrect Assumptions: example- no information about context used while preparing output

We will discuss in detail about the above mentioned issues, but let’s first understand what kind of testing systems can be used for evaluating LLM?

Metamorphic Testing
Adversarial Testing
Exploratory Testing
Red Teaming Tests

Metamorphic Testing (MT)

Identifies relationship between the input and output. As stated above, LLMs suffers through non-deterministic behavior which makes challenging to test them.
Example: Given an input we expect this output.

The above example kind of testing cannot be implemented for LLMs. Hence we need to find relationship between input & output. This process is known as metamorphic relationships. These relationships are logical mappings created between input & output. As when input changes the output should change with a appropriate predictable relationship.

If there are deviations than obviously we need to re-evaluate the AI algorithm, data used to train and other assumpations (parameters) used while generating the final output!

Example for metamorphic testing:

For effective testing using MT, consider following:

Identify relationships: Use inputs and match the output

Training data: Understanding the training data gives better context

Model understanding: Have context knowledge about “what is expected from the AI model?”

Adversarial Testing

Adversarial testing is a method for systematically evaluating an ML model with the intent of learning how it behaves when provided with malicious or inadvertently harmful input.

What are the key benefits of conducting Adversarial Testing?

Systematically improve models and products by exposing current failure patterns
Product decisions by assessing alignment to safety product policies and by mitigating them.

Prompting Techniques which can be used for Test Data Setup

Direct Prompting (Zero shot): is the simplest type of prompt. It provides no examples to the model, just the instruction. You can also phrase the instruction as a question

Provide:

Instruction
Some context

Example: Generate a list of Top tourist places in Germany and the foods to have there?

You can read my detailed analysis about Adversarial Testing here.

Red Teaming Tests

Red teaming is a security and testing practice can be used for generative AI systems. This approach involves a team of experts trying to identify and exploit vulnerabilities in the AI system to reveal potential risks and harms.

Primary goals are to prevent:

Data leakage
Robust against attacks

There can be tests which can utilise basic prompting to validate the AI system does not cause any potential reputation damage or shares internal data or leads to security vulnerabilities. We can call such tests as benign persona.

Now there can be adversarial persona: which can use techniques to break the system or cause AI to misbehave by using prompt hacking, token overflowing, introducing biasness etc.

Some of the errors or issues which can be found are listed below:

AI can expose operational mechanisms and programming used
Leaking personal user data
Accessing unintended sources
Hijacking AI to perform unauthorized actions
Biasness breaking societal norms

Exploratory Testing

With learning all the above practices, human knowledge can be used to build context, creativity, common-sense which cannot be possessed by AI systems.

We can build a cheatsheet to assess the AI systems like below:

Semantic consistency. The meaning remains consistent across different scenarios. Image or video output aligns with text input.

Style preservation. Output maintains original tone and style.
Factual consistency. Information remains accurate and truthful.
Contextual appropriateness. Response fits the context of the conversation.
Handle edge cases gracefully. The system handles unusual or extreme inputs without errors.

Conclusion

We can use and learn different techniques as mentioned above but the primary step before the execution should be understanding the intent of the AI system and the data used for its traning. This can build context and expectations for further analysis and testing

You can refer my AI course for QA & SDETs here: use this link to read more about the course

-x-x-