The Art of Testing LLMs: Why It Matters, Common Data Challenges, and Proven Testing Strategies

May 09th, 2025

32 Mins

Watch Now

Listen On

Vyasaraj Padakandla (Guest)

Practice Head - Digital Assurance,
Canarys

Kavya (Host)

Director of Product Marketing,
LambdaTest

The Full Transcript

Kavya (Director of Product Marketing, LambdaTest) - Hi, everyone. Welcome to another exciting session of the LambdaTest XP Podcast Series. I'm your host Kavya, Director of Product Marketing at LambdaTest, and it's a pleasure to have you all with us today. We have got a truly fascinating topic today in our discussion, which is on the art of testing LLM models. Before we dive in, let me introduce you to our guest on the show, Vyasaraj Padakandla, Practice Head - Digital Assurance, Canarys.

For over two decades, Vyas has been at the forefront of QA transformation initiatives. He's led major testing efforts across multiple industries by embracing cutting-edge technologies, driving processes, as well as re-engineering and building capabilities across teams.

In today's session, we are going to unpack a variety of critical teams, starting with why testing LLMs is non-negotiable if we want our AI systems to be reliable and responsible. So before we get started, let me hand it over to Vyas so that he can walk us a bit and share about his journey in the quality assurance space. Vyas, over to you.

Vyasaraj Padakandla (Practice Head - Digital Assurance, Canarys) - Thanks, Kavya, thanks for introducing me. Hello folks, my name is Vyasaraj. I'll popularly go as Vyas in my friends' and peer circle. I have 20+ years of experience in various areas of the software industry, particularly 20 years in the software testing field with enormous experience in testing ERP products and ISVs across different domains, such as the banking sector, financial sector, capital markets, and healthcare and supply chain management.

In Canarys, I have been working with Canarys for the last three years and I'm heading as Canarys Digital Assurance Practice with a head force of 50+ resources. And we have about 15 ongoing projects across all domains. We also offer solutions as well as services in the software testing industry. And I'm leading the transformation of digital assurance in terms of from traditional testing to AI-driven testing. That's the exciting part that I'm handling at the Kenneris currently. So that's a smaller brief introduction about myself.

Kavya (Director of Product Marketing, LambdaTest) - Thank you, Vyas. That's a pretty impressive career and journey that you've had in the testing space. Yeah, I'm sure that you would be able to bring in a lot of those insights into today's conversation. So let's start with the first question itself, which is basically, what makes testing LLMs fundamentally different from testing traditional software applications?

Vyasaraj Padakandla (Practice Head - Digital Assurance, Canarys) - Yes. So the biggest difference that we can say in testing the LLMs, than traditional software or regular testing, is the scope. In regular testing, you can draw a scope. Whereas in LLM testing, there is no scope. It will be difficult to draw a line and say this is the scope of the testing because of its challenges, it's a different nature, it's different in catering the services, and different in generative answers.

So it is always 30-40% more complex in terms of testing than the regular testing. So this is about the high level. But when we are talking about the areas where the testing is going to be difference is that regular testing is nothing but deterministic testing, where you know what the output should be. You can always determine the outputs of the regular testing. Whereas in LLM testing, it is a probabilistic testing.

The outputs may vary across the different warrants, making the reproducibility and the test cases validation very difficult. That is one area. Another area what we can say is that the defining the correctness when we are saying a defining the correctness in regular testing we know that 2 plus 2 is equal to a 4 that is the correct validity we can say when we are executing any test cases with an output saying 2 plus 2 is equal to 4 and if we output is 2 plus 2 is equal to 4 then we can say this is correct but it is not the case with LLM.

The LLMs will have multiple valid concepts, making it hard to define the correctness only objectively. So that is the biggest challenge. difference, what I can say is the difference and the challenge, also to define the testing, you know, between the LLMs and PEDLAR testing. Another testing is about how much we can use automation and how much we can use human evaluation.

Regular testing, the test cases can be automated with clear assertions. But in LLM testing, the human-in-the-loop evaluation for coherence, relevance, and bias assessment is pretty much required. That is one area where LLM is different than the regular testing. Handling the edge cases is another area where it differs.

Hedges are all predefined in the systematically generated regular test cases, meaning they are systematically tested in the regular testing, whereas in LLM testing, they are often discovered dynamically since the language inputs are infinite in variations. So that is one area.

As I said in the beginning, test coverage is another big area, which is what I discussed about debugging and root cause analysis in regular testing involving tracing logs, fixing the broken logic, or debugging the code execution. But in LLM, errors are harder to pinpoint since the model works as a black box with no explicit logic. So these are some of the differences between regular testing as well as LLM testing. There are many others, but these are the high-level or important ones that can be considered.

Kavya (Director of Product Marketing, LambdaTest) - Thank you so much, Vyas, for breaking that down. I think it has become pretty clear for our audience how essentially testing LLM is different from traditional software applications. And as you were saying it, I was also making note of it, what kind of challenges you sort of face. And as you said, it's more than challenges, it's the differences between the two types of testing that's basically, yeah.

Vyasaraj Padakandla (Practice Head - Digital Assurance, Canarys) - Yes. The challenge is in identifying what is the difference between the two. Because in the scenarios, it appears like both are same. But when we are on the front, when we are in the battlefield, the difference what we make out is different. It's positive.

Kavya (Director of Product Marketing, LambdaTest) - Absolutely. No, absolutely. And also what is interesting for me is what you mentioned about how it is important to keep humans in the loop when we have to automate our LLMs. Whereas when you are automating regular test cases, that's pretty much automated. So this question about AI and humans coexisting, I mean, that has basically been answered through this question itself.

Vyasaraj Padakandla (Practice Head - Digital Assurance, Canarys) - Right. Right.

Kavya (Director of Product Marketing, LambdaTest) - Great. You know, moving on to the next question, which is based on these challenges, right? What framework do you recommend for creating comprehensive LLM testing strategies from scratch?

Vyasaraj Padakandla (Practice Head - Digital Assurance, Canarys) - Yes, see, the LLMs are deferred, usage of LLMs differs from company to company, organization to organization, and business to business. So, having said that, the framework can also be unique according to the needs of each and every organization, every team, and every type of application they work on.

Yeah, having said that, all these requirements will keep on changing from organization to organization, application to application, domain-specific applications, nature-specific, everything. My recommendation would be any framework that is built on five pillars. The five pillars, our framework, should have a purpose and a scope defined. When I'm saying a purpose and scope are defined, what is the objective of this whole framework that should be used for case identification?

What should be my success criteria rate? What are the thresholds? When I'm saying threshold, is it a factual accuracy threshold, hallucination threshold, or bias threshold? All this threshold has to be clearly defined in our framework. And what types of testing are required in this frame for testing any application?

For example, functional, non-functional, adversarial, as well as exploratory testing or performance testing, etc. And other than this, what kind of domain I catering to? For example, if I'm building my application using LLMs for the medical field, whether my LLM is HIPAA compliant or not, if I'm creating any data-sensitive information, whether it is a GDPR component or not, or if I'm building any AI regulations, whether my LLM is catering to the compliance according to the nation.

For example, if I'm talking about the US. In the US, for example, let's say I'm talking about a California customer, whether it is going to use the California-specific compliances or not. So these kind of purposes has to be clearly defined in the testing framework before we commence the framework. That is first.

The second part would be the creation of test dimension matrix. What is this test dimension matrix? These dimensions are nothing but the areas which you are going to test in your LLM. For example, functional is one dimension. Accuracy is one dimension. Safety and ethics is one dimension of security, performance, user experience, all these are the dimensions. And these dimensions should be clearly defined what is your testing focal point. For example, if the functional testing is there, then what kind of testing focus should be? It should be on prompt outputs, logic flows, basic correctness.

If accuracy is the dimension which you are testing, the tester's focal point should be factual correctness, how hallucinations are being treated over there. So all these things have to be connected. That is how we have to build up all of our tested dimensional matrix based upon our application requirements. Then the third is infrastructure and tooling. What type of infrastructure and tooling?

What kind of prompt testing tooling do I have to use, whether I'm going to use open source, or I'm going to build my own prompt tooling, or I'm going to use licensed versions? Then what about my automation pipelines? How am I going to integrate my pipelines with other automation, continuous integration and continuous development?

What about the version control? How am I tracking my prompt and model versions? What is an evaluation dashboard, such as a KPS, to track my hallucination rates? What are my toxicity rates? What are my latency rates? All these things need to be defined well in advance. Then what are my evaluation methodologies? Whether it is a quantitative matrix, human evaluation, or a feedback loop, or AB and A and B testing.

So these are the evaluation methods. And then last but not the least, the governance monitoring regulation. What should be the audit logs? How audit should be happened? What should be my drift detection? And what is prompt of regression testing? What is my compliance checks with respect to specific compliance authorities or specific area air regulations? So all these things should be documented well before we start testing LLMs.

Kavya (Director of Product Marketing, LambdaTest) - Thank you so much, Vyas. I think that absolutely gives us more clarity on how to approach when it comes to creating that strategy in itself. And I think we have to, in conclusion, it's like we have to take care of multiple aspects, including the functional aspects, the challenges, in fact, as you mentioned, thinking about those adversarial-related challenges that would come in and ensuring that you're creating a more robust system in place or a foundation in place before you tackle it further. Moving on to the next question, how frequently should LLMs be retested after deployment, and what triggers should prompt additional testing in itself?

Vyasaraj Padakandla (Practice Head - Digital Assurance, Canarys) - So basically this frequency that depends on the type of application, how frequently your application is getting upgraded, how frequently you are introducing new prompts into LLMs, how frequently you are upgrading or enhancing your safety, and the ethical considerations. So all these things will matter. But typically, what we suggest here is that it should be a daily or a CI pipeline, continuous integration pipeline frequency.

The reason for this is that we can perform a smoke test of the critical prompts such as basic sanity, toxicity, and latencies, et cetera. These kind of things can be tested on daily basis so that we can test it and we can fix that issue. next thing, frequencies, they should be weekly or bi-weekly depending upon your frequency of your modifications prompt to rig, you know, this to have your regression should to be tested to test your drift detection and how you frequently you track your hallucination so that it is better to perform these kind of iterations weekly or bi-weekly. And in monthly iterations, we generally suggest to test accuracy, fairness, performance, prompt effectiveness, all these things.

And quarterly, it would be basically on, you know, audit and compliance checks. So these are the frequencies that we generally suggest, but other than these frequently, there will be a certain event triggered frequencies, we can say frequencies. So whenever there is an update to the model, that at that time we can perform a full testing. Whenever there is a changes to our prompt and the instruction changes, then we can do a changes. Whenever we added dot data or a knowledge base updates.

For example, when we are adding a pre-trained data to our LLM Sunday, everything, that time it is a good time to perform a testing for whole LLM models. And whenever we see a performance degradation, and I'm saying the performance degradation, when you ask for a question to the LLM and when you key in a prompt it is taking a lot of time for you, then you see that performance degradation has happened. So that is the time where we can plan for a robust performance testing.

And we can come up with the metrics and we can fix the challenges what has been arised. Frequently, we can do it for whenever we want to do a security or compliance events. So these are the regular frequencies where we can keep on our LLMs testing with various tools, techniques, and matrices for our evaluation.

Kavya (Director of Product Marketing, LambdaTest) - Thank you so much, Vyas. And what techniques work best according to you when it comes to identifying and mitigating biases in LLM test data sets?

Vyasaraj Padakandla (Practice Head - Digital Assurance, Canarys) - Okay, so there are many techniques which we can adopt and techniques cannot be, it is not something predictive. It will be changing from the time to time or from the context to context. But broadly in LLM testing, what we perform is that one technique such as that bias audits using probing prompts.

For example, that when we are testing any gender kind of thing, for example, we ask nurse, the nurse said kind of thing, the race and ethnicity test we can perform, social economic statuses, nationality, and disability, et cetera. All these prompts can be audited using manual intervention kind of thing.

And the next technique what we can adopt is representation analysis. This is a quantitatively check for a demographic coverage and distribution of geography, gender, age, and cultural attributes to test it. And another important technique what we can do is the disparate output evaluation. That is one which we can do that. And then toxicity and bias scoring tools. are some important thing is the crowd source.

Nowadays crowd source, the SME-DV is one of the important technique because what happens in crowd size it will get a lot of ideas, lot of context into the testing board when we approach for the crowd. So that would be a good technique to test our LLMs. But of course, it would be practically a little bit complex in handling the testing of this technique. But it is one of the good testing technique method measures that can handle it properly.

Kavya (Director of Product Marketing, LambdaTest) - So that definitely gives us a lot of insights into how we can sort of mitigate the biases in itself. Moving on to the next question, what level of test automation is realistic for LLM testing today and which aspects still require human judgment? I think you primarily covered it in a way. Yeah.

Vyasaraj Padakandla (Practice Head - Digital Assurance, Canarys) - So this is very controversial question because many industry people says that everything can be automated but I differ with that context. Particularly not only in LLMs but even in the conventional or regular testing aspects also everything cannot be automated. But in LLMs automation is having certain limitations.

We can automate those areas where we have we can do a repetitive job or it can be done without manual. For example, regression testing. Regression testing can be automated as well as the toxicity bias, safety scanning. This can be automated. Latencies, particularly when we are testing the performance, testing the latencies, token usages can be automated and formatting for conventional users.

For example, when you are validating any JSON files or XML files in terms of formats. So these formats can be automated because they will not be differing anything. So these are some of the areas where we can automate. But still, there are many areas where automation is not possible.

So those areas are nothing but you have no factual accuracy. When I'm saying factual accuracy, accuracy means it's about testing the hallucinations. Hallucinations, we cannot really automate it. It has to be tested with human intelligence and human manual only. Then fluency, coherence, and tone.

For example, if my tone is not correct, that can not be automated using automation scripts. It has to be done using manual intervention or any recording type of thing which can be used in a different, different tones, especially when I'm talking about phonetic languages like Mandarin, kind of thing. So that cannot be automated. It should be manually only.

The third thing is ethical and cultural sensitivity. These are the areas, creative use cases like storytelling, copywriting, these are the areas where we can do that. The thumb rule is that we can automate any areas that are quantitative, repeatable, and scalable checks. And we cannot automate those areas that have qualitative, contextual, and ethical reviews for humans.

Kavya (Director of Product Marketing, LambdaTest) - Understood, that makes a lot of sense. Yes. And I think it is also something that our QE leaders out there listening to this can also sort of implement within their organizations. And how should testing protocols be designed to specifically identify potential ethical issues in LLM outputs?

Vyasaraj Padakandla (Practice Head - Digital Assurance, Canarys) - Alright. See, designing the testing protocols to identify the ethical issues in LLMs, outputs requires moving beyond traditional QA into the space that combines socio-technical awareness, socio-design, and human review. Having said that, there are three core principles to ethical LLM testing protocols.

One context is the king, second is inclusivity, and third is beyond correctness. When I'm saying the context is a king, an output can be ethical in one use case, but cannot be in another use. So the context here drives the same output. So in some cases, it will be right. Some cases, it is not right.

Okay, so that is why we say context is the king over there. Inclusively, testing must account to the diverse identities backgrounds of different, you know, authenticity and cultures. Third is that beyond correctness, not just right, it is right, it is safe, respectful, and fair. So these are the core principles on which the testing protocols should be defined.

Any protocol, the protocols can be based upon, again, the applications and everything. But these are three main core principles on which the protocol. But in my recommendation, when I say that, so how can I define the protocols? OK. Now define the ethical risk categories. That is one protocol.

Design the diverse probing prompting sets, that is another protocol based on that one. Use adversarial testing, that is another protocol which we can use. Apply a bias and tax toxicity scoring tools, that is another protocol. Human review panels, so these are the important protocols which can be applied on the three core principles, the contextual principles, which I was discussing.

Kavya (Director of Product Marketing, LambdaTest) - Thank you, Vyas. That definitely throws a lot of insights. on the same question itself, just wanted to ask a follow-up question. How should organizations or quality engineering teams align LLM testing practices with emerging regulatory requirements around AI reliability?

Vyasaraj Padakandla (Practice Head - Digital Assurance, Canarys) - See, it is very important that every company or every team aligns with the LLM testing practices with emerging EIA regulations. Because it has to build trustworthy, accountable systems that are ready for legal scrutiny and market expectations, and ethical responsibilities. So if you want to build your, you know, testing organizations to align with the testing practices. There are certain kinds of practices that we need to follow.

One such practice is that map your testing strategy to your AI risk categories. Create risk categories in your organization based upon the domains, based upon the application you do, and map the category testing strategy to that risking category. That is one's best way.

A practice best good practice is to integrate the testing integrate the testing into AI quality management systems, which are similarly to an ISO 9001 management. You can tailor-made your quality management systems. That is how we have done it in our organization. We are a 9001 ISO 9001 certified company.

And we do regular audits, QMS audits for all of our internal projects and services, and solutions that we offer. Now, for AI testing activities, we have added some of the questionnaires by picking up in alignment with all the AI regulations that, have added, we have tailor-made our QMS audit questionnaires in such a way that we can cater to audit AI tools also.

That is one way align with your regulatory testing requirements. This is one of the best practices that we can adopt. And include a specific testing for a compliance area such as bias, fairness, explainability, transparency, robustness, reliability, data privacy, all these things can be in are some of the best practices which can be adopted for any team or any organization for know kind of building the testing practices for testing elements.

Kavya (Director of Product Marketing, LambdaTest) - Thank you so much, Vyas. That definitely helps. And it was also an interesting question, primarily because a lot of teams are currently just starting to test LLMs. So they haven't started thinking about how to sort out the regulatory aspects of it. They haven't started thinking about it. So good that we discussed that today.

Vyasaraj Padakandla (Practice Head - Digital Assurance, Canarys) - Yeah. To answer that question, whatever we are discussing now is just a starting point. This is a starting step that we can take. And based on our application requirements, based on our business requirements, we can go. There is no thumb rule that we should follow this one only. That depends upon the organization to organization, circumstances to circumstances, domain to domain, how we are adopting it, all these things, which is the best way for us. This can be used as a first step in that direction.

Kavya (Director of Product Marketing, LambdaTest) - Yep, absolutely. And moving on to the next question, how should organizations structure their teams to effectively handle testing responsibilities?

Vyasaraj Padakandla (Practice Head - Digital Assurance, Canarys) - So it is again very, very subjective to every organization how they want to because every organization will have their own roles and responsibilities, hierarchies, defined everything. But if you ask me, my suggestion would be what we follow here, my suggestion would be structuring the LLM testing requires a multidisciplinary agile approach that blends the traditional QA with AI expertise.

That is what we are not deviating from our traditional QA. But we are adding our traditional QA with our AI expertise. That is the best. So based upon that, what I can say is that we need to have a testing test architect role. This test architect role is coupled with the traditional testing as well as with AI concepts.

So that he can bring up the AI testing concepts and he can introduce the AI testing and LLM testing methodologies or LLM testing everything in one thing. That is one thing. Then prompt engineers. This prompt engineers responsibility here what we have is that they design the prompt test cases, regression test prompts, test prompts, variants, everything. That is their responsibilities and they will use it, do it while they're testing.

The data QA analyst is one area. AI and ML engineers build or integrate the evaluation tools. Another role is that we have a set of reviewers who will do ethics and fairness reviews. Their responsibility is to ensure he quality what they conduct, including qualitative reviews for bias, stereotyping, safety, tone, and inclusivity. That is one thing. It is a completely manual rule. And something is a domain SME.

For example, if we are building any product or if you are building any application for a healthcare city, healthcare or BFSI or anything, we'll have a domain expertise who can give us more domain knowledge in terms of coming up and developing our training of the data, pre-trained data for our LLMs kind of thing. And of course, a compliance and risk officer.

So this would be kind of an ideal team. But again, if you have any, no bandwidth, then you can go for a red team, which are adversarial team members, who can probe into your model weaknesses via prompt injections. So these are the team structures that I would recommend to have for any organization or any team that wants to start with LLM testing.

Kavya (Director of Product Marketing, LambdaTest) - Super insightful. Thank you so much for sharing those tips and insights for the quality engineering managers who might be listening in. And we are on to the last question of the day, which is, what final advice would you give to organizations that are just beginning to implement rigorous LLM testing practices?

Vyasaraj Padakandla (Practice Head - Digital Assurance, Canarys) - See, LLM or A is new in the market and we don't know what is the depth and breadth of that subject. So we cannot really fix a boundary. So what we have is that, let's start smart. Let's do small start and then we'll gradually increase it. So in this process, will learn, you will automatically identify where are the areas, how to go, how we can approach, how we can penetrate and how we can expand our testing ability, all these things.

So my take would be to go with a small start, but build a scalable foundations. It is very, very important to build a scalable foundations otherwise. Once we group our processes, our foundations may not support for the scalability. That's one thing. Again, it's going to be a general subject. It's not specific to any AI or LLM testing team.

Make it a team sport where all people work together. track what matters. Keep regular introspections. Align with the regulatory and ethical standards. Keep top on the regulatory enhancements or changes that are happening to the market or industry. So by that way, we can build all your organizations or teams in the LLM testing practices in the right direction.

Kavya (Director of Product Marketing, LambdaTest) - Thank you so much, Vyas. That's been a super enlightening answer as well as the session as a whole. It also helps that you have so many insights to share, not just on testing algorithms, but also how teams can actually function and utilize and implement the strategies in place. And of course, the real-world strategies that you have shared to navigate the challenges and reflections on how AI ethics can also give us so much to think about. It's a lot for the team.

Vyasaraj Padakandla (Practice Head - Digital Assurance, Canarys) - Yeah, the testing LLMs and artificial intelligence is very huge and enormous topic, and we cannot develop everything within a short span. So I covered what the important aspects that need to be addressed, which can be taken. Doesn't mean that whatever I said is just that that's it. There is enormous to that. So whenever you have any questions, feel free to ask me.

I'm here to answer all your questions. And thanks, Kavya, for giving me this opportunity and for taking me on this one and asking me the questions. In fact, this has really helped me to revisit my knowledge, to revisit my approach to testing the LLMs. And a couple of points have definitely been added after having this to my knowledge base, after having these sessions. So it was nice sessions and nice discussion with you, Kavya. Thanks for the opportunity.

Kavya (Director of Product Marketing, LambdaTest) - Thank you so much, Vyas. And to everyone who joined us, thank you for being a part of the XP Series. As Vyas mentioned, you can definitely get in touch with him. We'll be sharing his LinkedIn profile as well with you so that you can stay in touch and ask him any questions that you might have.

We hope you found this as informative and engaging as we did. So yeah, stay tuned for more episodes where we continue to explore how testing, innovation, and engineering excellence intersect. Until next time, take care and Keep testing. Thank you so much once again. Yes. Have a great day.

Vyasaraj Padakandla (Practice Head - Digital Assurance, Canarys) - Thanks, Kavya. Thank you all. Bye-bye.

Guest

Vyasaraj Padakandla

Practice Head - Digital Assurance

Vyasaraj Padakandla, Practice Head – Digital Assurance, Canarys Automations Ltd, brings 20 Years of diversified experience in Software Testing. Spearheading the QA Transformation Initiatives by adopting the latest and cutting-edge technologies, process reengineering, metrics-driven quality, and capability building in software testing. Leading a team of 50+ testers, handling 15+ accounts. My primary role is to oversee the entire practice and ensure efficient and effective testing processes that deliver high-quality software products and solutions, leading a team of over 50 QA engineers with diverse experience. Develop, implement, and maintain QA strategies, policies, and procedures to drive quality improvements and streamline testing workflows by adopting Agile and DevOps methodologies.

Host

Kavya

Director of Product Marketing, LambdaTest

With over 8 years of marketing experience, Kavya is the Director of Product Marketing at LambdaTest. In her role, she leads various aspects, including product marketing, DevRel marketing, partnerships, GTM activities, field marketing, and branding. Prior to LambdaTest, Kavya played a key role at Internshala, a startup in Edtech and HRtech, where she managed media, PR, social media, content, and marketing across different verticals. Passionate about startups, technology, education, and social impact, Kavya excels in creating and executing marketing strategies that foster growth, engagement, and awareness.