Public chatbots like ChatGPT are useful tools for answering general questions, but they fall short when it comes to specialized knowledge such as how to use Automatic Test Equipment (ATE). The models powering public chatbots were not trained on proprietary topics such as test techniques, instrument specifications, and ATE programming. Moreover, ChatGPT and other chatbots are known to “hallucinate” inaccurate answers to fill gaps in their knowledge.
Another concern with public chatbots is the security of Intellectual Property (IP). Chatbot providers typically use conversation content to retrain their language models, a practice that has already produced some spectacular leaks of confidential information from unwary business users. The last thing chip makers want is for their proprietary test techniques to reach competitors because a chatbot vendor used conversation data to train a future model.
How can these challenges around answer accuracy and data security be overcome?
Cloud-based chatbots can take advantage of world-class security capabilities, with all data encrypted within a private cloud tenant. It is then up to the chatbot vendor to implement procedures that handle conversation data and feedback in a way consistent with user confidentiality requirements. User feedback is essential for monitoring and improving chatbot quality, and with appropriate care it can be collected while respecting user IP and Personally Identifiable Information (PII).
Answer accuracy is trickier, but it starts with the Retrieval-Augmented Generation (RAG) chatbot pattern. This approach uses a Large Language Model (LLM) capability called “embeddings” to look up relevant reference documents in real time for each user query: embeddings map text to numeric vectors, so semantically related passages can be found by nearest-neighbor search. These human-written documents fill the gaps in the LLM’s knowledge and rein in hallucinations, among other benefits.
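To make the retrieval step concrete, here is a minimal Python sketch of embedding-based lookup. The embedding model, the sample documents, and the retrieve helper are illustrative assumptions, not details of any particular product:

```python
# Minimal sketch of the embedding-based retrieval step in a RAG pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model could stand in here

# Ingestion: embed the human-written reference documents once, up front.
docs = [
    "Chapter 3: configuring the digital pin electronics before a scan test.",
    "Appendix B: instrument accuracy specifications for the DC source/measure unit.",
    "Tutorial: writing a basic test program for the ATE pattern generator.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec  # cosine similarity, since the vectors are normalized
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

# The retrieved passages are then pasted into the LLM prompt as grounding context.
context = "\n".join(retrieve("How do I program the pattern generator?"))
```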
The basic RAG pattern is straightforward enough, but the devil is in the details. Compounding errors can be introduced at every stage, including document ingestion, document retrieval, and prompt engineering. What kind of testing can uncover such problems and help determine whether things are getting better or worse as the system evolves?
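One practical answer is a retrieval regression test: a golden set of questions paired with the content each should surface, scored with a metric such as recall@k. The sketch below takes a retrieve function like the one above as a parameter; the golden entries and the pass threshold are hypothetical values for illustration:

```python
# Sketch of a retrieval regression test over a golden set of
# (question, expected substring in the retrieved context) pairs.
from typing import Callable

GOLDEN_SET = [
    ("How do I program the pattern generator?", "pattern generator"),
    ("What is the accuracy of the DC source/measure unit?", "accuracy specifications"),
]

def recall_at_k(golden: list[tuple[str, str]],
                retrieve_fn: Callable[[str, int], list[str]],
                k: int = 3) -> float:
    """Fraction of golden questions whose expected text appears in the top-k documents."""
    hits = sum(
        1 for question, expected in golden
        if any(expected in doc for doc in retrieve_fn(question, k))
    )
    return hits / len(golden)

def test_retrieval_quality():
    # Re-run after every change to chunking, embedding model, or index settings;
    # a drop below the baseline flags a regression at ingestion or retrieval
    # before it ever surfaces as a bad answer.
    assert recall_at_k(GOLDEN_SET, retrieve) >= 0.90  # `retrieve` from the sketch above
```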
Achieving sufficient answer quality requires sophisticated testing approaches and the adaptation of a wide range of quality-improvement techniques from the community of researchers and AI engineers implementing RAG systems across every industry. This talk will describe many such techniques, along with the secondary benefits of RAG and how repeatable test frameworks can be built to ensure chatbot quality before, during, and after release. This information can help those building their own chatbots, and it can help chatbot users understand the limitations of such systems and gain trust in what they do well.
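As one example of a repeatable check that can run at each of those stages, the sketch below grades generated answers against reference material with an LLM acting as judge, a technique common in the RAG evaluation literature. The model name, prompt wording, and scoring scale are assumptions for illustration, not a definitive implementation:

```python
# Sketch of an end-to-end answer-quality check using an LLM as a grader
# ("LLM-as-judge").
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Rate how well the ANSWER is supported by the REFERENCE on a
scale of 1 (contradicts or invents facts) to 5 (fully grounded). Reply with
only the number.

REFERENCE: {reference}
ANSWER: {answer}"""

def grade(answer: str, reference: str) -> int:
    """Score a chatbot answer for groundedness against a reference passage."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(reference=reference, answer=answer),
        }],
    )
    return int(response.choices[0].message.content.strip())

# Graders like this can score a release candidate's answers before shipping,
# sampled production traffic during operation, and user-flagged conversations
# after release, giving a consistent quality signal across all three phases.
```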