On Thursday, OpenAI researchers unveiled CriticGPT, a new AI model designed to identify bugs in code generated by ChatGPT. It aims to improve the process of making artificial intelligence systems behave in ways humans want (called “alignment”) through Reinforcement Learning from Human Feedback (RLHF), which helps human reviewers make large language model (LLM) outputs more accurate.
As described in a new research paper called “LLM Critics Help Catch LLM Bugs,” OpenAI created CriticGPT to act as an AI assistant for the human trainers who review programming code generated by the AI assistant ChatGPT. CriticGPT, based on the GPT-4 family of LLMs, analyzes code and points out potential errors, making it easier for humans to spot mistakes that might otherwise go unnoticed. The researchers trained CriticGPT on a dataset of code samples with intentionally inserted bugs, teaching it to recognize and flag various coding errors.
The development of CriticGPT involved training the model on a large number of inputs containing deliberately inserted bugs. Human trainers were asked to modify code written by ChatGPT, introducing errors and then providing example feedback as if they had discovered those errors themselves. This process allowed the model to learn how to identify and critique different types of coding errors.
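To make that setup concrete, here is a hypothetical sketch of what one such training example might look like: working code, a copy with a deliberately inserted bug, and the trainer’s example critique written as if the bug had just been discovered. The dictionary layout and field names are illustrative assumptions, not OpenAI’s actual data format.

```python
# Hypothetical illustration of a single training example of the kind described
# above: working code, a tampered copy with an inserted bug, and the trainer's
# example critique. Field names are assumptions, not OpenAI's real format.
original_code = """def average(values):
    return sum(values) / len(values)
"""

tampered_code = """def average(values):
    return sum(values) / (len(values) - 1)
"""

training_example = {
    "task": "Write a function that returns the mean of a list of numbers.",
    "answer": tampered_code,  # the bugged code the critic model must examine
    "critique": (
        "The function divides by len(values) - 1 instead of len(values), "
        "so average([2, 2]) returns 4.0 rather than 2.0."
    ),
}

print(training_example["critique"])
```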
In experiments, CriticGPT demonstrated its ability to catch both inserted bugs and naturally occurring errors in ChatGPT’s output. The new model’s critiques were preferred by trainers over those written by ChatGPT itself in 63 percent of cases involving naturally occurring bugs. This preference was in part because CriticGPT produced fewer unhelpful “nitpicks” and generated fewer false positives, or hallucinated problems.
The researchers also developed a new technique they call Force Sampling Beam Search (FSBS). This method helps CriticGPT write more detailed code reviews: it lets researchers adjust how thorough the model is in searching for problems while also controlling how often it hallucinates issues that don’t actually exist. That balance can be tuned depending on the needs of different AI training tasks.
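The paper does not ship reference code, but the trade-off can be illustrated with a small sketch: sample several candidate critiques of the same code, score each with a reward model, and add a tunable bonus for how many issues a critique flags. This only shows the score-and-select half of the idea (the actual method also forces the model to quote code excerpts while sampling), and every name below is a hypothetical stand-in rather than OpenAI’s API.

```python
from dataclasses import dataclass

# Simplified sketch of the score-and-select trade-off described above.
# All names are hypothetical stand-ins, not OpenAI's implementation.

@dataclass
class Critique:
    text: str
    num_flagged_issues: int  # how many distinct problems the critique points out
    reward_score: float      # a reward model's estimate of the critique's quality

def select_critique(candidates: list[Critique], issue_bonus: float) -> Critique:
    """Pick the candidate balancing comprehensiveness against hallucinated bugs.

    A larger issue_bonus favors critiques that flag more problems (more thorough,
    but riskier); a smaller value favors conservative critiques with fewer
    false positives.
    """
    return max(
        candidates,
        key=lambda c: c.reward_score + issue_bonus * c.num_flagged_issues,
    )

# Usage: sample several candidate critiques of the same code, then tune
# issue_bonus per task to decide how aggressive the reviews should be.
candidates = [
    Critique("Flags the off-by-one error in the loop bound.", 1, 0.82),
    Critique("Flags the loop bound plus two speculative issues.", 3, 0.74),
]
print(select_critique(candidates, issue_bonus=0.01).text)
```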
Interestingly, the researchers found that CriticGPT’s capabilities extend beyond just code review. In their experiments, they applied the model to a subset of ChatGPT training data that had previously been rated as flawless by human annotators. Surprisingly, CriticGPT identified errors in 24 percent of these cases—errors that were later confirmed by human reviewers. OpenAI thinks this shows the model’s potential to generalize to non-coding tasks and highlights its ability to catch subtle errors that even careful human evaluation might miss.
Despite its promising results, CriticGPT, like all AI models, has limitations. The model was trained on relatively short ChatGPT answers, which may not fully prepare it for evaluating the longer, more complex tasks that future AI systems may handle. Additionally, while CriticGPT reduces confabulations, it does not eliminate them entirely, and human trainers can still make labeling mistakes based on those spurious outputs.
The research team acknowledges that CriticGPT is most effective at identifying bugs that can be pinpointed at a specific location in the code. However, real-world mistakes in AI outputs are often spread across multiple parts of an answer, which presents a challenge for future model iterations.
OpenAI plans to integrate CriticGPT-like models into its RLHF labeling pipeline, giving its trainers AI assistance. For OpenAI, it is a step toward developing better tools for evaluating outputs from LLM systems that may be difficult for people to rate without additional support. However, the researchers caution that even with tools like CriticGPT, extremely complex tasks or responses may still prove challenging for human raters, even those aided by AI.