Using LLMs to Facilitate Formal Verification of RTL

cross-posted from: https://discuss.tchncs.de/post/3979328

Engineers in Princeton managed to train GPT4 and extend AutoSVA to generate SVA (systemverilog assertions) from buggy RTL and functionality description. SVA is widely used to verify digital design for ASIC and FPGAs. AutoSVA2, which extends open-source AutoSVA, improves the flow to generate SVA from English description. LLM was trained in multiple iterations to generate SVA with correct syntax, which is something GPT fails to do by itself. Authors argue that GPT’s “creativity” allows it to write correct assertion even from a buggy RTL. Later authors used this tool to write RTL from scratch as well. RTL written by GPT was tested against the SVA generated by this tool, and SVA corrected by an engineer was fed back to LLM, which generated functionally correct FIFO queue in a few iterations.

Abstract—Formal property verification (FPV) has existed for decades and has been shown to be effective at finding intricate RTL bugs. However, formal properties, such as those written as SystemVerilog Assertions (SVA), are time-consuming and error- prone to write, even for experienced users. Prior work has attempted to lighten this burden by raising the abstraction level so that SVA is generated from high-level specifications. However, this does not eliminate the manual effort of reasoning and writing about the detailed hardware behavior. Motivated by the increased need for FPV in the era of heterogeneous hardware and the advances in large language models (LLMs), we set out to explore whether LLMs can capture RTL behavior and generate correct SVA properties. First, we design an FPV-based evaluation framework that measures the correctness and completeness of SVA. Then, we evaluate GPT4 iteratively to craft the set of syntax and semantic rules needed to prompt it toward creating better SVA. We extend the open-source AutoSVA framework by integrating our improved GPT4-based flow to generate safety properties, in addition to facilitating their existing flow for liveness properties. Lastly, our use cases evaluate (1) the FPV coverage of GPT4-generated SVA on complex open-source RTL and (2) using generated SVA to prompt GPT4 to create RTL from scratch. Through these experiments, we find that GPT4 can generate correct SVA even for flawed RTL—without mirroring design errors. Particularly, it generated SVA that exposed a bug in the RISC-V CVA6 core that eluded the prior work’s evaluation.