Detecting hallucinations in large language models (LLMs) is an important open problem with significant implications for safety and reliability. Existing hallucination detection methods perform well on question-answering tasks but are less effective on tasks that require reasoning. In this study, we revisit hallucination detection through the lens of out-of-distribution (OOD) detection, a well-studied problem in fields such as computer vision. Treating an LLM's next-token prediction as a classification task allows OOD techniques to be applied, provided that appropriate modifications are made to account for the structural differences of LLMs. We show that our OOD-based approach yields training-free, single-sample detectors that achieve robust hallucination detection accuracy on reasoning tasks. Overall, our study suggests that reformulating hallucination detection as OOD detection offers a promising and scalable path toward language model safety.
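To make the "next-token prediction as classification" framing concrete, here is a minimal sketch that applies two classic OOD scores from computer vision, the maximum softmax probability (MSP) and the energy score, to the per-token logits of a single generated response, then averages them over the response. This is an illustration under stated assumptions, not the authors' method: the abstract does not specify which OOD scores are used or what "appropriate modifications" are made, and the helper names (`response_score`, `flag_hallucination`) and the threshold are hypothetical. It assumes you already have the vocabulary logits for each generated token (for example, from the model's forward pass during decoding).

```python
import math
from typing import Callable, List, Sequence


def log_sum_exp(logits: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(z_i)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def max_softmax_prob(logits: Sequence[float]) -> float:
    """MSP baseline: max_i softmax(z)_i. Higher means more in-distribution."""
    return math.exp(max(logits) - log_sum_exp(logits))


def energy_score(logits: Sequence[float], temperature: float = 1.0) -> float:
    """Energy E(z) = -T * logsumexp(z / T). Lower means more in-distribution."""
    return -temperature * log_sum_exp([x / temperature for x in logits])


def response_score(
    per_token_logits: List[Sequence[float]],
    score_fn: Callable[[Sequence[float]], float],
) -> float:
    """Hypothetical aggregation: treat each decoding step as one classification
    and average the per-token OOD score over the whole response. Only a single
    sampled response is needed, and nothing is trained."""
    scores = [score_fn(z) for z in per_token_logits]
    return sum(scores) / len(scores)


def flag_hallucination(per_token_logits: List[Sequence[float]], threshold: float) -> bool:
    """Hypothetical decision rule: flag responses whose mean MSP falls below
    a threshold chosen on held-out data (the value here is an assumption)."""
    return response_score(per_token_logits, max_softmax_prob) < threshold


if __name__ == "__main__":
    # Toy 3-word vocabulary: one peaked (confident) response, one flat (uncertain) one.
    confident = [[10.0, 0.0, 0.0], [8.0, 1.0, 0.0]]
    uncertain = [[1.0, 1.0, 1.0], [0.5, 0.4, 0.6]]
    print(response_score(confident, max_softmax_prob))
    print(response_score(uncertain, max_softmax_prob))
    print(flag_hallucination(uncertain, threshold=0.5))
```

Averaging per-token scores is only one of several plausible aggregation choices (minimum or lowest-k averaging are common alternatives); which works best for reasoning tasks is exactly the kind of LLM-specific adjustment the abstract alludes to.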