Hallucination detection is essential for deploying large language models (LLMs) in real-world applications. When training and test data come from the same domain, existing hallucination detection methods can achieve strong performance, but they generalize poorly across domains. In this paper, we study an important but overlooked problem, generalizable hallucination detection (GHD), which aims to train a hallucination detector on data from a single domain while ensuring robust performance across different related domains. In studying GHD, we simulate multi-turn dialogues that follow the LLM's initial response and observe an interesting phenomenon: across domains, multi-turn dialogues initiated by a hallucinated response generally show larger fluctuations in uncertainty than those initiated by a factual response. Based on this phenomenon, we propose a new score, SpikeScore, which quantifies abrupt fluctuations over the course of a multi-turn dialogue. Through theoretical analysis and empirical evidence, we show that SpikeScore achieves strong cross-domain separability between hallucinated and non-hallucinated responses. Experiments on multiple LLMs and benchmarks show that the SpikeScore-based detection method outperforms representative baselines in cross-domain generalization and surpasses advanced generalization-oriented methods, validating the effectiveness of our approach for cross-domain hallucination detection.