- LLM-based metrics have a potential issue: they may prefer LLM-generated texts over human-written texts, which could lead to self-reinforcement of LLMs if LLM-based metrics are used as the reward signal for improving the models themselves.
- G-EVAL is a prompt-based evaluator with three main components: 1) a prompt that contains the definition of the evaluation task and the desired evaluation criteria, 2) a chain-of-thoughts (CoT), a set of intermediate instructions generated by the LLM that describe the detailed evaluation steps, and 3) a scoring function that calls the LLM and calculates the score based on the probabilities of the returned tokens (see the sketch below).
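A minimal sketch of the scoring-function idea, assuming we have already extracted the LLM's probabilities for each candidate score token (e.g. "1" through "5") from the API's logprobs; the function name and the example numbers are illustrative, not from the paper:

```python
# Sketch of G-EVAL's probability-weighted scoring: the final score is the
# expected value of the score tokens under the LLM's output distribution.
# Assumes token_probs was built from the API's logprobs for the score tokens.

def g_eval_score(token_probs: dict[str, float]) -> float:
    """Expected score: sum_i p(s_i) * s_i, normalized over the score tokens."""
    total = sum(token_probs.values())
    if total == 0:
        raise ValueError("no probability mass on any score token")
    return sum(int(tok) * p / total for tok, p in token_probs.items())

# Probability mass spread across 4 and 5 yields a fine-grained continuous
# score instead of a bare integer rating.
print(g_eval_score({"1": 0.01, "2": 0.02, "3": 0.07, "4": 0.50, "5": 0.40}))  # ~4.26
```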
- Unlike G-EVAL, GPTScore formulates the evaluation task as a conditional generation problem rather than a form-filling problem: it scores a candidate text by the conditional likelihood the LLM assigns to it given an evaluation instruction (see the sketch below).
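To make the contrast concrete, here is a rough sketch of the conditional-generation idea: score a candidate by the average log-probability its tokens receive given an instruction-style prompt. GPT-2 is used only as a stand-in model and the prompt wording is invented; GPTScore itself uses larger, often instruction-tuned, models.

```python
# Sketch of GPTScore-style scoring: average log-prob of the candidate's
# tokens conditioned on an instruction prompt, under a causal LM.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def gpt_score(prompt: str, candidate: str) -> float:
    """Average log-probability of `candidate` tokens conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    cand_ids = tokenizer(candidate, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, cand_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Position i predicts token i+1, so drop the last position.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    start = prompt_ids.shape[1] - 1  # first position predicting a candidate token
    cand_lp = log_probs[0, start:start + cand_ids.shape[1]]
    token_lp = cand_lp.gather(1, cand_ids[0].unsqueeze(1)).squeeze(1)
    return token_lp.mean().item()

prompt = "Generate a fluent summary for the article.\nArticle: ...\nSummary:"
print(gpt_score(prompt, " The council approved the new budget on Tuesday."))
```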
- G-EVAL-4 always gives higher scores to GPT-3.5 summaries than to human-written summaries, even when human judges prefer the human-written ones.
- G-EVAL may be biased towards LLM-generated summaries because the model could share the same notion of the evaluation criteria during generation and evaluation.
- On the effect of chain-of-thoughts: comparing G-EVAL with and without CoT on the SummEval benchmark, the paper's Table 1 shows that G-EVAL-4 with CoT has higher correlation with human judgments than G-EVAL-4 without CoT on all dimensions, especially fluency. This suggests that CoT provides more context and guidance for the LLM to evaluate the generated text, and can also help explain the evaluation process and results (see the prompt-flow sketch below).
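A sketch of how the auto-generated CoT fits into the two-stage prompt flow: the evaluation steps are generated once from the task definition and criteria, then reused in every scoring prompt. `call_llm` is a hypothetical placeholder for whatever chat-completion API is available, and the prompt wording is paraphrased from the paper's description, not copied.

```python
# Illustrative two-stage G-EVAL prompt flow. Stage 1 generates the CoT
# evaluation steps once; Stage 2 embeds them in each scoring prompt.

TASK = ("You will be given one summary written for a news article. "
        "Your task is to rate the summary on one metric.")
CRITERIA = "Coherence (1-5): the collective quality of all sentences."

def call_llm(prompt: str) -> str:
    # Hypothetical helper: plug in your chat-completion API here.
    raise NotImplementedError

def generate_cot_steps() -> str:
    """Stage 1: have the LLM write the detailed evaluation steps (the CoT)."""
    return call_llm(f"{TASK}\n\nEvaluation criteria:\n{CRITERIA}\n\n"
                    "Please write the evaluation steps for this task.")

def build_scoring_prompt(cot_steps: str, source: str, summary: str) -> str:
    """Stage 2: reuse the generated CoT in the scoring prompt for each sample."""
    return (f"{TASK}\n\nEvaluation criteria:\n{CRITERIA}\n\n"
            f"Evaluation steps:\n{cot_steps}\n\n"
            f"Source text:\n{source}\n\nSummary:\n{summary}\n\n"
            "Coherence score (1-5):")
```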
Reference: Liu, Yang, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. "G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment." _arXiv preprint arXiv:2303.16634_ (2023). [Link](https://arxiv.org/pdf/2303.16634)