Abstract
With the growing adoption of Large Language Models (LLMs) in industrial applications, the need for reliable evaluation methodologies has become crucial. This thesis explores the key metrics, tools, and tasks involved in assessing the performance of LLM-based systems. The evaluation spans multiple dimensions of natural language generation (NLG), covering both technical requirements (accuracy, calibration, robustness) and social requirements (fairness, bias, toxicity), as well as more advanced tasks such as retrieval-augmented generation (RAG). Traditional metrics, whose performance is measured as the Pearson correlation with human-written references, are reviewed, and newer LLM-as-a-judge approaches are also explored. In addition, a benchmarking toolkit has been developed to facilitate the systematic evaluation of LLM performance across various tasks. This toolkit integrates multiple evaluation methodologies, enabling users to test, compare, and analyze LLM-based systems efficiently. The findings of this study provide insights into best practices for LLM evaluation, aiming to enhance the reliability and applicability of such systems in industrial contexts.