This paper empirically evaluates the ability of current Large Language Models (LLMs) to analyze macrofinancial coverage in IMF Article IV staff reports, using human economists' assessments as a benchmark. We test several GPT models on reports from 2016–2024, assessing their performance on both qualitative ratings and binary questions. Our findings indicate that the latest models can meaningfully assist economists: in 2024, across advanced GPT models, they achieve an average accuracy of 71–75% on ratings and an average exact-match rate of 76–81% on binary questions. However, we find that LLMs tend to assign higher, less-dispersed ratings than human experts and struggle with open-ended questions that require deep contextual judgment. The paper provides quantitative evidence on current LLM accuracy in this domain, explores the drivers of model performance, and discusses key limitations such as optimistic bias.