SPC365

Large Language Models Versus Physicians in Cardiovascular Risk Stratification

Session: Oral Communications Session 09 – Artificial Intelligence and Decision-Making in Cardiovascular Risk and Health Systems
Speaker: José Ferreira Santos

Congress: CPC 2026
Topic: N. E-Cardiology / Digital Health, Public Health, Health Economics, Research Methodology
Theme: 33. e-Cardiology / Digital Health
Subtheme: 33.4 Digital Health
Session Type: Oral Communications
FP Number: ---
Authors: Jose Ferreira Santos; Regina de Brito Duarte; Inês Mota; Rita Carvalheira Santos; Jose Maria Moreira; Joana Campos; Nuno André Silva; Bernardo Neves; Ricardo Ladeiras-Lopes; Francisca Leite; Helder Dores

Background: Large language models (LLMs) show promise in medical reasoning, but their reliability relative to practicing clinicians remains poorly characterized. Benchmarking models against the variability of real-world clinical judgment, not only against guidelines, is essential to define their role in practice.

Objectives: This exploratory study compared 11 contemporary LLMs with a diverse cohort of practicing physicians for cardiovascular risk stratification, focusing on classification accuracy, inter-rater variability, and safety-critical errors.

Methods: In this vignette-based benchmark of 11 LLMs and 8 physicians, we used 30 validated synthetic clinical vignettes requiring cardiovascular risk stratification. Eight physicians (3 Family Medicine, 3 Internal Medicine, 2 Cardiology), all with >3 years’ specialty experience, independently classified vignettes into three ESC 2021 risk categories. Their performance contextualized that of 11 LLMs from six major families (GPT, Claude, Gemini, Llama, Grok, DeepSeek). Agreement with an expert-adjudicated gold standard was assessed using quadratic-weighted Cohen’s kappa (κw). Inter-rater reliability was quantified with Gwet’s AC2, and a majority-vote physician ensemble was constructed.
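The agreement metric and the ensemble described in the Methods can be illustrated with a minimal Python sketch. It is not the study's analysis code. It assumes the three risk categories are coded as ordinal integers 0, 1, 2 (an encoding chosen here for illustration), and the tie-breaking rule in the majority vote (ties go to the higher risk category) is a hypothetical choice, since the abstract does not state one. Gwet's AC2 is not covered.

```python
from collections import Counter

def quadratic_weighted_kappa(rater, gold, n_categories=3):
    """Quadratic-weighted Cohen's kappa between one rater and the gold standard.

    Labels are assumed to be integers in 0..n_categories-1 (illustrative encoding).
    """
    k = n_categories
    n = len(gold)
    # Observed confusion counts: rows = rater label, columns = gold label.
    observed = [[0] * k for _ in range(k)]
    for r, g in zip(rater, gold):
        observed[r][g] += 1
    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic disagreement weight: 0 on the diagonal, 1 at maximum distance.
            w = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        # Both raters used a single identical category; kappa is undefined,
        # reported here as perfect agreement by convention of this sketch.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected

def majority_vote(ratings):
    """Combine several raters (list of per-rater label lists) into one ensemble.

    Ties are broken toward the higher risk category (hypothetical rule).
    """
    ensemble = []
    for labels in zip(*ratings):
        counts = Counter(labels)
        top = max(counts.values())
        ensemble.append(max(label for label, c in counts.items() if c == top))
    return ensemble
```

For example, `quadratic_weighted_kappa(majority_vote(physician_ratings), gold)` would score an ensemble against the gold standard, where `physician_ratings` and `gold` are placeholder names for the per-physician labels and the adjudicated labels.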

Results: LLM agreement with the gold standard ranged from moderate to substantial (κw 0.40–0.69). Individual physicians showed wider variability, with κw 0.15–0.93. Inter-rater reliability among physicians was moderate (AC2=0.44). The top-performing model, GPT-4o (κw=0.69), outperformed 7 of 8 individual physicians, but the pooled physician ensemble (κw=0.76) exceeded all LLMs. Error analysis revealed distinct safety profiles: physicians made no major two-level misclassifications (low/moderate vs very high risk), whereas several LLMs did so in 2–13% of cases.
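The two-level error rate reported above can be expressed as a short Python sketch. It assumes the same illustrative ordinal coding (0 = low/moderate, 1 = high, 2 = very high), which is not stated in the abstract, and the function name is hypothetical.

```python
def two_level_error_rate(predicted, gold):
    """Fraction of cases misclassified by two ordinal levels.

    With three categories coded 0, 1, 2 (assumed encoding), a two-level error
    means low/moderate was labeled very high, or the reverse.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    major = sum(1 for p, g in zip(predicted, gold) if abs(p - g) == 2)
    return major / len(gold)
```

A rater who never confuses the lowest and highest categories scores 0.0 on this measure, regardless of how many one-level errors they make.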

Conclusions: In this vignette-based benchmark, top-tier LLMs matched or exceeded the performance of most individual physicians but remained inferior to collective human consensus and were uniquely prone to rare yet critical two-level errors. These findings highlight both the promise and safety limitations of current LLMs and underscore the need for further validation before clinical use.
