So here's my question about that study: how did they determine what the "correct" diagnosis is to compare the test physicians vs ChatGPT?
post mortems, lol
They likely used already diagnosed cases as the scenarios.
Diagnosed by humans with the same error rate? How do we know the baseline diagnoses were correct to begin with? If we're at the level of subtlety where an AI system can infer what's wrong better than a person -- a constellation of vague GI complaints rather than something obvious like a broken tibia -- can the baseline data really be considered reliable enough to be worth comparing against?
Basically: we're comparing AI and humans against a set of scenarios that were themselves created by humans. I dunno, I didn't dive too deeply into the study itself, but I'm always wary of data reliability.
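To make the worry concrete, here's a toy simulation with made-up numbers (not anything from the actual study): if the "gold standard" labels carry their own error rate, then even a diagnostician who is almost always right gets scored well below their true accuracy, and someone who merely mimics the labelers' mistakes can look just as good or better.

```python
import random

random.seed(0)

N = 100_000
LABEL_ERROR = 0.15     # assumed error rate in the study's "correct" diagnoses
TRUE_ACCURACY = 0.95   # how often the diagnostician is actually right

agree = 0
for _ in range(N):
    truth = 1                                          # the real (unknown) diagnosis
    reference = truth if random.random() > LABEL_ERROR else 0       # noisy study label
    answer = truth if random.random() < TRUE_ACCURACY else 0        # diagnostician's call
    agree += (answer == reference)

print(f"measured 'accuracy' against the noisy labels: {agree / N:.3f}")
# With independent errors this lands near 0.95*0.85 + 0.05*0.15 ≈ 0.815,
# well below the diagnostician's true 0.95.
```

Purely illustrative, obviously, but it shows why the quality of the reference diagnoses caps what any comparison like this can tell you.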
***Thank you for saying this.*** I had the same exact thought.
Excellent point, DC.
And therein lies the problem. AI only does well with a canned question for which there is limited data leading to only one answer. Real life isn't like that.
Scary part is: what if they used ICD-10 codes? We all know those get gamed to get insurance to pay.