Several years ago, I developed a strange disease. Too much sitting and too much stress from my IT job took its toll. Describing the symptoms is very difficult. It feels like something is moving in my calves. It is not painful, but I'd rather be in pain than feel that strange sensation.
The doctors were clueless. They did all the tests. Ultrasound. Electromyoneurography. MRI of the lumbar spine. The radiologist was having so much fun with me that he suggested I should also do an MRI of my brain.
I kept looking for second opinions, but I never got a diagnosis. Not that the specialists lacked “great” ideas for experiments on me.
That is what happens when you don't have a run-of-the-mill disease.
Surprisingly, Microsoft, which isn't exactly known for being a medical company, may have a solution for finding the proper diagnosis, especially in difficult cases.
Microsoft is working on AI models for healthcare
Dominic King and Harsha Nori, members of the Microsoft (MSFT) Artificial Intelligence team, blogged on June 30th about their team's activities.
According to them, generative AI has advanced to the point of achieving near-perfect scores on the United States Medical Licensing Examination and similar exams. But those tests favor memorization over deep understanding, and memorization is exactly what AI does well.
The team is aware of this test's inadequacy and is working on improving the clinical reasoning of AI models, focusing on sequential diagnosis capabilities. That is the usual process you go through with a doctor: questions, tests, more questions or more tests, until a diagnosis is reached.
They developed a Sequential Diagnosis Benchmark based on 304 recent case records published in the New England Journal of Medicine.
These cases are extremely difficult to diagnose and often require multiple specialists and diagnostic tests to reach a diagnosis.
What they created reminds me of very old text-based adventure games. You can think of each case they used as a level you need to complete by giving a diagnosis.
You are presented with a case, and you can type in your questions or request diagnostic tests. You get responses, and you can continue with questions or tests until you figure out the diagnosis.
Obviously, to know which questions to type in, you have to be a doctor. And like a proper game, it keeps track of how much money you have spent on tests. The goal of the game is to reach the correct diagnosis while spending the least amount of money.
Because the game (pardon me, benchmark) is in the form of chat, it can be played by chatbots. They tested ChatGPT, Llama, Claude, Gemini, Grok, and DeepSeek.
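For readers who like to see the mechanics, here is a minimal sketch in Python of how such a chat-driven, pay-per-test diagnostic loop could be wired up. Everything in it — the ask_model and answer_question callables, the test prices — is my own illustrative assumption; Microsoft has not published the benchmark's code in that blog post.

```python
# Illustrative sketch only -- not Microsoft's benchmark code.
# Assumptions: `ask_model` and `answer_question` are hypothetical callables
# supplied by the caller, and the test prices are made up.

TEST_COSTS = {"ultrasound": 150, "emg": 400, "mri_lumbar": 900}  # hypothetical USD prices

def run_case(case_intro, ask_model, answer_question, max_turns=10):
    """Play one 'level': ask questions or order tests until a diagnosis is committed."""
    history = [("gatekeeper", case_intro)]
    total_cost = 0

    for _ in range(max_turns):
        action = ask_model(history)  # e.g. {"type": "order_test", "test": "emg"}
        if action["type"] == "diagnose":
            return action["diagnosis"], total_cost
        if action["type"] == "order_test":
            total_cost += TEST_COSTS.get(action["test"], 0)
        # Like a text adventure, the case reveals only what was explicitly asked for.
        history.append(("doctor", str(action)))
        history.append(("gatekeeper", answer_question(action)))

    return None, total_cost  # ran out of turns without a diagnosis
```

Scoring a run then comes down to comparing the returned diagnosis with the case's published answer and tallying the money spent.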
Microsoft AI Diagnostic Orchestrator
To better harness the power of the AI models, the team developed Microsoft AI Diagnostic Orchestrator (MAI-DxO). It emulates a virtual panel of physicians.
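The blog post doesn't spell out how MAI-DxO is built internally, but the "virtual panel" idea can be roughed out as several role-prompted calls to the same model, with a chair that makes the final call. The role texts and the query_llm helper below are my assumptions, not the actual orchestrator.

```python
# Rough sketch of a "virtual panel" pattern -- my guess at the idea, not MAI-DxO itself.
ROLES = [
    "You maintain and rank a differential diagnosis list.",
    "You pick the single cheapest test that best separates the top hypotheses.",
    "You play devil's advocate and attack the leading hypothesis.",
]

def panel_step(case_so_far, query_llm):
    """One round: each role comments, then a 'chair' decides the next action."""
    opinions = [query_llm(f"{role}\n\nCase so far:\n{case_so_far}") for role in ROLES]
    chair_prompt = (
        "You chair the panel. Based on the opinions below, either ask one question, "
        "order one test, or commit to a diagnosis.\n\n" + "\n---\n".join(opinions)
    )
    return query_llm(chair_prompt)
```

Whether Microsoft's orchestrator works this way in detail is unknown; the point is only that a panel of role-prompted agents can push a single model through the question-test-question loop the benchmark demands.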
MAI-DxO paired with OpenAI’s o3 was the most effective pairing, correctly solving 85.5% of the NEJM benchmark cases. They also evaluated 21 practicing physicians, each with at least five years of clinical experience.
These experts achieved a mean accuracy of 20%; however, they were denied access to colleagues and textbooks (and AI), as the team deemed that a fairer comparison.
I strongly disagree with the idea that the comparison is fair. If a doctor facing a difficult-to-diagnose issue does not consult a colleague, refer you to a specialist, or look through their books to jog their memory, what kind of doctor is that?
The team noted that further testing of MAI-DxO is needed to assess its performance on more common, everyday presentations.
However, there is an asterisk.
I write a lot about AI, and I think it is just pattern matching. The data on which models have been trained is typically not disclosed. If o3 was trained on NEJM cases, it's no wonder it can solve them. The same is true if it was trained on very similar cases.
Back to my issue. My friend, a retired pulmonologist, had a solution. Who'd ask a lung doctor about a disease affecting the legs?
Well, she is also an Ayurvedic doctor and a Yoga teacher. She thinks outside the box.
I was given a simple exercise that solved my problem. Years have passed, and if I stop doing it regularly, my symptoms return. What I know for sure is that no AI could ever come up with it.
Another problem is that even if this tool works and doctors start using it, they'll soon have less than a 20% success rate on the “benchmark” on their own. You lose what you don't use.