178 pages, 128 cases: a comprehensive evaluation of GPT-4V in the medical field, still far from clinical application and real-world decision-making

Shanghai Jiao Tong University and Shanghai AI Lab have released a 178-page case study of GPT-4V in medicine, the first comprehensive look at its visual performance in the medical field. Driven by large-scale foundation models, artificial intelligence has advanced rapidly of late, most notably with OpenAI's GPT-4. Its strong question-answering and knowledge capabilities have been a eureka moment for the AI field and have attracted wide public attention. GPT-4V(ision) is OpenAI's latest multi-modal foundation model. Compared with GPT-4, it adds image and voice input capabilities. The study evaluates GPT-4V(ision) on multi-modal medical diagnosis through case analysis. In total, it presents and analyzes 128 GPT-4V Q&A examples (92 radiology evaluation cases, 20 pathology evaluation cases, and 16 localization cases) with 277 images. (Note: this article does not reproduce the cases; see the original paper for the case displays and analysis.)

GPT-4V medical image evaluation

ArXiv link: https://arxiv.org/abs/2310.09909

Baidu cloud download address: https://pan.baidu.com/s/11xV8MkUfmF3emJQH9awtcw?pwd=krk2

Google Drive download address: https://drive.google.com/file/d/1HPvPDwhgpOwxi2sYH3_xrcaoXjBGWhK9/view?usp=sharing

Assessment capabilities:

Image modality and imaging location identification: Identify X-ray, CT, MRI, ultrasound, and pathology images, and determine the imaged body part.
Anatomical structure localization: Pinpoint specific anatomical structures in images.
Abnormality detection and localization: Detect and locate abnormalities such as tumors, fractures, or infections.
Multi-image comprehensive diagnosis: Combine information from different imaging modalities or views for diagnosis.
Medical report writing: Describe abnormal findings and relevant normal findings.
Patient history integration: Consider the patient's basic information and medical history in image interpretation.
Consistency and memory across multiple rounds of interaction: Maintain a consistent understanding of the data throughout the conversation.

Body systems assessed:

Central nervous system
Head and neck
Heart
Chest
Blood
Liver and gallbladder
Anorectal
Urology
Gynecology
Obstetrics
Breast
Musculoskeletal
Spine
Vascular
Oncology
Trauma
Pediatrics

Image modality:

X-ray
Computed Tomography (CT)
Magnetic Resonance Imaging (MRI)
Positron Emission Tomography (PET)
Digital Subtraction Angiography (DSA)
Mammography
Ultrasound
Pathology

Test case selection

In the original paper, the radiology Q&A cases come from Radiopaedia, with images downloaded directly from its web pages; the localization cases come from several public medical segmentation datasets; and the pathology images come from PathologyOutlines. When selecting cases, the authors considered the following:

Publication date: Because GPT-4V's training data is very likely to be extremely large, the authors selected only cases published in 2023, to reduce the chance that a test case appears in the training set.
Annotation credibility: Medical diagnosis is inherently contentious and ambiguous. Using the case completion rate reported by Radiopaedia, the authors tried to select cases with a completion rate above 90% so that the annotation or diagnosis is credible.
Image modality diversity: The authors tried to show GPT-4V's responses across as many imaging modalities as possible.

During image processing, the authors also applied the following normalization steps to ensure the quality of the input images:

Multiple image selection: GPT-4V accepts at most 4 input images, but some cases have more than 4 relevant images. The authors first tried to avoid such cases during selection; where that was unavoidable, they chose the most relevant images based on the case annotations provided by Radiopaedia.
Section selection: Much radiological image data is 3D (a sequence of consecutive 2D frames) and cannot be fed to GPT-4V directly, so a single most representative section has to stand in for the full 3D image. Radiopaedia's case upload guidelines ask radiologists to pick the most relevant section when uploading 3D images. The authors took advantage of this and used the axial sections recommended by Radiopaedia as input in place of the 3D data.
Image standardization: Standardizing a medical image means choosing a window width and window level; different windows highlight different tissues. For Radiopaedia cases, the authors used the window width and window level chosen by the radiologist who uploaded the case. For the segmentation datasets, the original paper uses a window of [-300, 300] and applies case-level normalization to the range 0-1.
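The windowing and case-level normalization described for the segmentation datasets can be sketched roughly as follows. This is an illustration only, not the authors' code: the function name `window_and_normalize` is hypothetical, the article does not state the intensity unit of the [-300, 300] window, and "case-level" is assumed here to mean min-max scaling over the whole volume after clipping rather than per slice.

```python
import numpy as np


def window_and_normalize(volume, low=-300.0, high=300.0):
    """Clip intensities to [low, high], then rescale the whole case to [0, 1]."""
    vol = np.asarray(volume, dtype=np.float32)
    clipped = np.clip(vol, low, high)
    # Assumption: "case-level" means one min and max for the whole volume,
    # not a separate min and max for every slice.
    vmin = float(clipped.min())
    vmax = float(clipped.max())
    if vmax == vmin:
        # A constant volume carries no contrast; return zeros instead of dividing by 0.
        return np.zeros_like(clipped)
    return (clipped - vmin) / (vmax - vmin)


# Toy 2x2x2 "volume" standing in for a real scan (values are made up).
volume = np.array(
    [[[-1000.0, -300.0], [0.0, 300.0]],
     [[150.0, 1200.0], [-150.0, 50.0]]]
)
normalized = window_and_normalize(volume)
print(normalized.min(), normalized.max())
```

Values outside the window are saturated before scaling, so extreme intensities do not compress the contrast of the tissue range of interest.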

All tests in the original paper used the web version of GPT-4V. The user supplies the images in the first round of Q&A and then continues with multiple rounds of questions. To keep the context of one case from influencing another, each new case is run in a fresh Q&A window.

GPT-4V Q&A case: in the figure, red marks errors, yellow marks uncertainty, and green marks correct content. The colors in the Reference indicate the basis for each judgment. Sentences without color are left for readers to judge for themselves. See the original paper for more cases and case analysis.

In the pathology evaluation, every image goes through two rounds of dialogue.

The first round asks whether a report can be generated from the input image alone.
The purpose of this round is to evaluate whether GPT-4V can identify the image modality and tissue origin without being given any relevant medical hints.
In the second round, the user provides the correct tissue source and asks GPT-4V whether it can make a diagnosis based on the pathology image and its tissue source, in the hope that GPT-4V will revise its report and give a clear diagnosis.
Pathological image case display

Location evaluation

Target recognition: Determine whether there is a target in the image.
Bounding box generation: Generate bounding box coordinates for the target, where the upper left corner is (0, 0) and the lower right corner is (w, h).
IOU calculation: Calculate the intersection-over-union ratio (IOU) between the predicted bounding box and the true bounding box.
Upper-bound performance: Select the predicted bounding box with the highest IOU score.
Average performance: Calculate the IOU score of the average bounding box.
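The IOU computation and the two summary scores listed above (highest IOU and IOU of the average bounding box) can be sketched as below. This is a minimal sketch under stated assumptions, not the paper's implementation: boxes are assumed to be (x1, y1, x2, y2) tuples with the origin at the upper left corner, several predictions are assumed to exist for the same image, and the "average bounding box" is assumed to be the per-coordinate mean of those predictions. All function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def upper_bound_iou(predictions, truth):
    """Best IOU achieved by any single prediction."""
    return max(iou(p, truth) for p in predictions)


def average_box_iou(predictions, truth):
    """IOU of the per-coordinate mean box (assumed meaning of 'average box')."""
    n = len(predictions)
    mean_box = tuple(sum(p[i] for p in predictions) / n for i in range(4))
    return iou(mean_box, truth)


# Made-up ground truth and three repeated predictions for one image.
truth = (10, 10, 50, 50)
predictions = [(10, 10, 50, 50), (30, 30, 70, 70), (20, 20, 60, 60)]
best = upper_bound_iou(predictions, truth)
average = average_box_iou(predictions, truth)
print(best, average)
```

Reporting both numbers separates what the model can do on its best attempt from what a user would get on a typical attempt, which matters when repeated predictions vary widely.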

Evaluation limitations

The original authors also acknowledge several shortcomings and limitations of the evaluation:
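Qualitative rather than quantitative evaluation only
Because GPT-4V was available only through an online web interface, test cases had to be uploaded manually. This limited the scalability of the evaluation, so the original report offers only a qualitative assessment.
Sample bias
All selected samples come from online websites and may not reflect the data distribution seen in everyday outpatient practice. In particular, most of the evaluated cases are abnormal, which may introduce bias into the evaluation.
Incomplete annotations and reference answers
The reference descriptions obtained from Radiopaedia or PathologyOutlines are mostly unstructured and do not follow a standardized radiology or pathology report format. In particular, most of them focus on describing the abnormality rather than giving a comprehensive description of the case, so they cannot serve as a direct point of comparison for a complete report.
2D slice input only
In real clinical settings, radiological images, including CT and MRI scans, are usually in 3D DICOM format. However, GPT-4V accepts at most four 2D images as input, so the original evaluation could only supply 2D key slices or, for pathology, small image patches.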
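In conclusion, the original authors believe that although the evaluation may not be exhaustive, the analysis still offers valuable insights to researchers and medical professionals: it sheds light on the current capabilities of multi-modal foundation models and may inspire future work on building foundation models for medicine.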
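
Key findings

The original evaluation report summarizes several performance characteristics of GPT-4V observed across the evaluated cases:

Radiology cases

Based on the 92 radiology evaluation cases and 16 localization cases, the authors report the following findings: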
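GPT-4V can identify the modality and imaging position of medical images
For most image content, GPT-4V handles tasks such as modality recognition, identification of the imaged body part, and determination of the image plane well. For example, the authors note that GPT-4V can easily distinguish between modalities such as MRI and CT.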
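GPT-4V can hardly make accurate diagnoses
The authors found that, on the one hand, OpenAI appears to have put safety mechanisms in place that strictly limit GPT-4V from giving direct diagnoses. On the other hand, even when the diagnosis is very obvious, GPT-4V's analytical ability is weak: it is limited to listing a range of possible diseases and cannot arrive at a more precise diagnosis.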
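GPT-4V can generate structured reports, but most of the content is wrong
In most cases GPT-4V can produce a fairly standard report. However, the authors consider that, compared with handwritten reports, which are more integrated and more flexible in content, GPT-4V tends to describe images one by one and lacks the ability to synthesize when given multi-modal or multi-frame images. As a result, most of the content has little reference value and lacks accuracy.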
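GPT-4V can recognize markers and text annotations in medical images, but cannot understand what their appearance in the image means
GPT-4V shows strong text recognition, marker recognition, and related capabilities, and it tries to use these marks in its analysis. However, the authors see two limitations. First, GPT-4V consistently over-relies on text and markers, so that the image itself becomes a secondary reference. Second, GPT-4V is not robust and often misinterprets the medical information in the image.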
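GPT-4V can identify implanted medical devices in images and their positions
In most cases, GPT-4V correctly identifies medical devices implanted in the body and locates them relatively accurately. The authors also found that the devices were identified correctly even in some of the more difficult cases where the diagnosis itself was wrong.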
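GPT-4V runs into analytical difficulties when given multiple images
The authors found that when given images of the same modality from different views, GPT-4V performs better than with a single input image, although problems remain. When given a mix of images from different modalities, it becomes even harder for GPT-4V to produce a sound analysis that integrates the information across modalities.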
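GPT-4V's predictions are easily steered by the patient's medical history
The authors found that whether a patient history is provided has a large effect on GPT-4V's answers. When a history is provided, GPT-4V often uses it as the key basis for inferring potential abnormalities in the image; when no history is provided, GPT-4V is more likely to analyze the image as a normal case.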
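GPT-4V cannot localize anatomical structures or abnormalities in medical images
The authors attribute GPT-4V's poor localization mainly to three things. First, the bounding boxes it predicts during localization are consistently far from the true boundaries. Second, repeated predictions on the same image show marked randomness. Third, GPT-4V shows a clear bias; for example, in brain MRI images it assumes the cerebellum must lie toward the bottom of the image.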
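GPT-4V can revise its existing answers over multiple rounds of user interaction
GPT-4V can correct its responses over a series of interactions. In the example shown in the paper, the authors input an MRI image of endometriosis. GPT-4V initially misclassified the pelvic MRI as a knee MRI and produced inaccurate output. The user corrected it over several rounds of interaction, and GPT-4V eventually gave an accurate diagnosis.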
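GPT-4V has a serious hallucination problem, in particular a tendency to describe the patient as normal even when the abnormal signal is very obvious
GPT-4V always generates reports that look complete and detailed in structure, but their content often describes the image as normal even when the abnormal region is obvious.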
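GPT-4V is not stable enough for medical question answering
GPT-4V's performance differs substantially between common and rare images, and also clearly between body systems. In addition, its analysis of the same medical image can yield inconsistent results as the prompt changes. For example, GPT-4V judges a given image to be abnormal under the prompt "What is the diagnosis of this brain CT?", yet when producing a report it treats the same image as normal. This inconsistency highlights that GPT-4V's performance in clinical diagnosis can be unstable and unreliable.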
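GPT-4V has strict safety restrictions in the medical domain
The authors found that GPT-4V has safeguards in place to prevent potential misuse in medical Q&A and to keep its use safe. For example, when asked for a diagnosis with a prompt such as "Please give the diagnosis for this chest X-ray", GPT-4V may refuse to answer or stress that it is "not a substitute for professional medical advice". In most cases, GPT-4V prefers phrases such as "appears to be" or "may be" to express uncertainty.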
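
Pathology cases

In addition, to explore GPT-4V's ability in report generation and medical diagnosis on pathology images, the authors ran patch-level tests on 20 pathology images of malignant tumors from different tissues and reached the following conclusions: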
GPT-4V can recognize the modality accurately
In all test cases, GPT-4V correctly identified the modality of every pathology image (H&E-stained histopathology images).
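GPT-4V can generate structured reports
Given a pathology image with no medical hints, GPT-4V can generate a structured, detailed report describing the image features. In 7 of the 20 cases, it listed its observations clearly, and even accurately, using terms such as "tissue structure", "cell characteristics", "stroma", "glandular structure", and "nuclei".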