Reinforcement Learning

Goal: Optimize Policy to Maximize Future Rewards

GRPO VS PPO


Policy Model
Reward Model
Reference Model
- KL divergence (measures how different two probability distributions are; here, how far the policy has drifted from the reference model)
Value Model
- estimates the long-term consequences of the policy's actions
- predicts the overall value of being in a certain state (the expected future rewards from that state)
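
To make the KL divergence note concrete, here is a minimal sketch that computes KL(policy || reference) for two small categorical distributions. The function name `kl_divergence` and the toy probabilities are made up for illustration; actual RL fine-tuning code usually works on per-token log-probabilities instead.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:  # terms with p_i = 0 contribute nothing
            total += pi * math.log(pi / max(qi, eps))
    return total

# Toy next-token distributions over a 3-token vocabulary (illustrative values).
policy = [0.7, 0.2, 0.1]
reference = [0.5, 0.3, 0.2]

kl_same = kl_divergence(reference, reference)  # identical distributions
kl_drift = kl_divergence(policy, reference)    # policy has moved away from the reference
print(f"KL(ref || ref)    = {kl_same:.4f}")
print(f"KL(policy || ref) = {kl_drift:.4f}")
```

The value is zero when the two distributions match and grows as the policy drifts, which is why it can be used as a penalty that keeps the policy close to the reference model.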

PPO (Proximal Policy Optimization)	GRPO (Group Relative Policy Optimization)
Generalized Advantage Estimation / GAE (uses the value model to estimate which actions actually contributed to success over time)	Group computation (samples a group of outputs for the same prompt and scores each one relative to the group's average reward, so no separate value model is needed)
For tasks that require care and precision	For flexible and complex tasks
Ex: financial market algorithms, medical tasks	Ex: self-driving cars
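
A minimal sketch of the two advantage estimates compared above, in plain Python. The function names, the toy rewards and values, and the choice of population standard deviation are assumptions for illustration, not taken from any particular library.

```python
import math

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one trajectory (PPO side).

    `values` needs len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # discounted sum of TD errors
        advantages[t] = running
    return advantages

def group_relative_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in group_rewards) / n)
    return [(r - mean) / (std + eps) for r in group_rewards]

# PPO side: a 3-step trajectory plus value-model predictions (toy numbers).
ppo_adv = gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.6, 0.8, 0.0])

# GRPO side: rewards of 4 outputs sampled for the same prompt (toy numbers).
grpo_adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

print("GAE advantages:           ", ppo_adv)
print("Group-relative advantages:", grpo_adv)
```

The GAE version depends on the value model's predictions, while the group-relative version only needs the rewards of the sampled group.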

Task 2: Multi‑Abnormality Classification

Phase 1 (ResNet3D)

Phase 2 (CT-CLIP)

Trials:

ResNet3D with BCELoss
1. Epoch 2 Summary | Train Loss: 0.3383 | Val Loss: 0.3199 | AUROC: 0.8657 | Accuracy: 0.1503 | Sensitivity: 0.4198 | Specificity: 0.9541
ResNet3D with FocalLoss and class weights
CT-CLIP pre-trained weights (freeze encoders) with FocalLoss and class weights
1. Epoch 10 Summary | Average Training Loss: 0.2971 | Validation AUROC: 0.5869 | Accuracy: 0.0000 | Sensitivity: 0.5385 | Specificity: 0.5745
  1. Needs improved accuracy
CT-CLIP pre-trained weights (freeze encoders) with FocalLoss
1. Epoch 1 Summary | Average Training Loss: 0.1198 | Validation AUROC: 0.6650 | Accuracy: 0.0969 | Sensitivity: 0.1429 | Specificity: 0.9620
2. Epoch 9 Summary | Average Training Loss: 0.1176 | Validation AUROC: 0.6978 | Accuracy: 0.0612 | Sensitivity: 0.0989 | Specificity: 0.9678
3. Epoch 10 Summary | Average Training Loss: 0.1173 | Validation AUROC: 0.6836 | Accuracy: 0.0867 | Sensitivity: 0.0251 | Specificity: 0.9938
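
For reference, a minimal plain-Python sketch of a multi-label focal loss with optional per-class weights. This is not the training code used in these trials. The function names, `gamma=2.0`, and the way the class weight multiplies both positive and negative terms are assumptions for illustration.

```python
import math

def binary_focal_loss(prob, target, gamma=2.0, weight=1.0, eps=1e-7):
    """Focal loss for one label: -weight * (1 - p_t)^gamma * log(p_t)."""
    prob = min(max(prob, eps), 1.0 - eps)      # clamp to avoid log(0)
    p_t = prob if target == 1 else 1.0 - prob  # probability assigned to the true class
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)

def multilabel_focal_loss(probs, targets, gamma=2.0, class_weights=None):
    """Mean focal loss over the labels of one sample."""
    if class_weights is None:
        class_weights = [1.0] * len(probs)
    losses = [
        binary_focal_loss(p, t, gamma, w)
        for p, t, w in zip(probs, targets, class_weights)
    ]
    return sum(losses) / len(losses)

probs = [0.9, 0.2, 0.1]  # sigmoid outputs for 3 abnormalities (toy values)
targets = [1, 1, 0]      # ground-truth labels

bce = multilabel_focal_loss(probs, targets, gamma=0.0)    # gamma = 0 reduces to BCE
focal = multilabel_focal_loss(probs, targets, gamma=2.0)  # easy labels are down-weighted
weighted = multilabel_focal_loss(
    probs, targets, gamma=2.0, class_weights=[1.0, 3.0, 1.0]
)
print(f"BCE: {bce:.4f} | Focal: {focal:.4f} | Focal + class weights: {weighted:.4f}")
```

Because the `(1 - p_t)^gamma` factor shrinks the loss on easy labels, focal-loss values are not directly comparable to BCE values across trials.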
CT-CLIP pre-trained weights (freeze encoders) with BCELoss
1. Epoch 2 Summary | Average Training Loss: 0.4419 | Validation AUROC: 0.6819 | Accuracy: 0.1071 | Sensitivity: 0.0000 | Specificity: 1.0000
CT-CLIP pre-trained weights (unfreeze encoders) with BCELoss and scheduler (ReduceLROnPlateau)
1. Epoch 1 Summary | Avg Train Loss: 0.4577 | Avg Val Loss: 0.4447 | Val AUROC: 0.6777 | Accuracy: 0.1071 | Sensitivity: 0.0000 | Specificity: nan
2. Epoch 5 Summary | Avg Train Loss: 0.3386 | Avg Val Loss: 0.4253 | Val AUROC: 0.7575 | Accuracy: 0.0918 | Sensitivity: 0.1962 | Specificity: 0.9602
CT-CLIP pre-trained weights (unfreeze encoders) with BCELoss and class weights and scheduler (ReduceLROnPlateau)
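
The summaries above do not say how Accuracy, Sensitivity, and Specificity are computed. The sketch below shows one possible reading, assuming micro-averaged sensitivity/specificity over all labels and exact-match (all labels correct per scan) accuracy. The helper names and toy labels are hypothetical. Under these assumptions, a model that predicts "no abnormality" everywhere gets Sensitivity 0 and Specificity 1 while exact-match accuracy stays at 0.

```python
def confusion_counts(preds, targets):
    """Count TP/FP/TN/FN over every (sample, label) pair (micro-averaging)."""
    tp = fp = tn = fn = 0
    for pred_row, target_row in zip(preds, targets):
        for p, t in zip(pred_row, target_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 0:
                tn += 1
            else:
                fn += 1
    return tp, fp, tn, fn

def safe_div(num, den):
    """Return nan instead of raising when a metric's denominator is empty."""
    return num / den if den else float("nan")

def summarize(preds, targets):
    tp, fp, tn, fn = confusion_counts(preds, targets)
    exact = sum(1 for p, t in zip(preds, targets) if p == t)
    return {
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "per_label_accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "exact_match_accuracy": safe_div(exact, len(targets)),
    }

# Toy multi-label ground truth: 3 scans x 4 abnormalities.
targets = [[1, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]
# A model that predicts "no abnormality" for every label.
preds = [[0, 0, 0, 0] for _ in targets]

metrics = summarize(preds, targets)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

If accuracy in these trials is computed per scan in this way, reporting per-label accuracy next to it would make the numbers easier to compare with AUROC.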
