
Conversation

@steven10a (Collaborator)

  • Updating docs with the new multi-turn results
  • Also fixed an error in the roc_auc graphic generation that was causing the plotted values to differ from those in the tables

CopilotAI review requested due to automatic review settings November 18, 2025 18:30

CopilotAI left a comment


Pull Request Overview

This PR fixes a bug in the ROC AUC calculation and updates the benchmark documentation with new multi-turn evaluation results. The key fix changes the ROC AUC computation from using np.trapz to the correct roc_auc_score function; the old computation was causing discrepancies between the visualizations and the reported metrics (a sketch of the change follows the list below).

  • Corrected ROC AUC calculation method in visualization code
  • Refactored data extraction to use confidence scores instead of binary values
  • Updated benchmark metrics and latency results in documentation
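
For context, here is a minimal sketch of the change described above, assuming scikit-learn's metrics API; the toy data, variable names, and the exact shape of the old code are illustrative, not taken from the repo:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy data: ground-truth labels and guardrail confidence scores.
y_true = [0, 0, 1, 1, 0, 1]
y_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]

# Old approach (per the PR description): integrate the ROC curve manually.
# np.trapz is easy to misuse, e.g. on unordered points or on binary 0/1
# "scores", which is how a plotted AUC can drift from the tabulated one.
fpr, tpr, _ = roc_curve(y_true, y_scores)
auc_manual = np.trapz(tpr, fpr)

# Fixed approach: compute AUC directly from labels and scores.
auc = roc_auc_score(y_true, y_scores)
print(auc_manual, auc)
```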

Reviewed Changes

Copilot reviewed 2 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/guardrails/evals/core/visualizer.py | Fixed ROC AUC calculation bug and improved score extraction to use confidence values |
| docs/ref/checks/jailbreak.md | Updated benchmark metrics, latency data, and ROC curve image reference |



@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 144 to +163
```diff
 y_scores = []

 for result in results:
-    if guardrail_name in result.expected_triggers:
-        expected = result.expected_triggers[guardrail_name]
-        actual = result.triggered.get(guardrail_name, False)
+    if guardrail_name not in result.expected_triggers:
+        logger.warning("Guardrail '%s' not found in expected_triggers for sample %s", guardrail_name, result.id)
+        continue

-        y_true.append(1 if expected else 0)
-        y_scores.append(1 if actual else 0)
+    expected = result.expected_triggers[guardrail_name]
+    y_true.append(1 if expected else 0)
+    y_scores.append(self._get_confidence_score(result, guardrail_name))

 return y_true, y_scores
+
+def _get_confidence_score(self, result: Any, guardrail_name: str) -> float:
+    """Extract the model-reported confidence score for plotting."""
+    if guardrail_name in result.details:
+        guardrail_details = result.details[guardrail_name]
+        if isinstance(guardrail_details, dict) and "confidence" in guardrail_details:
+            return float(guardrail_details["confidence"])
+
+    return 1.0 if result.triggered.get(guardrail_name, False) else 0.0
```
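
Note the fallback in _get_confidence_score: when no confidence value is present in result.details, the score degrades to the binary decision (1.0 if the guardrail triggered, else 0.0), matching the previous behavior for those samples.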


P1: Misinterpreting guardrail confidence as positive score

The new _extract_roc_data now feeds _get_confidence_score directly into the ROC calculator, so benign samples that the guardrail confidently classified as safe (e.g., flagged=False, confidence=0.95 in tests/unit/checks/test_jailbreak.py lines 270-282) are given a score of 0.95 and ranked as if they were very likely jailbreaks. The confidence field represents certainty in the decision that was made, not the probability that the sample is positive, so ignoring result.triggered in _get_confidence_score inverts the ranking for the majority of benign data. This makes the ROC curve and all downstream metrics grossly misleading, because negative samples with high confidence are interpreted as highly positive. The previous implementation at least plotted the actual binary decisions; after this commit the visualizations are outright wrong unless the score is conditioned on whether the guardrail fired (e.g., use confidence when triggered and 1 - confidence when not).
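
A minimal sketch of the conditioning Codex proposes here, assuming confidence means certainty in the decision that was made (a reading the author disputes below); the helper name is hypothetical:

```python
def positive_score(triggered: bool, confidence: float) -> float:
    """Map a (decision, decision-confidence) pair to a 'sample is positive' score."""
    # High confidence in a non-trigger decision should rank the sample low.
    return confidence if triggered else 1.0 - confidence
```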


@steven10a (Collaborator, Author) left a comment

"confidence" may be a poor name for it, but our model returns 0-1 with 0 being not a jailbreak and 1 being a jailbreak. It is confidence that the content is a jailbreak. So this logic from codex is not correct.

@gabor-openai self-requested a review November 19, 2025 18:07
@gabor-openai merged commit 779139c into main November 19, 2025 (3 checks passed)
@gabor-openai deleted the dev/steven/jb_eval branch November 19, 2025 18:08