
Prompt2Box
Model prompts as axis-aligned hyperrectangles in n-d space.
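
As a rough illustration of the idea, an axis-aligned hyperrectangle can be stored as a pair of corner vectors, which makes volume, containment, and intersection cheap per-dimension checks. The sketch below is an assumption for illustration only, not Prompt2Box's actual code: the `Box` class, its method names, and the reading that a broader prompt's box contains a narrower prompt's box are all hypothetical.

```python
from dataclasses import dataclass
from math import prod
from typing import Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned hyperrectangle, stored as two corners."""
    lo: Tuple[float, ...]  # lower bound in each dimension
    hi: Tuple[float, ...]  # upper bound in each dimension

    def __post_init__(self) -> None:
        if len(self.lo) != len(self.hi):
            raise ValueError("lo and hi must have the same dimension")
        if any(l > h for l, h in zip(self.lo, self.hi)):
            raise ValueError("each lower bound must not exceed its upper bound")

    def volume(self) -> float:
        # Product of the side lengths along every axis.
        return prod(h - l for l, h in zip(self.lo, self.hi))

    def contains(self, other: "Box") -> bool:
        # True when `other` lies inside this box along every axis.
        return all(
            a <= c and d <= b
            for a, b, c, d in zip(self.lo, self.hi, other.lo, other.hi)
        )

    def intersect(self, other: "Box") -> Optional["Box"]:
        # Overlap region, or None when the boxes are disjoint on some axis.
        lo = tuple(max(a, c) for a, c in zip(self.lo, other.lo))
        hi = tuple(min(b, d) for b, d in zip(self.hi, other.hi))
        if any(l > h for l, h in zip(lo, hi)):
            return None
        return Box(lo, hi)


# Assumed interpretation: a broad prompt's box encloses a narrow prompt's box.
broad = Box((0.0, 0.0), (4.0, 4.0))
narrow = Box((1.0, 1.0), (2.0, 3.0))
print(broad.contains(narrow))
print(narrow.volume())
```

Because every operation reduces to independent comparisons per axis, the representation stays simple as the number of dimensions grows, which is one common reason to prefer boxes over arbitrary regions.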

An agentic framework for discovering novel, potent multi-turn jailbreak attacks, achieving an attack success rate of 67.3% on Claude Opus 4.1.

This paper shows that LLMs struggle with challenging multi-hop reasoning, but in subtler ways than older-generation PLMs. It does so by constructing adversarial attacks on HotpotQA using dependency parsing.