Papers

Prompt2Box

Models prompts as axis-aligned hyperrectangles (boxes) in n-dimensional space.
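To make the box idea concrete, here is a minimal sketch of an axis-aligned hyperrectangle with containment, volume, and intersection operations. It is not taken from the paper: the `Box` class, its field names `lo` and `hi`, and the reading of "a more specific prompt sits inside a more general one" are assumptions made for illustration only.

```python
from dataclasses import dataclass
from math import prod
from typing import Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box: one [lo, hi] interval per dimension."""
    lo: Tuple[float, ...]
    hi: Tuple[float, ...]

    def __post_init__(self) -> None:
        if len(self.lo) != len(self.hi):
            raise ValueError("lo and hi must have the same dimension")
        if any(l > h for l, h in zip(self.lo, self.hi)):
            raise ValueError("each lower bound must not exceed its upper bound")

    def volume(self) -> float:
        # Product of side lengths across all dimensions.
        return prod(h - l for l, h in zip(self.lo, self.hi))

    def contains(self, other: "Box") -> bool:
        # True when `other` lies inside this box in every dimension.
        return all(
            sl <= ol and oh <= sh
            for sl, sh, ol, oh in zip(self.lo, self.hi, other.lo, other.hi)
        )

    def intersection(self, other: "Box") -> Optional["Box"]:
        # Overlap of two boxes, or None when they are disjoint.
        lo = tuple(max(a, b) for a, b in zip(self.lo, other.lo))
        hi = tuple(min(a, b) for a, b in zip(self.hi, other.hi))
        if any(l > h for l, h in zip(lo, hi)):
            return None
        return Box(lo, hi)


# Assumed interpretation: a broad prompt is a large box, a narrower one sits inside it.
general = Box((0.0, 0.0), (4.0, 4.0))
specific = Box((1.0, 1.0), (2.0, 3.0))
unrelated = Box((3.0, 3.0), (5.0, 5.0))

print(general.contains(specific))        # True
print(general.intersection(unrelated))   # Box(lo=(3.0, 3.0), hi=(4.0, 4.0))
print(specific.intersection(unrelated))  # None
```

Because the boxes are axis-aligned, every operation reduces to independent per-dimension interval checks, which is what makes this representation cheap to work with.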

PLAGUE: Plug-And-Play framework for life-long adaptive generation of multi-turn exploits

Agentic framework for discovering novel, potent multi-turn jailbreak attacks that achieve an attack success rate of 67.3% on Claude Opus 4.1.

Seemingly Plausible Distractors

This paper shows that LLMs struggle with challenging multi-hop reasoning, but in subtler ways than older-generation pre-trained language models (PLMs). It does so by constructing an adversarial attack on HotpotQA using dependency parsing.