Ben Cohen-Wang

I'm a PhD student in EECS at MIT where I am fortunate to be advised by Aleksander Mądry. I received my BS in computer science from Stanford, and spent two great years at Robust Intelligence before starting my PhD.

I am interested in how we can develop AI systems that can be safely deployed. Outside of research, I enjoy running, climbing, skiing, tennis and volleyball.

For a full list of papers, see Google Scholar.

ContextCite: Attributing Model Generation to Context NeurIPS 2024

Benjamin Cohen-Wang*, Harshay Shah*, Kristian Georgiev*, Aleksander Mądry

Language models may need external information to provide a response to a given query. We would provide this information as context and expect the model to interact with it when responding to the query. But how would we know if the model actually used the context, misinterpreted anything, or made something up? We present ContextCite, a method for attributing statements generated by language models back to specific information provided in-context.

Paper Blog post #1 Blog post #2 Demo Code

Ask Your Distribution Shift if Pre-Training is Right for You TMLR 2025

Benjamin Cohen-Wang, Joshua Vendrow, Aleksander Mądry

Pre-training on a large and diverse general-purpose dataset and then fine-tuning on a task-specific dataset can be an effective approach for developing models that are robust to distribution shifts. In practice, this method helps significantly in some cases, but not at all in others. In this work, we characterize the failure modes that pre-training can and cannot address.

Paper Blog post #1 Blog post #2 Code

Learning to Attribute with Attention Preprint

Benjamin Cohen-Wang, Yung-Sung Chuang, Aleksander Mądry

It can be helpful to pinpoint the in-context information that a language model uses when generating content (Is it using provided documents? Or its own intermediate thoughts?). We present Attribution with Attention (AT2), a method for doing so efficiently and reliably.

Paper Demo Code

Using ContextCite for LLM reliability May 6, 2024

We use ContextCite to detect unverified statements and discover poisoned documents.

ContextCite: Attributing Model Generation to Context May 6, 2024

We present ContextCite, a method for attributing statements generated by language models back to specific information provided in-context.

How Can We Harness Pre-Training to Develop Robust Models? March 4, 2024

We explore a simple principle for harnessing pre-training to develop robust models.

Ask Your Distribution Shift if Pre-Training is Right for You March 4, 2024

We study the robustness benefits of pre-training and characterize failure modes that pre-training can and cannot address.

Attribute (or cite) statements generated by LLMs back to in-context information.

GitHub repository Hugging Face demo PyPI package

Efficiently attribute generated content to in-context information using attention.

GitHub repository Demo PyPI package