Awesome-Multi-Turn-LLMs

A curated list of papers, datasets, and code repositories for multi-turn interactions with Large Language Models. This repository covers most of the research in the multi-turn LLM field, though it may not be exhaustive.

Our detailed survey of multi-turn LLMs, covering task types, improvement methods, and open challenges, is available here: Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models.

If you notice any missing research works or spot inaccuracies, feel free to reach out or open an issue. We also welcome submissions of multi-turn related work from everyone!

Audio demo: Play the survey audio

Multi-Turn LLM Tasks

Instruction Following Tasks

Instruction Following in General (Mixed)

Judging llm-as-a-judge with mt-bench and chatbot arena [NeurIPS 2023] [GitHub]
Bigbench: Towards an industry standard benchmark for big data analytics [SIGMOD 2013]
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge [arXiv]
Measuring Massive Multitask Language Understanding [ICLR 2021]
Training verifiers to solve math word problems [arXiv]
AlpacaEval: An Automatic Evaluator of Instruction-following Models [GitHub]
Parrot: Enhancing multi-turn instruction following for large language models [ACL 2024]
Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues [ACL 2024] [GitHub]
MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [EMNLP 2024] [GitHub]
M2lingual: Enhancing multilingual, multi-turn instruction alignment in large language models [NAACL 2025] [Hugging Face]
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [arXiv] [GitHub] [Hugging Face]
Instruction-following evaluation for large language models [arXiv]
FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs [ICLR 2025] [GitHub]
Fb-bench: A fine-grained multi-task benchmark for evaluating llms’ responsiveness to human feedback [EMNLP 2025] [GitHub]
Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions [Findings of ACL 2025] [GitHub]
AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability [arXiv] [GitHub]
Wilt: A multi-turn, memorization-robust inductive logic benchmark for llms [arXiv] [GitHub]
WEBLINX: real-world website navigation with multi-turn dialogue [ICML 2024]
Can Language Models Follow Multiple Turns of Entangled Instructions? [Findings of EMNLP 2025]
SysBench: Can Large Language Models Follow System Messages? [ICLR 2025] [GitHub]
SAPIENT: Mastering Multi-turn Conversational Recommendation with Strategic Planning and Monte Carlo Tree Search [NAACL 2025]
Towards empathetic conversational recommender systems [Preprint]
Clarify When Necessary: Resolving Ambiguity Through Interaction with LMs [Findings of NAACL 2025]
Teaching Language Models To Gather Information Proactively [Findings of EMNLP 2025]
StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following [Findings of ACL 2025]
TRUEBench: Can LLM Response Meet Real-world Constraints as Productivity Assistant? [Findings of EMNLP 2025]
One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework [arXiv]
TURNWISE: The Gap between Single- and Multi-turn Language Model Capabilities [arXiv]
IHEval: Evaluating Language Models on Following the Instruction Hierarchy [NAACL 2025]
Another Turn, Better Output? A Turn-Wise Analysis of Iterative LLM Prompting [arXiv]
Confidence Should Be Calibrated More Than One Turn Deep [ACL 2026]

Instruction Following in Math

Chain-of-thought prompting elicits reasoning in large language models [NeurIPS 2022]
MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions [arXiv] [GitHub]
Mathematical discoveries from program search with large language models [Preprint]
Let's verify step by step [ICLR 2023]
MathChat: Converse to Tackle Challenging Math Problems with LLM Agents [arXiv]
Zero-Shot Mathematical Problem Solving with Large Language Models via Multi-Agent Conversation Programming [AAAI Workshop 2024]
Building Math Agents with Multi-Turn Iterative Preference Learning [ICLR 2025]
MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback [ICLR 2024] [GitHub]
MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems [Findings of EMNLP 2023] [GitHub] [Hugging Face]
SBSC: Step-by-Step Coding for Improving Mathematical Olympiad Performance [arXiv]
From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench [AAAI 2026]
Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors [NAACL 2025]
Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation [arXiv]
Beyond Final Answers: Evaluating Large Language Models for Math Tutoring [Preprint]

Instruction Following in Coding

Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step by Step [Findings of ACL 2024] [GitHub]
Steering Large Language Models between Code Execution and Textual Reasoning [ICLR 2025] [GitHub] [Hugging Face]
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging [arXiv] [GitHub]
Intercode: Standardizing and benchmarking interactive coding with execution feedback [NeurIPS 2023] [GitHub]
What Makes Large Language Models Reason in (Multi-Turn) Code Generation? [ICLR 2025]
Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task [EMNLP 2018]
Program Synthesis with Large Language Models [arXiv]
Nl2bash: A corpus and semantic parser for natural language interface to the linux operating system [arXiv]
Competition-level code generation with alphacode [Preprint]
TACO: Topics in Algorithmic COde generation dataset [arXiv]
PyBench: Evaluating LLM Agent on various real-world coding tasks [arXiv] [GitHub] [Hugging Face]
CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis [ICLR 2023] [GitHub] [Hugging Face]
Codegen2: Lessons for training llms on programming and natural languages [arXiv] [GitHub] [Hugging Face]
CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance [ICML 2025] [GitHub] [Hugging Face]
Opencodeinterpreter: Integrating code generation with execution and refinement [Findings of ACL 2024] [GitHub] [Hugging Face]
Executable code actions elicit better llm agents [ICML 2024] [GitHub] [Hugging Face]
Evaluating and Enhancing LLMs for Multi-turn Text-to-SQL with Multiple Question Types [arXiv] [GitHub]
EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records [EMNLP 2024] [GitHub]
Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging [Findings of EMNLP 2024] [GitHub]
ClarifyGPT: Empowering LLM-based Code Generation with Intention Clarification [arXiv]
When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback [arXiv]
CONVCODEWORLD: Benchmarking Conversational Code Generation in Reproducible Feedback Environments [arXiv]
CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation [arXiv]
A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback [arXiv]
Benchmarking Correctness and Security in Multi-Turn Code Generation [arXiv]
Rethinking Text-to-SQL: Dynamic Multi-turn SQL Interaction for Real-world Database Exploration [arXiv]

Instruction Following in Discussion

Judging llm-as-a-judge with mt-bench and chatbot arena [NeurIPS 2023] [GitHub]
Preference Leakage: A Contamination Problem in LLM-as-a-judge [ICLR 2026]
Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings [ACL 2025]

Conversational Engagement Tasks

Conversational Engagement in General (Mixed)

Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems [ACL 2023]
BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues [Findings of NAACL 2024]
MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs [Findings of ACL 2025]
DialogBench: Evaluating LLMs as Human-like Dialogue Systems [NAACL 2024]
SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants? [arXiv]

Conversational Engagement in Roleplay

A Persona-Based Neural Conversation Model [ACL 2016]
Exploring Personalized Neural Conversational Models [Preprint]
Personalizing Dialogue Agents: I have a dog, do you have pets too? [ACL 2018]
The oscars of ai theater: A survey on role-playing with language models [arXiv]
PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits [Findings of NAACL 2024] [GitHub]
Characterchat: Learning towards conversational ai with personalized social support [arXiv] [GitHub]
Better Zero-Shot Reasoning with Role-Play Prompting [NAACL 2024] [GitHub]
Pippa: A partially synthetic conversational dataset [arXiv] [Hugging Face]
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations [EMNLP 2023] [GitHub]
PRODIGy: a PROfile-based DIalogue Generation dataset [Findings of NAACL 2024] [GitHub]
Chatharuhi: Reviving anime character in reality via large language model [arXiv] [GitHub]
CharacterGLM: Customizing Social Characters with Large Language Models [EMNLP 2024]
RoleCraft-GLM: Advancing Personalized Role-Playing in Large Language Models [Preprint]
Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment [ACL 2024] [GitHub]
Character-LLM: A Trainable Agent for Role-Playing [EMNLP 2023] [GitHub]
PersonaPKT: Building Personalized Dialogue Agents via Parameter-efficient Knowledge Transfer [Preprint]
LLMs + Persona-Plug = Personalized LLMs [ACL 2025] [Hugging Face]
Neeko: Leveraging Dynamic LoRA for Efficient Multi-Character Role-Playing Agent [EMNLP 2024] [GitHub]
Instruct Once, Chat Consistently in Multiple Rounds: An Efficient Tuning Framework for Dialogue [ACL 2024] [GitHub]
Building Persona Consistent Dialogue Agents with Offline Reinforcement Learning [EMNLP 2023]
Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations [COLING 2025]
LaMP: When Large Language Models Meet Personalization [ACL 2024] [GitHub]
CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation [ACL 2024] [GitHub]
RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models [arXiv] [GitHub]
TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models [Findings of ACL 2024] [GitHub]
SimulBench: Evaluating Language Models with Creative Simulation Tasks [Findings of NAACL 2025] [GitHub] [Hugging Face]
InCharacter: Evaluating Personality Fidelity in Role-Playing Agents through Psychological Interviews [ACL 2024] [GitHub]
SocialBench: Sociality Evaluation of Role-Playing Conversational Agents [Findings of ACL 2024] [GitHub]
Evaluating Character Understanding of Large Language Models via Character Profiling from Fictional Works [EMNLP 2024]
RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models [Findings of ACL 2024]
RAIDEN Benchmark: Evaluating Role-playing Conversational Agents with Measurement-Driven Custom Dialogues [COLING 2025]
RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing [Findings of EMNLP 2025]
Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues [arXiv]
A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations [arXiv]
Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale [arXiv] [GitHub]
CharacterBench: Benchmarking Character Customization of Large Language Models [AAAI 2025]
OpenCharacter: Training Customizable Role-Playing LLMs with Large-Scale Synthetic Personas [arXiv]
Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning [arXiv]
RoleMRC: A Fine-Grained Composite Benchmark for Role-Playing and Instruction-Following [arXiv]

Conversational Engagement in Healthcare

Huatuo: Tuning llama model with chinese medical knowledge [arXiv]
MING-MOE: Enhancing medical multi-task learning in large language models with sparse mixture of low-rank adapter experts [arXiv]
Doctorglm: Fine-tuning your chinese doctor is not a herculean task [arXiv]
A review on medical textual question answering systems based on deep learning approaches [Preprint]
Bianque: Balancing the questioning and suggestion ability of health llms with multi-turn health conversations polished by chatgpt [arXiv] [GitHub]
HuatuoGPT, Towards Taming Language Model to Be a Doctor [Findings of EMNLP 2023] [GitHub]
Automatic interactive evaluation for large language models with state aware patient simulator [arXiv] [GitHub]
DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation [arXiv] [GitHub]
MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning [Preprint] [Hugging Face] [GitHub]
The AI Doctor Is In: A Survey of Task-Oriented Dialogue Systems for Healthcare Applications [ACL 2022]
Medical Dialogue System: A Survey of Categories, Methods, Evaluation and Challenges [Findings of ACL 2024]
Clinical Camel: An Open Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding [arXiv]
United States Medical Licensing Examination Sample Test Questions [USMLE]
PubMedQA: A Dataset for Biomedical Research Question Answering [EMNLP 2019]
What disease does this patient have? a large-scale open domain question answering dataset from medical exams [Preprint]
T-Agent: A Term-Aware Agent for Medical Dialogue Generation [Preprint]
Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning [EMNLP 2025] [GitHub]
BiMediX: Bilingual Medical Mixture of Experts LLM [Findings of EMNLP 2024] [Hugging Face] [GitHub]
CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling [Findings of ACL 2024] [GitHub]
PsycoLLM: Enhancing LLM for Psychological Understanding and Evaluation [IEEE TCSS 2025] [GitHub]
SMILE: Single-turn to Multi-turn Inclusive Language Expansion via ChatGPT for Mental Health Support [Findings of EMNLP 2024] [GitHub]
PsyQA: A Chinese Dataset for Generating Long Counseling Text for Mental Health Support [Findings of ACL 2021]
Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-World Multi-Turn Dialogue [AAAI 2024] [GitHub]
Self-instruct: Aligning language models with self-generated instructions [ACL 2023]
Preliminary study on the construction of Chinese medical knowledge graph [Preprint]
HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs [arXiv] [GitHub]
Aquila-Med LLM: pioneering full-process open-source medical language models [arXiv] [Hugging Face]
Qilin-med: Multi-stage knowledge injection advanced medical large language model [arXiv]
ChiMed: A Chinese Medical Corpus for Question Answering [Preprint]
Benchmarking large language models on CMExam - a comprehensive chinese medical exam dataset [arXiv]
Towards conversational diagnostic artificial intelligence [Preprint]
Medgpteval: A dataset and benchmark to evaluate responses of large language models in medicine [arXiv]
An automatic evaluation framework for multi-turn medical consultations capabilities of large language models [arXiv]
Interactive Evaluation for Medical LLMs via Task-oriented Dialogue System [COLING 2025] [GitHub]
Medfuzz: Exploring the robustness of large language models in medical question answering [arXiv]
Healthbench: Evaluating large language models towards improved human health [arXiv]
Stop Listening to Me! How Multi-turn Conversations Can Degrade LLM Diagnostic Reasoning [arXiv]
A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations [arXiv]
Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction [arXiv]
MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors [arXiv]
Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation [arXiv]
MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-Turn Medical Consultations in Large Language Models [arXiv]
MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios? [arXiv]
MedAidDialog: A Multilingual Multi-Turn Medical Dialogue Dataset for Accessible Healthcare [arXiv]
MindEval: Benchmarking Language Models on Multi-turn Mental Health Support [arXiv]
Dr. Assistant: Enhancing Clinical Diagnostic Inquiry via Structured Diagnostic Reasoning Data and Reinforcement Learning [arXiv]
MEDPI: Evaluating AI Systems in Medical Patient-Facing Interactions [Preprint]

Conversational Engagement in Education

SocraticLM: Exploring Socratic Personalized Teaching with Large Language Models [NeurIPS 2024] [Code]
Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging [Findings of EMNLP 2024] [GitHub]
Towards the Pedagogical Steering of Large Language Models for Tutoring: A Case Study with Modeling Productive Failure [Findings of ACL 2025] [GitHub]
One Size doesn't Fit All: A Personalized Conversational Tutoring Agent for Mathematics Instruction [arXiv]
A Step Towards Adaptive Online Learning: Exploring the Role of GPT as Virtual Teaching Assistants in Online Education [Preprint]
MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems [Findings of EMNLP 2023] [GitHub] [Hugging Face]
Boosting Large Language Models with Socratic Method for Conversational Mathematics Teaching [Preprint] [GitHub]
Mathtutorbench: A benchmark for measuring open-ended pedagogical capabilities of llm tutors [EMNLP 2025] [GitHub]
CourseAssist: Pedagogically Appropriate AI Tutor for Computer Science Education [Preprint] [GitHub]
Designing Safe and Relevant Generative Chats for Math Learning in Intelligent Tutoring Systems [JEDM 2024]
Training LLM-Based Tutors to Improve Student Learning Outcomes in Dialogues [AIED 2025] [GitHub]
LearnLM is Google's new family of AI models for education [TechCrunch 2024]
Introducing Claude for Education [Anthropic 2025]
Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving [arXiv]
On Assessing the Faithfulness of LLM-generated Feedback on Student Assignments [EDM 2024]
Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors [EMNLP 2024]
Improving the validity of automatically generated feedback via reinforcement learning [AIED 2024]
Closing the Loop: Learning to Generate Writing Feedback via Language Model Simulated Student Revisions [EMNLP 2024]
Leveraging large language models to construct feedback from medical multiple-choice questions [Preprint]
LLM-generated Feedback in Real Classes and Beyond: Perspectives from Students and Instructors [EDM 2024]
LLM-Driven Feedback for Enhancing Conceptual Design Learning in Database Systems Courses [Preprint]
You're (Not) My Type - Can LLMs Generate Feedback of Specific Types for Introductory Programming Tasks? [arXiv]
On the effectiveness of LLMs for automatic grading of open-ended questions in Spanish [arXiv]
Grading exams using large language models: A comparison between human and AI grading of exams in higher education using ChatGPT [Preprint]
LLMs in Automated Essay Evaluation: A Case Study [AAAI 2024]
Generative Students: Using LLM-Simulated Student Profiles to Support Question Item Evaluation [Preprint]
Personality-aware Student Simulation for Conversational Intelligent Tutoring Systems [EMNLP 2024]
TeachTune: Reviewing Pedagogical Agents Against Diverse Student Profiles with Simulated Students [Preprint]
Simulating Classroom Education with LLM-Empowered Agents [NAACL 2025]
Exploring LLM-based Student Simulation for Metacognitive Cultivation [arXiv]
Exploring the potential of LLM to enhance teaching plans through teaching simulation [Preprint]
Book2Dial: Generating Teacher Student Interactions from Textbooks for Cost-Effective Development of Educational Chatbots [Findings of ACL 2024] [GitHub]
SAFETUTORS: Benchmarking Pedagogical Safety in AI Tutoring Systems [arXiv]
Simulated Students in Tutoring Dialogues: Substance or Illusion? [arXiv]
On the Effectiveness of Prompt-Moderated LLMs for Math Tutoring at the Tertiary Level [Findings of EMNLP 2025]
From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench [AAAI 2026]
Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors [NAACL 2025]
EduDial: Constructing a Large-Scale Multi-Turn Teacher-Student Dialogue Corpus [arXiv]
TeachLM: Post-Training LLMs for Education Using Authentic Learning Data [arXiv]
TutorBench: A Benchmark to Assess Tutoring Capabilities of Large Language Models [arXiv]
ConvoLearn: A Learning Sciences Grounded Dataset for Fine-Tuning Dialogic AI Tutors [arXiv]

Conversational Engagement in Jailbreak

Universal and transferable adversarial attacks on aligned language models [arXiv]
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [ICLR 2024]
Are aligned neural networks adversarially aligned? [NeurIPS 2023]
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts [arXiv]
Great, now write an article about that: The crescendo Multi-Turn LLM jailbreak attack [arXiv]
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues [arXiv] [GitHub] [Hugging Face]
Reassembling the social: An introduction to actor-network-theory [Preprint]
Emerging vulnerabilities in frontier models: Multi-turn jailbreak attacks [arXiv] [Hugging Face]
Speak out of turn: Safety vulnerability of large language models in multi-turn dialogue [arXiv]
Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models [arXiv]
MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue [arXiv]
RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking [arXiv] [GitHub]
HarmBench: a standardized evaluation framework for automated red teaming and robust refusal [ICML 2024]
When "competency" in reasoning opens the door to vulnerability: Jailbreaking llms via novel complex ciphers [arXiv] [GitHub]
On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning [ACL 2023]
Jailbreakbench: An open robustness benchmark for jailbreaking large language models [NeurIPS 2024]
Beavertails: Towards improved safety alignment of llm via a human-preference dataset [NeurIPS 2023]
Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models [Findings of ACL 2022]
Cosafe: Evaluating large language model safety in multi-turn dialogue coreference [EMNLP 2024]
Chain-of-thought prompting elicits reasoning in large language models [NeurIPS 2022]
Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails [EMNLP 2023]
Foot-In-The-Door: A Multi-turn Jailbreak for LLMs [EMNLP 2025]
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [Findings of EMNLP 2025]
Persona Jailbreaking in Large Language Models [arXiv]
A Multi-Turn Framework for Evaluating AI Misuse in Fraud and Cybercrime Scenarios [arXiv]
SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks [arXiv]
Many-Turn Jailbreaking [arXiv]
SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks [arXiv]
X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Jailbreak Attacks without Compromising Usability [Findings of EMNLP 2025]
The Echo Chamber Multi-Turn LLM Jailbreak [arXiv]
Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models [arXiv]
Multi-Turn Jailbreaks Are Simpler Than They Seem [arXiv]
Tempest: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search [ACL 2025]
Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors [arXiv]

Improvement Methods

Multi-Round Communication

In-Context Learning

A Survey on In-context Learning [EMNLP 2024]
AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability [arXiv] [GitHub]
Judging llm-as-a-judge with mt-bench and chatbot arena [NeurIPS 2023] [GitHub]
Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues [ACL 2024] [GitHub]
Intercode: Standardizing and benchmarking interactive coding with execution feedback [NeurIPS 2023] [GitHub]
Chain-of-thought prompting elicits reasoning in large language models [NeurIPS 2022]
Clinical Camel: An Open Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding [arXiv]
When "A Helpful Assistant" Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models [Findings of EMNLP 2024]
Characterchat: Learning towards conversational ai with personalized social support [arXiv] [GitHub]
Towards the Pedagogical Steering of Large Language Models for Tutoring: A Case Study with Modeling Productive Failure [Findings of ACL 2025] [GitHub]
GraphIF: Enhancing Multi-Turn Instruction Following for Large Language Models with Relation Graph Prompt [AAAI 2025]
A State-Update Prompting Strategy for Efficient and Robust Multi-Turn Dialogue [arXiv]

Supervised Fine-Tuning

Training language models to follow instructions with human feedback [NeurIPS 2022]
Scaling instruction-finetuned language models [JMLR 2024]
AdapterDrop: On the Efficiency of Adapters in Transformers [EMNLP 2021]
Lora: Low-rank adaptation of large language models. [arXiv]
Instruction tuning for large language models: A survey [arXiv]
MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback [ICLR 2024] [GitHub]
Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues [ACL 2024] [GitHub]
M2lingual: Enhancing multilingual, multi-turn instruction alignment in large language models [NAACL 2025] [Hugging Face]
MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions [arXiv] [GitHub]
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations [EMNLP 2023] [GitHub]
PRODIGy: a PROfile-based DIalogue Generation dataset [Findings of NAACL 2024] [GitHub]
Clinical Camel: An Open Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding [arXiv]
HuatuoGPT, Towards Taming Language Model to Be a Doctor [Findings of EMNLP 2023] [GitHub]
DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation [arXiv] [GitHub]
Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-World Multi-Turn Dialogue [AAAI 2024] [GitHub]
An automatic evaluation framework for multi-turn medical consultations capabilities of large language models [arXiv]
Qilin-med: Multi-stage knowledge injection advanced medical large language model [arXiv]
Aquila-Med LLM: pioneering full-process open-source medical language models [arXiv] [Hugging Face]
BiMediX: Bilingual Medical Mixture of Experts LLM [Findings of EMNLP 2024] [Hugging Face] [GitHub]
Book2Dial: Generating Teacher Student Interactions from Textbooks for Cost-Effective Development of Educational Chatbots [Findings of ACL 2024] [GitHub]
MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems [Findings of EMNLP 2023] [GitHub] [Hugging Face]
Training LLM-Based Tutors to Improve Student Learning Outcomes in Dialogues [AIED 2025] [GitHub]
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality [LMSYS 2023]
PlatoLM: Teaching LLMs in Multi-Round Dialogue via a User Simulator [ACL 2024]
Parrot: Enhancing multi-turn instruction following for large language models [ACL 2024]
Training Deep Nets with Sublinear Memory Cost [arXiv]
Flashattention: Fast and memory-efficient exact attention with io-awareness [NeurIPS 2022]
Chatglm: A family of large language models from glm-130b to glm-4 all tools [arXiv]
Fast transformer decoding: One write-head is all you need [arXiv]
Fine-tuning LLMs for multi-turn dialogues: optimizing cross-entropy loss with KL divergence for all rounds of responses [ICML 2024]
CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance [ICML 2025] [GitHub] [Hugging Face]
ConsistentChat: Building Skeleton-Guided Consistent Multi-Turn Dialogues for Large Language Models from Scratch [arXiv]
ReSURE: Regularizing Supervision Unreliability for Multi-turn Dialogue Fine-tuning [arXiv]
Data Selection for Multi-turn Dialogue Instruction Tuning [arXiv]
Prefix-Enhanced Large Language Models with Reused Training Data in Multi-Turn Medical Dialogue [Preprint]
DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities [arXiv]

Reinforcement Learning

Training language models to follow instructions with human feedback [NeurIPS 2022]
Constitutional AI: Harmlessness from AI Feedback [arXiv]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model [arXiv]
Parrot: Enhancing multi-turn instruction following for large language models [ACL 2024]
Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training [ICLR 2025]
Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues [ACL 2024] [GitHub]
MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback [ICLR 2024] [GitHub]
WEBLINX: real-world website navigation with multi-turn dialogue [ICML 2024]
SAPIENT: Mastering Multi-turn Conversational Recommendation with Strategic Planning and Monte Carlo Tree Search [NAACL 2025]
MathChat: Converse to Tackle Challenging Math Problems with LLM Agents [arXiv]
Intercode: Standardizing and benchmarking interactive coding with execution feedback [NeurIPS 2023] [GitHub]
CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance [ICML 2025] [GitHub] [Hugging Face]
DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation [arXiv] [GitHub]
Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-World Multi-Turn Dialogue [AAAI 2024] [GitHub]
Qilin-med: Multi-stage knowledge injection advanced medical large language model [arXiv]
HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs [arXiv] [GitHub]
Improving the validity of automatically generated feedback via reinforcement learning [AIED 2024]
Direct Multi-Turn Preference Optimization for Language Agents [EMNLP 2024]
Building Math Agents with Multi-Turn Iterative Preference Learning [ICLR 2025]
KTO: Model alignment as prospect theoretic optimization [ICML 2024]
ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL [ICML 2024]
Training Language Models to Self-Correct via Reinforcement Learning [ICLR 2025]
Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF [ICLR 2025]
LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [ICML 2025]
SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks [arXiv]
Implicit Turn-Wise Policy Optimization for Proactive User-LLM Interaction [arXiv]
MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming [ACL 2025]

New Architectures

Faith and fate: Limits of transformers on compositionality [NeurIPS 2023]
Cached Transformers: Improving Transformers with Differentiable Memory Cache [AAAI 2024]
Stateful Memory-Augmented Transformers for Efficient Dialogue Modeling [Preprint]
Transformer-XL: Attentive Language Models beyond a Fixed-Length Context [ACL 2019]
Recurrent memory transformer [NeurIPS 2022]
HMT: Hierarchical Memory Transformer for Efficient Long Context Language Processing [NAACL 2025]
RWKV: Reinventing RNNs for the Transformer Era [Findings of EMNLP 2023]
Enhancing RWKV-based Language Models for Long-Sequence Text Generation [arXiv]

Agent-Based Approaches

Single-Agent Approaches

ReAct: Synergizing Reasoning and Acting in Language Models [ICLR 2023]
HotpotQA: A dataset for diverse, explainable multi-hop question answering [EMNLP 2018]
FEVER: a large-scale dataset for fact extraction and VERification [Preprint]
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning [ICLR 2021]
Webshop: Towards scalable real-world web interaction with grounded language agents [NeurIPS 2022]
Toolformer: Language Models Can Teach Themselves to Use Tools [arXiv]
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face [arXiv]
Reflexion: language agents with verbal reinforcement learning [arXiv]
Voyager: An Open-Ended Embodied Agent with Large Language Models [arXiv]
AgentBench: Evaluating LLMs as Agents [ICLR 2024]

Multi-Agent Approaches

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society [arXiv]
ChatDev: Communicative Agents for Software Development [ACL 2024]
Self-collaboration code generation via chatgpt [Preprint]
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework [ICLR 2024]
Multi-LLM Collaborative Search for Complex Problem Solving [arXiv]
Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning [ICLR 2025]
Improving Factuality and Reasoning in Language Models through Multiagent Debate [ICML 2024]
Generative agents: Interactive simulacra of human behavior [Preprint]
AutoAgents: A Framework for Automatic Agent Generation [arXiv]
AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors [ICLR 2024]
LLM multi-agent systems: Challenges and open problems [arXiv]
Why Do Multi-Agent LLM Systems Fail? [arXiv]
Multi-agent risks from advanced AI [arXiv]

External Information Integration

Memory-Augmented Methods

Memory-assisted prompt editing to improve GPT-3 after deployment [EMNLP 2022]
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory [ICLR 2025]
Position: Episodic Memory is the Missing Piece for Long-Term LLM Agents [arXiv]
From Isolated Conversations to Hierarchical Schemas: Dynamic Tree Memory Representation for LLMs [ICLR 2025]
In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents [ACL 2025]
HyperMem: Hypergraph Memory for Long-Term Conversations [arXiv]
HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues [Preprint]
TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation [arXiv]
A Persona-Aware LLM-Enhanced Framework for Multi-Session Personalized Dialogue Generation [Findings of ACL 2025]

Retrieval-Augmented Generation

Retrieval-augmented generation for knowledge-intensive nlp tasks [NeurIPS 2020]
Wizard of wikipedia: Knowledge-powered conversational agents [arXiv]
Internet-Augmented Dialogue Generation [ACL 2022]
Dense Passage Retrieval for Open-Domain Question Answering [EMNLP 2020]
Beyond Goldfish Memory: Long-Term Open-Domain Conversation [ACL 2022]
MTRAG: A multi-turn conversational benchmark for evaluating retrieval-augmented generation systems [Preprint]
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation [Findings of NAACL 2025]
RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues [NAACL 2025]
DH-RAG: A Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue [arXiv]
Comprehensive Comparison of RAG Methods Across Multi-Domain Conversational QA [arXiv]
CID-GraphRAG: Enhancing Multi-Turn Dialogue Systems through Dual-Pathway Retrieval of Conversation Flow and Context Semantics [arXiv]

Knowledge Graph Integration

Multi-turn Response Selection with Commonsense-enhanced Language Models [arXiv]
Integrating Large Language Models with Graph-based Reasoning for Conversational Question Answering [arXiv]
Leveraging medical knowledge graphs into large language models for diagnosis prediction: Design and application study [Preprint]
Wikidata as a knowledge graph for the life sciences [Preprint]
The Unified Medical Language System (UMLS): integrating biomedical terminology [Preprint]
Knowledge Graph-Augmented Language Models for Knowledge-Grounded Dialogue Generation [arXiv]
Paths-over-Graph: Knowledge Graph Empowered Large Language Model Reasoning [The Web Conf 2025]
GARLIC: LLM-Guided Dynamic Progress Control with Hierarchical Weighted Graph for Long Document QA [arXiv]
GNN-RAG: Graph Neural Retrieval for Efficient Large Language Model Reasoning on Knowledge Graphs [Findings of ACL 2025]
Pseudo-Knowledge Graph: Meta-Path Guided Retrieval and In-Graph Text for RAG-Equipped LLM [arXiv]
Improving LLM's Attachment to External Knowledge In Dialogue Generation Tasks Through Entity Anonymization [arXiv]
D-SMART: Enhancing LLM Dialogue Consistency via Dynamic Structured Memory And Reasoning Tree [arXiv]
GAP: Graph-Assisted Prompts for Dialogue-based Medication Recommendation [arXiv]
CID-GraphRAG: Enhancing Multi-Turn Dialogue Systems through Dual-Pathway Retrieval of Conversation Flow and Context Semantics [arXiv]

Open Challenges

In our survey paper on multi-turn interactions and tasks for large language models (LLMs), we categorize a wide range of tasks, including instruction-following scenarios and more complex conversational engagement tasks. To complement this, we also include an illustration highlighting key open challenges in this domain. If you're interested in the detailed improvement methods and a deeper discussion of the open challenges, please refer to our Full Paper.
