Training a Generally Curious Agent

[Teaser figure: overview of PAPRIKA]

(Overview of PAPRIKA) We design a diverse set of tasks where an LLM agent needs strategic information gathering to succeed, then train an LLM on self-generated data to prefer higher performing trajectories. The resulting behavior learned by PAPRIKA can transfer zero-shot to unseen tasks, showcasing its potential to build general decision making agents.

Abstract

Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present PAPRIKA, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on synthetic interaction data from different tasks that require diverse strategies, PAPRIKA teaches models to explore and adapt their behavior on a new task based on environment feedback in-context without more gradient updates. Experimental results show that models fine-tuned with PAPRIKA can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. Unlike traditional training, our approach's primary bottleneck lies in sampling useful interaction data instead of model updates. To improve sample efficiency, we propose a curriculum learning strategy that prioritizes sampling trajectories from tasks with high learning potential. These results suggest a promising path towards AI systems that can autonomously solve novel sequential decision-making problems that require interactions with the external world.


Problem Setting

The goal of our paper is to develop a scalable method to instill better strategic exploration and sequential decision-making capabilities into LLMs. Prior work (Krishnamurthy et al., 2024) has shown that LLMs can perform poorly even on the simple decision-making task of multi-armed bandits. Nie et al. (2024) have since demonstrated that LLMs can be taught to perform better on bandits by fine-tuning them on synthetic trajectories generated by known algorithms such as UCB (a minimal sketch of such UCB-generated trajectories follows the list below). However, this idea is limited in scope for three reasons:

  • (1) We want LLMs to perform strategic exploration and decision making in settings more complex than bandits
  • (2) For most tasks, there is no known algorithm like UCB from which to generate good synthetic trajectories
  • (3) It can be infeasible to collect data for every task we care about
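For concreteness, the following is a minimal sketch of what UCB-generated bandit trajectories look like. The Bernoulli environment, horizon, and exploration constant are illustrative assumptions, not the exact setup used by Nie et al. (2024).

    import math
    import random

    def ucb_trajectory(arm_probs, horizon=50, c=2.0, seed=0):
        """Roll out UCB1 on a Bernoulli bandit and return the (arm, reward) trajectory.

        Such trajectories, rendered as text, are the kind of synthetic teaching data
        described above; the hyperparameters here are illustrative assumptions.
        """
        rng = random.Random(seed)
        n_arms = len(arm_probs)
        counts = [0] * n_arms
        means = [0.0] * n_arms
        trajectory = []
        for t in range(1, horizon + 1):
            if t <= n_arms:
                arm = t - 1  # pull every arm once before trusting confidence bounds
            else:
                ucb = [means[i] + math.sqrt(c * math.log(t) / counts[i]) for i in range(n_arms)]
                arm = max(range(n_arms), key=lambda i: ucb[i])
            reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
            counts[arm] += 1
            means[arm] += (reward - means[arm]) / counts[arm]  # running mean update
            trajectory.append((arm, reward))
        return trajectory

    print(ucb_trajectory([0.2, 0.5, 0.8])[:5])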

PAPRIKA

PAPRIKA aims to solve the above problems. First, we design a suite of complex decision-making tasks that require strategic information gathering to succeed. Next, we show that in the absence of known good algorithms, existing LLMs can generate trajectories with better decision-making behaviors through diversity-encouraging sampling (a rough sketch appears below the figure). We then fine-tune the LLMs to prefer higher-performing trajectories (in a fashion similar to STaR) and show that this leads to better decision-making abilities at test time. More importantly, these behaviors often generalize to unseen task groups without additional training. Finally, we propose a general curriculum learning algorithm that dynamically chooses which subset of tasks to train on next, improving the data efficiency of such training methods. We describe each component of PAPRIKA below.
[Figure: main components of PAPRIKA]

(Main Components of PAPRIKA) The PAPRIKA framework consists of task construction, task selection, generating good exploration behavior via diversity encouraging sampling from an LLM, and finally training the LLM on self-generated trajectories that attained high scores.
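As a rough illustration of diversity-encouraging sampling, the sketch below combines a raised temperature with min-p filtering at each decoding step. The specific mechanism and hyperparameter values are assumptions for illustration, not necessarily the exact sampler used in PAPRIKA.

    import torch

    def min_p_sample(logits, temperature=1.5, min_p=0.1):
        """Sample one next token with high temperature plus min-p filtering.

        High temperature encourages diverse trajectories across rollouts; the min-p
        filter (keep tokens whose probability is at least min_p times the maximum
        probability) prunes degenerate low-probability continuations. The values of
        temperature and min_p here are illustrative assumptions.
        """
        probs = torch.softmax(logits / temperature, dim=-1)
        keep = probs >= min_p * probs.max()
        probs = torch.where(keep, probs, torch.zeros_like(probs))
        probs = probs / probs.sum()
        return torch.multinomial(probs, num_samples=1)

In practice this per-token rule would be applied through the model's generation API when sampling multiple trajectories per task.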

Task Design

To both evaluate and train LLMs, we design 10 diverse textual task groups, each of which consists of partially observable tasks that require multi-turn interaction with the task environment, strategic exploration, and good sequential decision-making for an agent to succeed. Below is a summary of these task groups.
[Figure: summary of the 10 task groups]
We generate 20 samples per task in the training split and collect all successful trajectories to form our supervised fine-tuning dataset. Next, we take the best trajectory per task (successful, with the fewest turns) and one of the lower-scoring trajectories (unsuccessful, or successful but requiring significantly more turns) to form a preference pair. Ultimately, we end up with 17,181 SFT trajectories and 5,260 trajectory preference pairs.
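A rough sketch of this data construction step is shown below. The Trajectory container, the success-plus-turn-count scoring, and the turn-gap threshold for what counts as "significantly more turns" are simplified assumptions, not the exact pipeline.

    from dataclasses import dataclass

    @dataclass
    class Trajectory:
        task_id: str
        messages: list      # full multi-turn conversation with the environment
        success: bool
        num_turns: int

    def build_datasets(trajectories_per_task, turn_gap=5):
        """Form SFT data from successful rollouts and one preference pair per task.

        trajectories_per_task: dict mapping task_id -> list of Trajectory (e.g.,
        20 sampled rollouts per task). The turn_gap threshold is a hypothetical
        stand-in for "significantly more turns".
        """
        sft_data, preference_pairs = [], []
        for task_id, rollouts in trajectories_per_task.items():
            successes = [t for t in rollouts if t.success]
            sft_data.extend(successes)  # all successful trajectories go to SFT
            if not successes:
                continue
            best = min(successes, key=lambda t: t.num_turns)
            # "worse": failed, or succeeded with notably more turns than the best
            worse = [t for t in rollouts
                     if (not t.success) or t.num_turns >= best.num_turns + turn_gap]
            if worse:
                preference_pairs.append({"chosen": best, "rejected": worse[0]})
        return sft_data, preference_pairs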

Optimization

We use a multiturn variant of supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) to fine-tune our models: log probabilities are calculated autoregressively over the entire trajectory, but only the log probabilities of the agent-generated tokens contribute to the training loss. In practice, we first run supervised fine-tuning, and then optimize our LLMs with the sum of the SFT and DPO losses (similar to RPO).
[Equation: PAPRIKA training objective]
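A plausible rendering of this objective in standard DPO/RPO notation is given below; the masking indicator m_t, the SFT weight α, and the exact symbols are assumptions rather than a transcription of the original equation.

    % Trajectory log-likelihood, masked to agent-generated tokens only
    \log \pi_\theta(\tau) = \sum_{t} m_t \, \log \pi_\theta(y_t \mid y_{<t}),
    \qquad m_t = \mathbb{1}[\, y_t \text{ is generated by the agent} \,]

    % DPO loss on a preference pair \tau_w \succ \tau_l
    \mathcal{L}_{\mathrm{DPO}}(\theta) =
      -\log \sigma\!\left(
         \beta \log \frac{\pi_\theta(\tau_w)}{\pi_{\mathrm{ref}}(\tau_w)}
       - \beta \log \frac{\pi_\theta(\tau_l)}{\pi_{\mathrm{ref}}(\tau_l)}
      \right)

    % Combined RPO-style objective: DPO plus an SFT term on the preferred trajectory
    \mathcal{L}(\theta) = \mathcal{L}_{\mathrm{DPO}}(\theta) - \alpha \, \log \pi_\theta(\tau_w)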

Scalable Online Curriculum Learning Algorithm

PAPRIKA's primary bottleneck lies in generating training trajectories rather than in model updates, so it is crucial to spend rollouts on tasks with high learning potential.
However, it is hard to know which tasks have high learning potential without first generating rollouts. To make progress, we make the additional assumption that similar tasks have similar learning potential, and that we have task similarity groups over our set of all tasks. We then use the coefficient of variation as a metric for learning potential. Given a task group, we sample one task from it, generate multiple trajectories for this task, and calculate their coefficient of variation (CoV) over the number of turns, which we use as a proxy for reward. We then use this task's CoV to update our estimate of the task group's CoV distribution, which tells us which task group to sample from next if we want to maximize the CoV of our sampled tasks.
[Figure: coefficient of variation as a measure of learning potential]
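In symbols, for a task τ with k sampled trajectories and per-trajectory scores r_1, …, r_k (here, the turn-count proxy for reward described above), the coefficient of variation is the ratio of the standard deviation of the scores to their mean (written here with the population form of the standard deviation for simplicity):

    \mu_\tau = \frac{1}{k}\sum_{i=1}^{k} r_i, \qquad
    \sigma_\tau = \sqrt{\frac{1}{k}\sum_{i=1}^{k} \left(r_i - \mu_\tau\right)^2}, \qquad
    \mathrm{CoV}(\tau) = \frac{\sigma_\tau}{\mu_\tau}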
Formally, once we have a metric for measuring learning potential, we naturally want to maximize the learning potential of our sampled dataset. We can treat this as a multi-armed bandit (MAB) problem and employ the Upper Confidence Bound (UCB) algorithm to sample tasks from the collection of task groups.
[Algorithm: UCB-based curriculum over task groups]
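A minimal sketch of such a UCB-style curriculum is given below, treating each task group as an arm whose observed reward is the CoV of a freshly sampled task from that group. The exploration constant and the running-mean update are illustrative assumptions.

    import math

    class CurriculumUCB:
        """UCB over task groups: arms are groups, rewards are observed CoV values."""

        def __init__(self, group_names, c=1.0):
            self.groups = list(group_names)
            self.c = c
            self.counts = {g: 0 for g in self.groups}
            self.mean_cov = {g: 0.0 for g in self.groups}
            self.total_pulls = 0

        def select_group(self):
            # Pull each group once before trusting the confidence bounds
            for g in self.groups:
                if self.counts[g] == 0:
                    return g
            def ucb(g):
                bonus = self.c * math.sqrt(math.log(self.total_pulls) / self.counts[g])
                return self.mean_cov[g] + bonus
            return max(self.groups, key=ucb)

        def update(self, group, observed_cov):
            # Running-mean update of the group's estimated CoV
            self.counts[group] += 1
            self.total_pulls += 1
            self.mean_cov[group] += (observed_cov - self.mean_cov[group]) / self.counts[group]

At each step one would call select_group(), sample a task from that group, roll out several trajectories, compute the CoV of their scores, and feed the result back through update().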

Empirical Results

PAPRIKA improves LLM decision making abilities

Training on 10 diverse task groups results in performance improvements on held-out tasks from each group.
[Figure: success rates on 6 representative task groups]

(PAPRIKA improves success rate on a diverse range of task groups) Average success rate on 6 representative task groups, with shaded areas representing standard error over 3 random seeds. PAPRIKA improves performance on all of them after fine-tuning on only roughly 22,500 total trajectories.

PAPRIKA can teach LLMs strategies that generalize zero-shot to a new task group

The next important question we study is whether the strategies learned by PAPRIKA can transfer zero-shot to entirely different groups of tasks. To do so, we perform a set of leave-one-out (LOO) experiments: we randomly choose one group (e.g., 20 questions) from our set of environments, train the LLM on trajectories generated from every other group, and test the resulting model's performance on the left-out group. Our results show that PAPRIKA (LOO) outperforms the starting model, Llama-3.1-8B-Instruct, on 6 representative task groups. This shows that the strategic exploration taught by PAPRIKA is not tied to a particular environment, and that scaling up the number of task groups in PAPRIKA can be a viable path towards teaching LLMs general in-context RL abilities.
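For illustration only, the LOO split construction amounts to the following; the dictionary-of-task-groups representation is a hypothetical stand-in for the actual data pipeline.

    def leave_one_out_splits(task_groups):
        """Yield (held_out_group, training_groups) pairs for LOO evaluation.

        task_groups: dict mapping a group name to its list of tasks. This is a
        hypothetical representation used purely to illustrate the protocol.
        """
        for held_out in task_groups:
            train = {g: tasks for g, tasks in task_groups.items() if g != held_out}
            yield held_out, train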
[Figure: leave-one-out generalization results]

(Testing generalization of PAPRIKA via leave-one-out experiments) We test PAPRIKA's zero-shot performance on unseen task groups via leave-one-out (LOO) experiments, where we train the LLM on trajectories from every task group except the group we test on. All evaluations are done at temperature 0.7, and we report average success rate. Our experiments demonstrate that PAPRIKA can teach an LLM sequential decision-making abilities that often transfer well to new tasks without any additional training.

Curriculum Learning Can Improve PAPRIKA's Sample Complexity

We test our curriculum learning algorithm on 20 questions, using a GPT-4o-mini-defined clustering of the hidden topics into easy, medium, and hard task groups. Our curriculum outperforms uniform sampling over 3 rounds of iterative training in terms of both average and pass@4 success rate, showcasing its potential for improving the sample complexity of such methods and potentially extending to online RL.
[Figure: multi-round curriculum training on twenty questions]

(Multi-round training with curriculum on twenty questions) We demonstrate the efficacy of our curriculum learning algorithm for sampling training tasks by comparing its performance against uniform sampling for multi-round training. All evaluations are done at temperature 0.7, and shaded regions represent standard error over 3 seeds. (Left) Average success rate at each round. (Middle) Pass@4 success rate at each round. (Right) Success rate for each of the easy, medium, and hard task groups. Overall, our curriculum learning algorithm shows 1.4% and 3.3% improvements over the uniform sampling baseline in average and pass@4 success rate, respectively.

Example Trajectories

Here we list two example trajectories on the 20 questions task group, one from the regular Llama-3.1-8B-Instruct model and another from the PAPRIKA fine-tuned model. The hidden topic the agent has to guess is a concept, and the correct answer is prime numbers. We see qualitatively that PAPRIKA teaches the model to ask better-quality questions.
[Figure: example trajectories on twenty questions]

(Qualitative analysis of behaviors taught by PAPRIKA) PAPRIKA not only improves over the starting model in terms of success rate and other metrics, but also shows better sequential decision-making. Here we present two example trajectories on 20 questions, where the hidden topic the agent has to guess is a concept and the correct answer is prime numbers. The PAPRIKA fine-tuned model asks much higher-quality questions and guesses the topic in 8 turns, whereas the regular instruct model cannot guess it after spending all 20 turns in any of the 4 attempts we ran (we show only the first 9 turns for brevity).


BibTeX

 
      @misc{tajwar2025traininggenerallycuriousagent,
        title={Training a Generally Curious Agent}, 
        author={Fahim Tajwar and Yiding Jiang and Abitha Thankaraj and Sumaita Sadia Rahman and J Zico Kolter and Jeff Schneider and Ruslan Salakhutdinov},
        year={2025},
        eprint={2502.17543},
        archivePrefix={arXiv},
        primaryClass={cs.LG},
        url={https://arxiv.org/abs/2502.17543}, 
      }