Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present PAPRIKA, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on synthetic interaction data from different tasks that require diverse strategies, PAPRIKA teaches models to explore and adapt their behavior to a new task in-context, based on environment feedback, without further gradient updates. Experimental results show that models fine-tuned with PAPRIKA can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. Unlike traditional training, the primary bottleneck of our approach lies in sampling useful interaction data rather than in model updates. To improve sample efficiency, we propose a curriculum learning strategy that prioritizes sampling trajectories from tasks with high learning potential. These results suggest a promising path towards AI systems that can autonomously solve novel sequential decision-making problems that require interaction with the external world.
The goal of our paper is to develop a scalable method for instilling better strategic exploration and sequential decision-making capabilities in LLMs. Prior work (Krishnamurthy et al., 2024) has shown that LLMs can perform poorly even on the simple decision-making task of multi-armed bandits. Nie et al. (2024) have since demonstrated that LLMs can be taught to perform better on bandits by fine-tuning them on synthetic trajectories generated by known algorithms such as UCB. However, as we discuss in the paper, this idea is limited in scope for three reasons.
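To make this setup concrete, below is a minimal, self-contained sketch of the kind of UCB trajectory generation on a Bernoulli bandit that such fine-tuning relies on. This is illustrative only, not the paper's code; `arm_means`, `horizon`, and the exploration constant `c` are arbitrary choices.

```python
import math
import random

def ucb_trajectory(arm_means, horizon, c=2.0):
    """Run UCB1 on a Bernoulli bandit and record the (arm, reward) sequence.

    arm_means are the true success probabilities (unknown to the agent);
    the returned sequence is the kind of synthetic trajectory an LLM
    could be fine-tuned on.
    """
    n_arms = len(arm_means)
    counts = [0] * n_arms    # pulls per arm
    means = [0.0] * n_arms   # empirical mean reward per arm
    trajectory = []
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1      # pull each arm once to initialize
        else:
            # UCB1 index: empirical mean plus an exploration bonus that
            # shrinks as an arm is pulled more often.
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(c * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean update
        trajectory.append((arm, reward))
    return trajectory

# Example: a 3-armed bandit played for 100 steps.
print(ucb_trajectory([0.2, 0.5, 0.8], horizon=100)[-5:])
```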
(Main Components of PAPRIKA) The PAPRIKA framework consists of task construction, task selection, generating good exploration behavior via diversity-encouraging sampling from an LLM, and finally training the LLM on the self-generated trajectories that attained high scores.
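In code form, one round of this pipeline might look roughly like the sketch below. Every helper and threshold here is a hypothetical stand-in, not PAPRIKA's actual API: the real system runs multi-turn LLM rollouts scored by task-specific success criteria.

```python
import random

def construct_tasks(task_groups, per_group=4):
    # Task construction + selection, stubbed as simple string IDs.
    return [f"{group}-task-{i}" for group in task_groups for i in range(per_group)]

def sample_trajectory(llm, task, temperature):
    # Stand-in for one stochastic multi-turn rollout; high-temperature
    # sampling encourages diverse exploration behavior across rollouts.
    return {"task": task, "score": random.random()}

def paprika_round(llm, task_groups, n_samples=4, temperature=1.0, threshold=0.9):
    kept = []
    for task in construct_tasks(task_groups):
        # Diversity-encouraging sampling: several rollouts per task.
        rollouts = [sample_trajectory(llm, task, temperature) for _ in range(n_samples)]
        best = max(rollouts, key=lambda r: r["score"])
        if best["score"] >= threshold:  # keep only high-scoring trajectories
            kept.append(best)
    return kept  # these trajectories would then be used to fine-tune the LLM

print(len(paprika_round(llm=None, task_groups=["twenty-questions", "wordle"])))
```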
(PAPRIKA improves success rate on a diverse range of task groups) Average success rate on 6 representative task groups, with shaded areas representing standard error over 3 random seeds. PAPRIKA improves performance on all six after fine-tuning on only roughly 22,500 trajectories in total.
(Testing generalization of PAPRIKA via leave-one-out experiments) We test PAPRIKA's zero-shot performance on unseen task groups via leave-one-out (LOO) experiments, where we train the LLM on trajectories from every task group except the one we test on. All evaluations are done at temperature 0.7, and we report the average success rate. Our experiments demonstrate that PAPRIKA can teach an LLM sequential decision-making abilities that often transfer well to new tasks without any additional training.
(Multi-round training with curriculum on twenty questions) We demonstrate the efficacy of our curriculum learning algorithm for sampling training tasks by comparing its performance against uniform sampling in multi-round training. All evaluations are done at temperature 0.7, and shaded regions represent standard error over 3 seeds. (Left) Average success rate at each round. (Middle) Pass@4 success rate at each round. (Right) Success rate on each of the easy, medium, and hard task groups. Overall, our curriculum learning algorithm shows 1.4% and 3.3% improvements over the uniform sampling baseline in average and pass@4 success rate, respectively.
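Below is a minimal sketch of one way such a curriculum could be implemented, assuming learning potential is approximated by the variance of each task's empirical success rate; the paper's actual proxy may differ, and `sample_training_tasks` is an illustrative name.

```python
import random

def learning_potential(p):
    # One simple proxy: the variance of a Bernoulli success, p * (1 - p),
    # peaks for tasks of intermediate difficulty, i.e., those the model
    # can plausibly still learn the most from.
    return p * (1.0 - p)

def sample_training_tasks(success_rates, k, rng=random):
    """Draw k task indices with probability proportional to estimated
    learning potential, rather than uniformly at random."""
    weights = [learning_potential(p) for p in success_rates]
    return rng.choices(range(len(success_rates)), weights=weights, k=k)

# Nearly-solved (0.95) and nearly-impossible (0.05) tasks are rarely
# drawn; the task near 50% success dominates the training batch.
print(sample_training_tasks([0.05, 0.50, 0.95], k=20))
```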
(Qualitative analysis of behaviors taught by PAPRIKA) PAPRIKA not only improves over the starting model in terms of success rate and other metrics, but also exhibits better sequential decision-making behavior. Here we present two example trajectories on twenty questions, where the hidden topic the agent has to guess is a concept, and the correct answer is prime numbers. The PAPRIKA fine-tuned model asks much higher-quality questions and guesses the topic within 8 turns, whereas the regular instruct model fails to guess it after spending all 20 turns, in all 4 attempts that we ran (we only show the first 9 turns for brevity).
@misc{tajwar2025traininggenerallycuriousagent,
  title={Training a Generally Curious Agent},
  author={Fahim Tajwar and Yiding Jiang and Abitha Thankaraj and Sumaita Sadia Rahman and J Zico Kolter and Jeff Schneider and Ruslan Salakhutdinov},
  year={2025},
  eprint={2502.17543},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.17543},
}