VSC-RL:
ADVANCING AUTONOMOUS VISION-LANGUAGE AGENTS WITH VARIATIONAL SUBGOAL-CONDITIONED REINFORCEMENT LEARNING

Qingyuan Wu1 * †, Jianheng Liu2 *, Jianye Hao2 3, Jun Wang4, Kun Shao2 †
1 University of Liverpool
2 Huawei Noah's Ark Lab
3 Tianjin University
4 University College London

* Equal Contribution

† Corresponding authors: qingwu2@liverpool.ac.uk, shaokun2@huawei.com

Qualitative example of VSC-RL on the Web Shopping task.

Abstract

State-of-the-art (SOTA) reinforcement learning (RL) methods enable vision-language agents to learn from interactions with the environment without human supervision. However, they learn inefficiently on real-world complex sequential decision-making tasks, especially those with sparse reward signals and long-horizon dependencies. To address this issue, we introduce Variational Subgoal-Conditioned RL (VSC-RL), which reformulates the vision-language sequential decision-making task as a variational goal-conditioned RL problem, allowing us to leverage advanced optimization methods to enhance learning efficiency. Specifically, VSC-RL optimizes the SubGoal-Conditioned Evidence Lower BOund (SGC-ELBO), which consists of (a) maximizing the subgoal-conditioned return via RL and (b) minimizing the subgoal-conditioned difference from the reference policy. We theoretically demonstrate that the SGC-ELBO is equivalent to the original optimization objective, ensuring improved learning efficiency without sacrificing performance guarantees. Additionally, for real-world complex decision-making tasks, VSC-RL leverages a vision-language model to autonomously decompose the goal into feasible subgoals, enabling efficient learning. Across various benchmarks, including challenging real-world mobile device control tasks, VSC-RL significantly outperforms SOTA vision-language agents, achieving superior performance and remarkable improvements in learning efficiency.

Our Approach: VSC-RL

We propose VSC-RL (Variational Subgoal-Conditioned Reinforcement Learning), a novel reinforcement learning framework that improves the learning efficiency of vision-language agents on complex sequential decision-making tasks. We cast the problem as variational goal-conditioned RL and introduce the SubGoal-Conditioned Evidence Lower BOund (SGC-ELBO) as the optimization objective. Our approach leverages Vision-Language Models (VLMs) to autonomously decompose high-level goals into feasible subgoals, addressing the challenges of long-horizon tasks with sparse rewards. We then optimize the agent's policy by (a) maximizing subgoal-conditioned RL returns and (b) minimizing subgoal-conditioned behavior differences from a reference policy, ensuring both sample efficiency and performance guarantees.

The pipeline of VSC-RL. (a) The VLM autonomously decomposes the goal \( g \) into the subgoals \( \{sg_i\}_{i=1}^N \). VSC-RL optimizes the SGC-ELBO objective, which consists of (b) maximizing the subgoal-conditioned return and (c) minimizing the subgoal-conditioned difference.


The SGC-ELBO is derived by decomposing the original goal-conditioned problem into smaller subgoal-conditioned tasks:

\[
\text{SGC-ELBO}(\pi, \pi_{\text{ref}}, sg_i, g) = \mathbb{E}_{\tau_i \sim p_{\pi}(\tau_i \mid sg_i)} \big[ \log p(O \mid \tau_i, sg_i) \big] - \mathrm{KL}\big( p_{\pi}(\tau_i \mid sg_i) \,\big\|\, p_{\pi_{\text{ref}}}(\tau_i \mid g) \big)
\]

where the first term maximizes the likelihood of the optimality variable \(O\) given the subgoal \(sg_i\), and the second term minimizes the Kullback-Leibler (KL) divergence between the trajectory distribution induced by the policy \( \pi \) conditioned on the subgoal and that of the reference policy \( \pi_{\text{ref}} \) conditioned on the original goal, keeping the learned policy close to the reference policy while it solves the subgoal.
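To make the two terms concrete, below is a minimal PyTorch-style sketch of a one-sample surrogate for the SGC-ELBO on a single subgoal-conditioned trajectory. The REINFORCE-style return term, the reward-to-go estimator, the one-sample KL estimate, and the `kl_coef` weight are illustrative assumptions on our part, not the paper's exact implementation.

```python
import torch

def sgc_elbo_loss(logp_pi: torch.Tensor,
                  logp_ref: torch.Tensor,
                  rewards: torch.Tensor,
                  kl_coef: float = 0.1) -> torch.Tensor:
    """One-sample surrogate of the negated SGC-ELBO for one trajectory tau_i.

    logp_pi:  (T,) log pi(a_t | s_t, sg_i) for the collected actions.
    logp_ref: (T,) log pi_ref(a_t | s_t, g) for the same actions.
    rewards:  (T,) subgoal-conditioned rewards; under the usual
              control-as-inference reading, log p(O | tau_i, sg_i) is
              identified with the (scaled) return of tau_i.
    """
    # Reward-to-go, used as the return estimate in the REINFORCE surrogate.
    returns = rewards.flip(0).cumsum(0).flip(0)
    # (a) Maximize the subgoal-conditioned return.
    return_term = (logp_pi * returns.detach()).sum()
    # (b) One-sample estimate of KL(p_pi(.|sg_i) || p_ref(.|g)) along the
    #     collected trajectory: sum_t [logp_pi - logp_ref].
    kl_term = (logp_pi - logp_ref.detach()).sum()
    # Negate because optimizers minimize, while the SGC-ELBO is maximized.
    return -(return_term - kl_coef * kl_term)
```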


Autonomous vision-language subgoal generation in an AitW task. The vision-language model autonomously decomposes the goal of a complex mobile device control task into easily achievable subgoals.

We employ Vision-Language Models (VLMs) to autonomously generate subgoals. Given a high-level goal \(g\), the VLM generates a set of feasible subgoals \( \{ sg_i \}_{i=1}^N \) that break down the task into simpler, achievable steps. The optimization objective for VSC-RL, using a VLM as the subgoal generator, can be written as:

\[
\max_{\pi} \; \mathbb{E}_{\{sg_i\}_{i=1}^{N} \sim \text{VLM}(g)} \left[ \sum_{i=1}^{N} \text{SGC-ELBO}(\pi, \pi_{\text{ref}}, sg_i, g) \right]
\]

In this optimization, we maximize the expected sum of the SGC-ELBO terms over all subgoals generated by the VLM. This ensures that the agent not only learns to solve each task efficiently but also generalizes across diverse complex decision-making problems.
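Putting the pieces together, the sketch below shows one plausible outer training loop: the VLM proposes subgoals, a trajectory is collected per subgoal, and the summed per-subgoal surrogate is optimized. `vlm.decompose`, `rollout`, the `log_prob` interfaces, and `traj.rewards` are hypothetical placeholders, not APIs from the paper.

```python
def vsc_rl_update(goal, env, policy, ref_policy, vlm, optimizer):
    # (a) Autonomous decomposition: a prompted VLM call that returns a list
    #     of subgoal strings {sg_i}_{i=1}^N (hypothetical wrapper).
    subgoals = vlm.decompose(goal)
    total_loss = 0.0
    for sg in subgoals:
        # Collect a subgoal-conditioned trajectory tau_i.
        traj = rollout(env, policy, condition=sg)
        # Log-probabilities of the executed actions under both policies.
        logp_pi = policy.log_prob(traj, condition=sg)         # pi(a|s, sg_i)
        logp_ref = ref_policy.log_prob(traj, condition=goal)  # pi_ref(a|s, g)
        # Sum the per-subgoal surrogates (sgc_elbo_loss from the sketch above).
        total_loss = total_loss + sgc_elbo_loss(logp_pi, logp_ref, traj.rewards)
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
```

One natural instantiation is to freeze a pretrained agent as \( \pi_{\text{ref}} \), so that the KL term anchors exploration to competent behavior while the subgoal-conditioned returns drive improvement.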


Results

To evaluate the efficacy of VSC-RL, we conducted extensive experiments on challenging vision-language decision-making benchmarks, particularly focusing on mobile device control tasks in the AitW General and Web Shopping datasets. Our results demonstrate that VSC-RL significantly improves both learning efficiency and task success rate compared to state-of-the-art (SOTA) baselines.




VSC-RL significantly outperforms all baselines in both the AitW General and Web Shopping tasks, achieving a final success rate of 0.75 on the General task and 0.6 on the Web Shopping task, surpassing the best baseline, DigiRL, by 15% and 20%, respectively.


| Task | Split | Set-of-Marks | AppAgent | CogAgent | AutoUI | Filtered BC | DigiRL | VSC-RL (ours) |
|---|---|---|---|---|---|---|---|---|
| General | Train | 32.3% | 14.6% | 25.0% | 12.5% | 53.9% | 64.9% | **73.9%** |
| General | Test | 16.7% | 16.7% | 25.0% | 14.6% | 62.5% | 67.7% | **72.9%** |
| Web Shopping | Train | 6.3% | 5.2% | 31.3% | 14.6% | 53.6% | 55.3% | **64.0%** |
| Web Shopping | Test | 11.5% | 8.3% | 38.5% | 17.7% | 54.2% | 41.3% | **59.0%** |

Evaluated performance on the train and test splits of the General and Web Shopping tasks. The best performance is in bold.


For full results and more details, please refer to our paper.

BibTeX


@article{wu2025vsc,
  title={VSC-RL: Advancing Autonomous Vision-Language Agents with Variational Subgoal-Conditioned Reinforcement Learning},
  author={Wu, Qingyuan and Liu, Jianheng and Hao, Jianye and Wang, Jun and Shao, Kun},
  journal={arXiv preprint arXiv:2502.07949},
  year={2025}
}