ADVANCING AUTONOMOUS VLM AGENTS
VIA VARIATIONAL SUBGOAL-CONDITIONED REINFORCEMENT LEARNING

Qingyuan Wu1 2 *, Jianheng Liu3 *, Jianye Hao3 4, Jun Wang5, Kun Shao3 †
1 University of Liverpool 2 University of Southampton 3 Huawei Noah's Ark Lab
4 Tianjin University 5 University College London

* Equal Contribution

Corresponding author: shaokun2@huawei.com

Qualitative example of VSC-RL on the Web Shopping task.

Abstract

State-of-the-art (SOTA) reinforcement learning (RL) methods enable vision-language agents to learn from interactions with the environment without human supervision. However, they suffer from learning inefficiencies when tackling real-world complex sequential decision-making tasks, especially those with sparse reward signals and long-horizon dependencies. To address this issue, we introduce Variational Subgoal-Conditioned RL (VSC-RL), which reformulates the vision-language sequential decision-making task as a variational goal-conditioned RL problem, allowing us to leverage advanced optimization methods to enhance learning efficiency. Specifically, VSC-RL optimizes the SubGoal-Conditioned Evidence Lower Bound (SGC-ELBO), which consists of (a) maximizing the subgoal-conditioned return via RL and (b) minimizing the subgoal-conditioned difference with the reference policy. We theoretically demonstrate that SGC-ELBO is equivalent to the original optimization objective, ensuring improved learning efficiency without sacrificing performance guarantees. Additionally, for real-world complex decision-making tasks, VSC-RL leverages a vision-language model to autonomously decompose the goal into feasible subgoals, enabling efficient learning. Across various benchmarks, including challenging real-world mobile device control tasks, VSC-RL significantly outperforms SOTA vision-language agents, achieving superior performance and remarkable improvements in learning efficiency.

Our Approach: VSC-RL

We propose VSC-RL (Variational Subgoal-Conditioned Reinforcement Learning), a novel reinforcement learning framework that enhances vision-language agents by improving learning efficiency in complex sequential decision-making tasks. We formulate the problem as a variational goal-conditioned RL problem and introduce the SubGoal-Conditioned Evidence Lower Bound (SGC-ELBO) as the optimization objective. Our approach leverages Vision-Language Models (VLMs) to autonomously decompose high-level goals into feasible subgoals, addressing challenges in long-horizon tasks with sparse rewards. We then optimize the agent’s policy by (a) maximizing subgoal-conditioned RL returns and (b) minimizing subgoal-conditioned behavior differences from a reference policy, ensuring both sample efficiency and performance guarantees.

Figure 1. The pipeline of VSC-RL. (a) The VLM autonomously decomposes the goal \( g \) into the subgoals \( \{sg_i\}_{i=1}^N \). VSC-RL optimizes the SGC-ELBO objective, consisting of (b) maximizing the subgoal-conditioned return and (c) minimizing the subgoal-conditioned difference.
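To make the pipeline concrete, here is a minimal Python sketch of the overall training loop. The helper names (`decompose_goal`, `collect_rollout`, `update_policy`) are our own illustrative placeholders for stages (a), (b), and (c) above, not the released implementation:

```python
# Minimal sketch of the VSC-RL pipeline in Figure 1, under assumed helper names.

def vsc_rl(agent, ref_policy, decompose_goal, collect_rollout, update_policy,
           goal, num_iterations=100):
    """Train a subgoal-conditioned agent with VLM-generated subgoals."""
    # (a) The VLM decomposes the high-level goal g into feasible subgoals {sg_i}.
    subgoals = decompose_goal(goal)

    for _ in range(num_iterations):
        for subgoal in subgoals:
            # Sample a subgoal-conditioned trajectory tau_i ~ p_pi(. | sg_i).
            trajectory = collect_rollout(agent, subgoal)
            # Optimize the SGC-ELBO for this subgoal:
            # (b) maximize the subgoal-conditioned return, and
            # (c) minimize the KL to pi_ref conditioned on the original goal g.
            agent = update_policy(agent, ref_policy, trajectory, subgoal, goal)
    return agent
```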


The SGC-ELBO is derived by decomposing the original goal-conditioned problem into smaller subgoal-conditioned tasks:

\[
\text{SGC-ELBO}(\pi, \pi_{\text{ref}}, sg_i, g) = \mathbb{E}_{\tau_i \sim p_{\pi}(\tau_i \mid sg_i)}\big[\log p(O \mid \tau_i, sg_i)\big] - \mathrm{KL}\big(p_{\pi}(\tau_i \mid sg_i) \,\|\, p_{\pi_{\text{ref}}}(\tau_i \mid g)\big)
\]

where the first term is the expected log-likelihood of the optimality variable \(O\) given the subgoal \(sg_i\), maximized via RL, and the second term is the Kullback-Leibler (KL) divergence between the trajectory distribution of the policy \( \pi \) conditioned on \(sg_i\) and that of the reference policy \( \pi_{\text{ref}} \) conditioned on \(g\), minimized so that the learned policy stays aligned with the reference policy while effectively solving the subgoal.
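As a concrete illustration, the following is a minimal PyTorch sketch of a Monte-Carlo estimate of the SGC-ELBO for a single subgoal. The tensor shapes and the use of the subgoal-conditioned return as a stand-in for \(\log p(O \mid \tau_i, sg_i)\) are our own assumptions, not the paper's exact implementation:

```python
# A minimal PyTorch sketch of the SGC-ELBO objective for one subgoal sg_i.
import torch

def sgc_elbo(log_probs_pi, log_probs_ref, subgoal_returns):
    """Estimate SGC-ELBO(pi, pi_ref, sg_i, g) from a batch of trajectories.

    log_probs_pi:    (B, T) log pi(a_t | s_t, sg_i) for trajectories sampled from pi
    log_probs_ref:   (B, T) log pi_ref(a_t | s_t, g) evaluated on the same actions
    subgoal_returns: (B,)   stand-in for log p(O | tau_i, sg_i), e.g. the return
    """
    # First term: expected log-likelihood of optimality under the sampled trajectories.
    likelihood_term = subgoal_returns.mean()

    # Second term: Monte-Carlo estimate of KL(p_pi(tau | sg_i) || p_ref(tau | g)).
    # With shared environment dynamics, the trajectory-level KL reduces to the
    # sum of per-step action log-ratios along trajectories drawn from pi.
    kl_term = (log_probs_pi - log_probs_ref).sum(dim=-1).mean()

    return likelihood_term - kl_term


# Usage with random placeholder data (batch of 8 trajectories, horizon 20):
B, T = 8, 20
elbo = sgc_elbo(
    log_probs_pi=torch.randn(B, T),
    log_probs_ref=torch.randn(B, T),
    subgoal_returns=torch.randn(B),
)
loss = -elbo  # maximize the SGC-ELBO by minimizing its negative
```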


Figure 2. Autonomous vision-language subgoal generation in AitW task. The vision-language model autonomously decomposes the goal of the complicated mobile device control task into easily achievable subgoals.

We employ Vision-Language Models (VLMs) to autonomously generate subgoals. Given a high-level goal \(g\), the VLM generates a set of feasible subgoals \( \{ sg_i \}_{i=1}^N \) that break down the task into simpler, achievable steps. The optimization objective for VSC-RL, using a VLM as the subgoal generator, can be written as:

\[
\max_{\pi} \; \sum_{i=1}^{N} \text{SGC-ELBO}(\pi, \pi_{\text{ref}}, sg_i, g), \quad \text{where } \{sg_i\}_{i=1}^{N} = \text{VLM}(g)
\]

In this optimization, we maximize the sum of the SGC-ELBO terms over all generated subgoals. This encourages the agent to solve each subgoal efficiently while remaining aligned with the original goal, helping it generalize across various complex decision-making problems.
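A hedged sketch of how the two pieces fit together is shown below; the prompt wording and helper names (`query_vlm`, `sgc_elbo_for_subgoal`) are illustrative assumptions rather than the actual system prompt or API:

```python
# Sketch: a VLM proposes subgoals {sg_i} = VLM(g); the agent maximizes the
# sum of per-subgoal SGC-ELBO terms.
import json

def generate_subgoals(query_vlm, goal, screenshot=None):
    """Ask a vision-language model to decompose a goal into ordered subgoals."""
    prompt = (
        "Decompose the following device-control goal into a short list of "
        "feasible subgoals, returned as a JSON list of strings.\n"
        f"Goal: {goal}"
    )
    response = query_vlm(prompt, image=screenshot)
    # Assumes the VLM follows the instruction and returns valid JSON.
    return json.loads(response)  # e.g. ["open the browser", "search for ...", ...]


def total_objective(agent, ref_policy, subgoals, goal, sgc_elbo_for_subgoal):
    """Sum SGC-ELBO(pi, pi_ref, sg_i, g) over all generated subgoals."""
    return sum(
        sgc_elbo_for_subgoal(agent, ref_policy, sg, goal) for sg in subgoals
    )
```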


Results

To evaluate the efficacy of VSC-RL, we conducted extensive experiments on challenging vision-language decision-making benchmarks. Our results on mobile device control tasks (AitW General and Web Shopping) and web control tasks (WebArena-Lite) demonstrate that VSC-RL significantly improves both learning efficiency and task success rate compared to state-of-the-art (SOTA) baselines.





VSC-RL significantly outperforms all baselines on the AitW General and Web Shopping tasks, achieving a final success rate of 0.75 on the General task and 0.6 on the Web Shopping task, surpassing the best baseline, DigiRL, by 15% and 20%, respectively.


| Task | Task Split | Set-of-Marks | AppAgent | CogAgent | AutoUI | Filtered BC | DigiRL | VSC-RL (ours) |
|---|---|---|---|---|---|---|---|---|
| General | Train | 32.3% | 14.6% | 25.0% | 12.5% | 53.9% | 64.9% | **73.9%** |
| General | Test | 16.7% | 16.7% | 25.0% | 14.6% | 62.5% | 67.7% | **72.9%** |
| Web Shopping | Train | 6.3% | 5.2% | 31.3% | 14.6% | 53.6% | 55.3% | **64.0%** |
| Web Shopping | Test | 11.5% | 8.3% | 38.5% | 17.7% | 54.2% | 41.3% | **59.0%** |

The evaluated performance on the train and test subsets of the AitW tasks. The best performance in each row is highlighted in bold.

VSC-RL achieves superior learning efficiency and final performance on the WebArena-Lite tasks. It consistently outperforms the other methods, achieving the highest overall success rate of 34.5%.

| Method | Reddit (12.7%) | Gitlab (19.4%) | CMS (21.2%) | Map (18.8%) | OSS (27.9%) | All (100.0%) |
|---|---|---|---|---|---|---|
| SFT | 36.8% | 6.7% | 20.0% | 33.3% | 17.8% | 20.6% |
| Filtered BC | 52.6% | 20.0% | 31.4% | 23.3% | 8.9% | 23.0% |
| AWR | 57.9% | 26.7% | 31.4% | 26.7% | 17.8% | 28.5% |
| DigiRL | 52.4% | 28.1% | 37.1% | 32.3% | 15.2% | 30.3% |
| WebRL | 57.1% | 28.1% | 34.3% | **35.5%** | 15.2% | 30.9% |
| VSC-RL (ours) | **61.9%** | **31.3%** | **40.0%** | **35.5%** | **19.6%** | **34.5%** |

The evaluated performance on WebArena-Lite tasks (column headers show each task category's share of the test set). The best performance in each column is highlighted in bold.

For full results and more details, please refer to our paper.

BibTeX


@article{wu2025vsc,
  title={VSC-RL: Advancing Autonomous Vision-Language Agents with Variational Subgoal-Conditioned Reinforcement Learning},
  author={Wu, Qingyuan and Liu, Jianheng and Hao, Jianye and Wang, Jun and Shao, Kun},
  journal={arXiv preprint arXiv:2502.07949},
  year={2025}
}