VSC-RL:
ADVANCING AUTONOMOUS VISION-LANGUAGE AGENTS WITH VARIATIONAL SUBGOAL-CONDITIONED REINFORCEMENT LEARNING

Qingyuan Wu1 * †, Jianheng Liu2 *, Jianye Hao2 3, Jun Wang4, Kun Shao2 †
1 University of Liverpool
2 Huawei Noah's Ark Lab
3 Tianjin University
4 University College London

* Equal Contribution

† Corresponding authors: qingwu2@liverpool.ac.uk, shaokun2@huawei.com

Qualitative example of VSC-RL on the Web Shopping task.

Abstract

State-of-the-art (SOTA) reinforcement learning (RL) methods enable vision-language agents to learn from interactions with the environment without human supervision. However, they learn inefficiently on real-world complex sequential decision-making tasks, especially those with sparse reward signals and long-horizon dependencies. To address this issue, we introduce Variational Subgoal-Conditioned RL (VSC-RL), which reformulates the vision-language sequential decision-making task as a variational goal-conditioned RL problem, allowing us to leverage advanced optimization methods to enhance learning efficiency. Specifically, VSC-RL optimizes the SubGoal-Conditioned Evidence Lower BOund (SGC-ELBO), which consists of (a) maximizing the subgoal-conditioned return via RL and (b) minimizing the subgoal-conditioned difference from the reference policy. We theoretically demonstrate that the SGC-ELBO is equivalent to the original optimization objective, ensuring improved learning efficiency without sacrificing performance guarantees. Additionally, for real-world complex decision-making tasks, VSC-RL leverages a vision-language model to autonomously decompose the goal into feasible subgoals, enabling efficient learning. Across various benchmarks, including challenging real-world mobile device control tasks, VSC-RL significantly outperforms SOTA vision-language agents, achieving superior performance and remarkable improvements in learning efficiency.

Our Approach: VSC-RL

We propose VSC-RL (Variational Subgoal-Conditioned Reinforcement Learning), a novel reinforcement learning framework that improves the learning efficiency of vision-language agents on complex sequential decision-making tasks. We cast the problem as variational goal-conditioned RL and introduce the SubGoal-Conditioned Evidence Lower BOund (SGC-ELBO) as the optimization objective. Our approach leverages Vision-Language Models (VLMs) to autonomously decompose high-level goals into feasible subgoals, addressing the challenges of long-horizon tasks with sparse rewards. We then optimize the agent's policy by (a) maximizing subgoal-conditioned RL returns and (b) minimizing subgoal-conditioned behavior differences from a reference policy, ensuring both sample efficiency and performance guarantees.

The pipeline of VSC-RL. (a) The VLM autonomously decomposes the goal \( g \) into the subgoals \( \{sg_i\}_{i=1}^N \). VSC-RL optimizes the SGC-ELBO objective, which consists of (b) maximizing the subgoal-conditioned return and (c) minimizing the subgoal-conditioned difference.


The SGC-ELBO is derived by decomposing the original goal-conditioned problem into smaller subgoal-conditioned tasks:

\[
\text{SGC-ELBO}(\pi, \pi_{\text{ref}}, sg_i, g) = \mathbb{E}_{\tau_i \sim p_{\pi}(\tau_i \mid sg_i)} \big[ \log p(O \mid \tau_i, sg_i) \big] - \mathrm{KL}\big( p_{\pi}(\tau_i \mid sg_i) \,\big\|\, p_{\pi_{\text{ref}}}(\tau_i \mid g) \big)
\]

where the first term maximizes the likelihood of the optimality variable \(O\) given the subgoal \(sg_i\), and the second term minimizes the Kullback-Leibler (KL) divergence between the trajectory distribution induced by the policy \( \pi \) conditioned on the subgoal and that of the reference policy \( \pi_{\text{ref}} \) conditioned on the original goal, keeping the learned policy close to the reference policy while it solves the subgoal.
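To make the two terms concrete, below is a minimal PyTorch-style sketch of a one-sample surrogate for the SGC-ELBO on a single subgoal-conditioned trajectory. The REINFORCE-style return term, the reward-to-go estimator, the one-sample KL estimate, and the `kl_coef` weight are illustrative assumptions on our part, not the paper's exact implementation.

```python
import torch

def sgc_elbo_loss(logp_pi: torch.Tensor,
                  logp_ref: torch.Tensor,
                  rewards: torch.Tensor,
                  kl_coef: float = 0.1) -> torch.Tensor:
    """One-sample surrogate of the negated SGC-ELBO for one trajectory tau_i.

    logp_pi:  (T,) log pi(a_t | s_t, sg_i) for the collected actions.
    logp_ref: (T,) log pi_ref(a_t | s_t, g) for the same actions.
    rewards:  (T,) subgoal-conditioned rewards; under the usual
              control-as-inference reading, log p(O | tau_i, sg_i) is
              identified with the (scaled) return of tau_i.
    """
    # Reward-to-go, used as the return estimate in the REINFORCE surrogate.
    returns = rewards.flip(0).cumsum(0).flip(0)
    # (a) Maximize the subgoal-conditioned return.
    return_term = (logp_pi * returns.detach()).sum()
    # (b) One-sample estimate of KL(p_pi(.|sg_i) || p_ref(.|g)) along the
    #     collected trajectory: sum_t [logp_pi - logp_ref].
    kl_term = (logp_pi - logp_ref.detach()).sum()
    # Negate because optimizers minimize, while the SGC-ELBO is maximized.
    return -(return_term - kl_coef * kl_term)
```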


Autonomous vision-language subgoal generation in an AitW task. The vision-language model autonomously decomposes the goal of a complex mobile device control task into easily achievable subgoals.

We employ Vision-Language Models (VLMs) to autonomously generate subgoals. Given a high-level goal \(g\), the VLM generates a set of feasible subgoals \( \{ sg_i \}_{i=1}^N \) that break down the task into simpler, achievable steps. The optimization objective for VSC-RL, using a VLM as the subgoal generator, can be written as:

\[
\max_{\pi} \; \mathbb{E}_{\{sg_i\}_{i=1}^{N} \sim \text{VLM}(g)} \left[ \sum_{i=1}^{N} \text{SGC-ELBO}(\pi, \pi_{\text{ref}}, sg_i, g) \right]
\]

In this optimization, we maximize the expected sum of the SGC-ELBO terms over all subgoals generated by the VLM. This ensures that the agent not only learns to solve each task efficiently but also generalizes across diverse complex decision-making problems.
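Putting the pieces together, the sketch below shows one plausible outer training loop: the VLM proposes subgoals, a trajectory is collected per subgoal, and the summed per-subgoal surrogate is optimized. `vlm.decompose`, `rollout`, the `log_prob` interfaces, and `traj.rewards` are hypothetical placeholders, not APIs from the paper.

```python
def vsc_rl_update(goal, env, policy, ref_policy, vlm, optimizer):
    # (a) Autonomous decomposition: a prompted VLM call that returns a list
    #     of subgoal strings {sg_i}_{i=1}^N (hypothetical wrapper).
    subgoals = vlm.decompose(goal)
    total_loss = 0.0
    for sg in subgoals:
        # Collect a subgoal-conditioned trajectory tau_i.
        traj = rollout(env, policy, condition=sg)
        # Log-probabilities of the executed actions under both policies.
        logp_pi = policy.log_prob(traj, condition=sg)         # pi(a|s, sg_i)
        logp_ref = ref_policy.log_prob(traj, condition=goal)  # pi_ref(a|s, g)
        # Sum the per-subgoal surrogates (sgc_elbo_loss from the sketch above).
        total_loss = total_loss + sgc_elbo_loss(logp_pi, logp_ref, traj.rewards)
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
```

One natural instantiation is to freeze a pretrained agent as \( \pi_{\text{ref}} \), so that the KL term anchors exploration to competent behavior while the subgoal-conditioned returns drive improvement.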


Results

To evaluate the efficacy of VSC-RL, we conducted extensive experiments on challenging vision-language decision-making benchmarks, particularly focusing on mobile device control tasks in the AitW General and Web Shopping datasets. Our results demonstrate that VSC-RL significantly improves both learning efficiency and task success rate compared to state-of-the-art (SOTA) baselines.




VSC-RL significantly outperforms all baselines in both the AitW General and Web Shopping tasks, achieving a final success rate of 0.75 on the General task and 0.6 on the Web Shopping task, surpassing the best baseline, DigiRL, by 15% and 20%, respectively.


| Task | Split | Set-of-Marks | AppAgent | CogAgent | AutoUI | Filtered BC | DigiRL | VSC-RL (ours) |
|---|---|---|---|---|---|---|---|---|
| General | Train | 32.3% | 14.6% | 25.0% | 12.5% | 53.9% | 64.9% | **73.9%** |
| General | Test | 16.7% | 16.7% | 25.0% | 14.6% | 62.5% | 67.7% | **72.9%** |
| Web Shopping | Train | 6.3% | 5.2% | 31.3% | 14.6% | 53.6% | 55.3% | **64.0%** |
| Web Shopping | Test | 11.5% | 8.3% | 38.5% | 17.7% | 54.2% | 41.3% | **59.0%** |

Evaluated performance on the train and test splits of the General and Web Shopping tasks. The best performance is in bold.


For full results and more details, please refer to our paper.

BibTeX


@article{wu2025vsc,
  title={VSC-RL: Advancing Autonomous Vision-Language Agents with Variational Subgoal-Conditioned Reinforcement Learning},
  author={Wu, Qingyuan and Liu, Jianheng and Hao, Jianye and Wang, Jun and Shao, Kun},
  journal={arXiv preprint arXiv:2502.07949},
  year={2025}
}