✨ EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks ✨
Abstract
Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM-based auto-labeling is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations, and these errors are amplified during long-horizon spatial instruction following. The issues stem from insufficient coverage of minute-long, daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or fail to respect real-world physical attributes. To address these gaps, we introduce EgoTL. EgoTL builds a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, we benchmark VLMs and World Models on six task dimensions from three layers, including long-horizon generation over minute-long sequences, across over 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open-world simulators. Finally, we finetune foundation models with human CoT aligned with metric labels on the training split of EgoTL, which improves long-horizon planning, step-wise reasoning, instruction following, and spatial grounding.
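The core of the say-before-act protocol is attaching word-level speech timestamps to video clips. A minimal sketch of that alignment step is below; the `Word` schema and fixed clip boundaries are illustrative assumptions, not the released pipeline's format:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # word onset, seconds from video start
    end: float    # word offset, seconds from video start

def attach_words_to_clips(words, clip_bounds):
    """Assign each timestamped word to the clip whose
    [start, end) interval contains its onset, yielding one
    spoken-reasoning transcript per clip."""
    transcripts = [[] for _ in clip_bounds]
    for w in words:
        for i, (s, e) in enumerate(clip_bounds):
            if s <= w.start < e:
                transcripts[i].append(w.text)
                break
    return [" ".join(t) for t in transcripts]

# Example: spoken reasoning over two 5-second clips
words = [Word("open", 0.4, 0.8), Word("the", 0.9, 1.0),
         Word("fridge", 1.1, 1.6), Word("grab", 5.2, 5.6),
         Word("the", 5.7, 5.8), Word("milk", 5.9, 6.3)]
print(attach_words_to_clips(words, [(0.0, 5.0), (5.0, 10.0)]))
# → ['open the fridge', 'grab the milk']
```

Because the subject speaks each step before acting, the clip-level transcripts double as step-wise CoT labels for the subsequent action segment.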
Compare your spatial intelligence with the latest GPT!
Watch the egocentric video, answer the question, then reveal the ground truth and GPT's answer.
Memory Planning Task 1: Plan a route from memory
Question: My abstract task is to grab the banana in the kitchen. Based on the memory-bank video, which option should I choose if I start from the living room?
Options:
Ground Truth: A GPT: B
Action Reasoning under Complex Environments Task 2: Which action should be taken in the current frame?
Question: My abstract task is to grab the milk from the fridge. Which action should be taken in the current frame?
Options:
Ground Truth: B GPT: A
Next Action Prediction Task 3: What happens next?
Question: After performing the actions in the video clip, what is the most likely next action?
Options:
Ground Truth: B GPT: C
Action Recognition Task 4: What is the subject doing?
Question: Which description best matches the main action in this clip?
Options:
Ground Truth: D GPT: A
Direction Recognition Task 5: Which Direction?
Question: In the video clip, in which direction does the subject move?
Options:
Ground Truth: B GPT: C
Distance Estimation Task 6: How far apart?
Question: Approximately how far, in meters, does the person walk in this video clip?
Ground Truth: 3.98 m GPT: 0.95 m
EgoTL-Bench Overview
Benchmark Overview: EgoTL-Bench decomposes egocentric spatial understanding into six tasks across three layers. Memory-conditioned planning asks the model to generate an action plan from a memory-bank walkthrough and a high-level goal. Scene-aware action reasoning tests whether it selects the correct action in cluttered scenes, such as moving an obstacle before opening a door. Next action prediction checks if the model can infer the immediate next step from the current frame and abstract task. At the perceptual layer, action recognition describes the ongoing manipulation, direction recognition identifies egocentric motion primitives (walking straight, turning, standing or sitting), and distance estimation predicts how far the subject walks in meters. The shown examples are simplified; the full benchmark uses more templates, distractors, and longer episodes.
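The three-layer, six-task structure described above can be written down as a simple task-to-metric mapping. The grouping below is inferred from the overview (in particular, placing next action prediction under the reasoning layer is our assumption), and the metric names mirror the evaluation summary:

```python
# Three layers, six tasks; layer membership for next_action_prediction
# is an assumption inferred from the overview text.
EGOTL_BENCH = {
    "memory-conditioned planning": {
        "memory_planning": "accuracy",
    },
    "scene-aware action reasoning": {
        "action_reasoning": "accuracy",
        "next_action_prediction": "accuracy",
    },
    "perceptual-metric understanding": {
        "action_recognition": "accuracy",
        "direction_recognition": "accuracy",
        "distance_estimation": "MRA",
    },
}

# Sanity check: three layers, six tasks total
n_tasks = sum(len(tasks) for tasks in EGOTL_BENCH.values())
print(len(EGOTL_BENCH), n_tasks)  # → 3 6
```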
Evaluation Summary: EgoTL-Bench evaluates large foundation models on long-horizon egocentric reasoning with six tasks from three layers: memory-conditioned planning, scene-aware action reasoning, and perceptual-metric understanding. We report two chance-level baselines, compare open-source and proprietary VLMs, and include an EgoTL-finetuned model. Multiple-choice tasks are measured with accuracy, while distance estimation uses mean relative accuracy (MRA). We also evaluate world models on 60-second CoT-conditioned rollouts using CLIP-Score and VBench. Overall, current foundation models still lag in robust long-horizon spatial grounding, while EgoTL-based finetuning improves planning, step-wise reasoning, and rollout consistency.
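For distance estimation, mean relative accuracy can be computed as sketched below. This follows a common formulation in recent spatial benchmarks (relative error thresholded at a sweep of confidence levels, then averaged); the exact threshold set used by EgoTL-Bench is an assumption here:

```python
def mean_relative_accuracy(pred, gt, thresholds=None):
    """MRA: fraction of confidence thresholds theta at which the
    relative error |pred - gt| / gt stays below 1 - theta.
    Default threshold sweep 0.50..0.95 is an assumed convention."""
    if thresholds is None:
        thresholds = [0.50 + 0.05 * i for i in range(10)]  # 0.50 .. 0.95
    rel_err = abs(pred - gt) / gt
    return sum(rel_err < 1 - t for t in thresholds) / len(thresholds)

# Task 6 example from above: ground truth 3.98 m, GPT predicted 0.95 m
print(mean_relative_accuracy(0.95, 3.98))  # → 0.0 (relative error ~0.76)
print(mean_relative_accuracy(3.98, 3.98))  # → 1.0 (exact prediction)
```

Unlike plain accuracy, MRA gives partial credit to near-miss metric predictions while still penalizing large errors like the 0.95 m vs. 3.98 m example above.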
Long-Horizon Video Generation: Modern world models can produce visually coherent short-horizon predictions, but on long-horizon egocentric tasks in EgoTL their rollouts quickly drift from the underlying 3D scene and intended plan: object size and position change over time, occlusions are inconsistent, and the camera may jump between rooms instead of following a continuous path. Even when individual frames remain photorealistic, the sequences often violate basic spatial constraints (room connectivity, landmark layout, object reachability), leading to trajectories that look locally reasonable yet fail to reach the final goal or handle obstacles correctly. These observations suggest that current long-horizon generative models lack an explicit egocentric spatial backbone, and that maintaining a consistent 3D cognitive map over hundreds of frames is essential if world-model-based agents are to support reliable navigation and manipulation in real homes.
BibTeX
@article{liu2026egotl,
  title={EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks},
  author={Lulin Liu and Dayou Li and Yiqing Liang and Sicong Jiang and Hitesh Vijay and Hezhen Hu and Xuhai Xu and Zirui Liu and Srinivas Shakkottai and Manling Li and Zhiwen Fan},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026},
  url={https://arxiv.org/abs/XXXX.XXXXX}
}