AstraNav-World: World Model for Foresight Control and Consistency

Published: December 25, 2025

Note: this is the English version paired with the Chinese post AstraNav-World: World Model for Foresight Control and Consistency.

Abstract

We propose AstraNav-World, an end-to-end world model for embodied navigation in open and dynamic environments. The model unifies multi-step visual prediction and action sequence reasoning into a single probabilistic framework by combining diffusion-based video generation with a vision-language policy. A bidirectional constraint mechanism enforces both the executability of predicted futures and the physical consistency of actions, which largely mitigates error accumulation in the traditional “predict-then-plan” pipeline. Experiments on diverse navigation benchmarks show improved trajectory accuracy and task success rate, and the model exhibits strong zero-shot generalization in real-world tests.

Introduction

Here you can briefly motivate:

Why foresight control is crucial for embodied agents in open worlds;
Limitations of decoupled prediction and planning pipelines;
The target audience (CV / robotics / embodied AI researchers and engineers).

Method

1. Problem Definition

Clearly state the embodied navigation setting and evaluation protocol, and define what “foresight control” and “consistency” mean in this work.

2. Approach

Describe the AstraNav-World architecture:

a multi-module design combining a diffusion video generator and a vision-language (VL) policy;
training objectives for action-conditioned visual prediction and policy learning;
bidirectional constraints that tie predicted futures to executable actions.

3. Experiments

Summarize:

benchmarks, metrics and baselines;
ablations on each key module and training objective;
zero-shot transfer from simulation to real-world environments.

4. Results and Analysis

Discuss:

trajectory and success-rate improvements;
what happens when coupling between vision and action is removed;
qualitative examples that illustrate better foresight and consistency.

Conclusion and Future Work

Highlight the main takeaways and outline:

deployment to real robots;
extension to interaction / manipulation tasks;
richer multi-modal inputs and better interpretability.

AstraNav-World: World Model for Foresight Control and Consistency

Published: December 25, 2025

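Personal Summary

This was another paper with a very heavy workload. What pleasantly surprised me was that, during training, LoRA alone was enough to quickly learn the planning features from the VLA and to unify visual prediction and action prediction efficiently. SkyReels-v4 also uses an approach similar to our MMFCA, so I feel our architecture is fairly ahead of its time. The model also contains many small details that I can describe more carefully later.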

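Abstract

Lifelong embodied navigation requires an agent to accumulate, retain, and exploit spatial-semantic experience across tasks, so that it can explore efficiently in new environments and reach targets quickly in familiar ones. Existing object-centric memory frameworks are interpretable, but they depend on detection and reconstruction pipelines, which limits their robustness and scalability. To address this, we propose AstraNav-Memory, an image-centric memory framework that achieves long-term implicit memory by coupling an efficient visual context compression module end-to-end with a Qwen2.5-VL-based navigation policy. The framework builds its visual tokenizer from a ViT backbone with frozen DINOv3 features combined with lightweight PixelUnshuffle+Conv blocks, and it supports configurable compression rates. For example, at the 16x compression setting each image is encoded into only about 30 tokens, which expands the effective context capacity from tens of images to hundreds. Experiments on the GOAT-Bench and HM3D-OVON benchmarks show that the method achieves state-of-the-art navigation performance, improving exploration efficiency in unfamiliar environments while shortening navigation paths in familiar ones. Ablation studies further show that a moderate compression rate gives the best balance between efficiency and accuracy. The work demonstrates that a compressed, image-centric memory framework can serve as a practical and scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-like efficiency.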
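The token arithmetic behind the 16x setting can be illustrated with a small sketch. It only counts tokens and is not the tokenizer itself. The 24 x 20 patch grid, the unshuffle factor of 4, and the 16,000-token budget are hypothetical values chosen so that the result matches the "about 30 tokens per image" figure quoted above; the function names are also invented for illustration.

```python
def tokens_per_image(grid_h: int, grid_w: int, factor: int) -> int:
    """Count tokens left after merging factor x factor patch neighborhoods.

    A PixelUnshuffle-style step with downscale `factor` folds each
    factor x factor spatial block into channels, so the number of
    spatial positions (tokens) shrinks by factor ** 2.
    """
    if grid_h % factor or grid_w % factor:
        raise ValueError("grid size must be divisible by the unshuffle factor")
    return (grid_h // factor) * (grid_w // factor)


def images_in_context(token_budget: int, tokens_each: int) -> int:
    """Number of whole images that fit into a fixed visual token budget."""
    return token_budget // tokens_each


# Hypothetical settings (assumptions, not values from the paper).
GRID_H, GRID_W = 24, 20      # patch grid before compression: 480 tokens
FACTOR = 4                   # 4 x 4 merge, i.e. 16x fewer tokens
TOKEN_BUDGET = 16_000        # visual tokens reserved for image history

uncompressed = tokens_per_image(GRID_H, GRID_W, 1)
compressed = tokens_per_image(GRID_H, GRID_W, FACTOR)

print(uncompressed, compressed)                       # 480 30
print(images_in_context(TOKEN_BUDGET, uncompressed))  # 33
print(images_in_context(TOKEN_BUDGET, compressed))    # 533
```

Under these assumed numbers the same budget holds 33 uncompressed images or 533 compressed ones, which is consistent with the stated move from tens of images to hundreds.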

[AAAI 2026] Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

Published: August 11, 2025

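Personal Summary

This took a long time and the workload was rather heavy. Let me describe how the work evolved. At first I simply wanted to build a unified model that could generate a video with several visual effects at once. I later found that some effects are hard to make compatible with each other, and that some are instance-level (for example, an object disappearing or exploding) while others are frame-level (for example, heavy snowfall or a "花花世界" style effect). At the time nobody had yet worked on controllable, coordinated multi-effect composition, so I kept going down this path. I naturally started with ControlNet, but it does introduce considerable computational overhead. I later found EasyControl and felt that its attention-level mask implementation would work better, although in the end I personally felt the difference in results was not large.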

Jasper (Jintao Chen)
