Posts by Tags

AAAI2026

[AAAI2026]Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

less than 1 minute read

Published:

Personal Notes

This project took a long time, and the workload was substantial. A brief account of how it evolved: the initial goal was simply a unified model that could generate several visual effects in one video. We then found that some effects are hard to make compatible with each other, and that some are instance-level (e.g., an object vanishing or exploding) while others are frame-level (e.g., falling snow or a "world of flowers" effect). Since, at the time, no published work addressed controllable, cooperative multi-effect composition, we kept going down this path. We naturally started from ControlNet, but it introduces considerable extra computation; we later found EasyControl and felt that its attention-level mask formulation was the better fit, although in the end the quality difference between the two was, in my view, fairly small.
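The attention-level masking idea can be illustrated with a toy sketch (an illustrative assumption, not the paper's or EasyControl's actual implementation): each effect's condition tokens exchange attention only with image tokens inside that effect's spatial region, so several effects can be conditioned jointly without extra ControlNet branches. Token counts and region layouts below are made up for the example.

```python
import numpy as np

def effect_attention_mask(n_img, regions):
    """Build a boolean attention mask over [image tokens | condition tokens].
    regions: one boolean (n_img,) array per effect; for simplicity each effect
    contributes a single condition token appended after the image tokens."""
    n_cond = len(regions)
    n = n_img + n_cond
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_img, :n_img] = True          # image tokens attend to each other freely
    for j, region in enumerate(regions):
        col = n_img + j
        mask[:n_img, col] = region       # image -> condition only inside the region
        mask[col, :n_img] = region       # condition token reads only its region
        mask[col, col] = True            # condition token attends to itself
    return mask

n_img = 8
regions = [np.array([1, 1, 1, 1, 0, 0, 0, 0], bool),   # effect 1: left half
           np.array([0, 0, 0, 0, 1, 1, 1, 1], bool)]   # effect 2: right half
m = effect_attention_mask(n_img, regions)
print(m.shape, m[0, 8], m[5, 8])  # (10, 10) True False
```

The mask would be applied additively (as `-inf` on `False` entries) to the attention logits, which is where the compute saving over a full ControlNet branch comes from.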

BenchMark

[ICLR2026] NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models

less than 1 minute read

Published:

Personal Notes

One open problem in current video generation: given a prompt such as "a person runs, then jumps, then walks, then runs again", which contains multiple TNAs (temporal narrative atoms), how should the model sensibly align video time with the sequence of events? A question worth thinking about. (Some related work exists, e.g., SwitchCraft, but a more promising direction seems to be building this into an autoregressive pipeline.)

CVPR2025

[CVPR2025]Decouple Distortion from Perception: Region Adaptive Diffusion for Extreme-low Bitrate Perception Image Compression

6 minute read

Published:

Abstract

Generative image compression leverages the generative capabilities of diffusion models to achieve excellent perceptual fidelity at extreme-low bitrates. However, existing methods overlook the non-uniform complexity of images, making it difficult to balance global perceptual quality with local texture consistency and to achieve efficient allocation of coding resources. To address this issue, this paper proposes the Map-guided Masked Realistic Image Diffusion Codec (MRIDC), which aims to optimize the trade-off between local distortion and global perceptual quality in extreme-low bitrate compression. MRIDC integrates a vector-quantized image encoder with a diffusion-based decoder. At the encoding stage, the Map-guided Latent Masking (MLM) module enables adaptive resource allocation based on image complexity. At the decoding stage, the Bidirectional Prediction Controllable Generation (BPCG) module completes masked latent variables and reconstructs images. Experimental results demonstrate that MRIDC achieves state-of-the-art (SOTA) perceptual compression quality at extreme-low bitrates, effectively preserving feature consistency in key regions, advancing the perceptual rate-distortion performance curve, and establishing a new benchmark for balancing compression efficiency and visual fidelity.
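As a rough illustration of what a map-guided latent mask might do (a hypothetical sketch, not MRIDC's actual MLM module): keep and entropy-code only the latent positions with the highest complexity scores, leaving the masked positions for the diffusion decoder to complete. The keep ratio and complexity score below are assumptions for the example.

```python
import numpy as np

def map_guided_mask(latents, complexity, keep_ratio=0.25):
    """latents: (N, D) latent vectors; complexity: (N,) per-position score.
    Returns the coded subset and the boolean keep-mask; non-kept positions
    are left for the decoder-side generative model to inpaint."""
    n = latents.shape[0]
    k = max(1, int(n * keep_ratio))
    keep = np.argsort(complexity)[-k:]   # indices of the k most complex positions
    mask = np.zeros(n, dtype=bool)
    mask[keep] = True
    return latents[mask], mask           # only these positions are entropy-coded

rng = np.random.default_rng(1)
latents = rng.standard_normal((64, 8))
complexity = np.abs(latents).mean(axis=1)   # stand-in complexity score
coded, mask = map_guided_mask(latents, complexity, keep_ratio=0.25)
print(coded.shape, int(mask.sum()))  # (16, 8) 16
```

Lowering `keep_ratio` trades bitrate for reconstruction burden on the diffusion decoder, which is exactly the local-distortion vs. global-perception trade-off the abstract describes.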

CVPR2026

[CVPR2026]AstraNav-Memory: Contexts Compression for Long Memory

less than 1 minute read

Published:

Abstract

Lifelong embodied navigation requires an agent to accumulate, retain, and exploit spatial-semantic experience across tasks, so that it can explore new environments efficiently and reach goals quickly in familiar ones. Existing object-centric memory frameworks are interpretable, but they depend on detection and reconstruction pipelines, which limits robustness and scalability. This paper proposes AstraNav-Memory, an image-centric memory framework that couples an efficient visual-context compression module end-to-end with a Qwen2.5-VL-based navigation policy to realize long-horizon implicit memory. The framework builds a visual tokenizer from a ViT backbone with frozen DINOv3 features plus a lightweight PixelUnshuffle+Conv block, supporting configurable compression ratios: at 16x compression, each image is encoded into roughly 30 tokens, extending the effective context capacity from tens of images to hundreds. Experiments on the GOAT-Bench and HM3D-OVON benchmarks show state-of-the-art navigation performance, with improved exploration efficiency in unfamiliar environments and shorter paths in familiar ones. Ablations further show that a moderate compression ratio strikes the best balance between efficiency and accuracy. The study confirms that a compressed image-centric memory can serve as a practical, scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-like efficiency.
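The token-count arithmetic behind the PixelUnshuffle+Conv tokenizer can be sketched in plain NumPy (a minimal illustration; the grid size, feature dimension, and projection weights are assumptions, not the paper's configuration): folding each 4x4 spatial neighbourhood of patch features into the channel dimension reduces the token count by 4^2 = 16x, after which a 1x1-conv-style projection restores the feature width.

```python
import numpy as np

def pixel_unshuffle(x, r):
    """Fold each r x r spatial block into channels: (C, H, W) -> (C*r*r, H/r, W/r)."""
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)       # (C, r, r, H/r, W/r)
    return x.reshape(c * r * r, h // r, w // r)

def compress_tokens(feats, grid, r, w_proj):
    """feats: (N, C) ViT patch tokens with N = grid*grid; returns (N/r^2, D) tokens.
    w_proj plays the role of a 1x1 conv mixing the folded channels."""
    n, c = feats.shape
    fmap = feats.T.reshape(c, grid, grid)        # back to a spatial feature map
    folded = pixel_unshuffle(fmap, r)            # (C*r*r, grid/r, grid/r)
    tokens = folded.reshape(c * r * r, -1).T     # (N/r^2, C*r*r)
    return tokens @ w_proj                       # (N/r^2, D)

rng = np.random.default_rng(0)
feats = rng.standard_normal((784, 1024))         # 28x28 grid of DINOv3-style tokens
w = rng.standard_normal((1024 * 16, 1024)) * 0.01  # hypothetical projection weights
out = compress_tokens(feats, grid=28, r=4, w_proj=w)
print(out.shape)  # (49, 1024): 784 tokens -> 49, a 16x reduction
```

With a smaller feature grid the same 16x fold lands near the ~30 tokens per image quoted above, which is what stretches the usable context from tens of frames to hundreds.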

[CVPR]UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying

less than 1 minute read

Published:

Personal Notes

After we came across FlowEdit and reproduced it on models such as Wan, the results were surprisingly good. One idea was to build motion-transfer tasks on top of FlowEdit, with some success, but FlowEdit clearly cannot handle edits involving large shape changes, e.g., turning an elephant into a puppy, or removing an object. If, however, we had a DiT that generates semantic tokens rather than pixel-level tokens, these problems would seem tractable; examples include the CLIP-token-generating DiT in BLIP-3 and works such as RAE and ScaleRAE. A further advantage of semantic-level tokens is that they can plug seamlessly into an understanding expert that automatically judges editing strength.

Diffusion Model

[CVPR2025]Decouple Distortion from Perception: Region Adaptive Diffusion for Extreme-low Bitrate Perception Image Compression

6 minute read

Published:

Abstract

Generative image compression leverages the generative capabilities of diffusion models to achieve excellent perceptual fidelity at extreme-low bitrates. However, existing methods overlook the non-uniform complexity of images, making it difficult to balance global perceptual quality with local texture consistency and to achieve efficient allocation of coding resources. To address this issue, this paper proposes the Map-guided Masked Realistic Image Diffusion Codec (MRIDC), which aims to optimize the trade-off between local distortion and global perceptual quality in extreme-low bitrate compression. MRIDC integrates a vector-quantized image encoder with a diffusion-based decoder. At the encoding stage, the Map-guided Latent Masking (MLM) module enables adaptive resource allocation based on image complexity. At the decoding stage, the Bidirectional Prediction Controllable Generation (BPCG) module completes masked latent variables and reconstructs images. Experimental results demonstrate that MRIDC achieves state-of-the-art (SOTA) perceptual compression quality at extreme-low bitrates, effectively preserving feature consistency in key regions, advancing the perceptual rate-distortion performance curve, and establishing a new benchmark for balancing compression efficiency and visual fidelity.

Extreme-low Bitrate

[CVPR2025]Decouple Distortion from Perception: Region Adaptive Diffusion for Extreme-low Bitrate Perception Image Compression

6 minute read

Published:

Abstract

Generative image compression leverages the generative capabilities of diffusion models to achieve excellent perceptual fidelity at extreme-low bitrates. However, existing methods overlook the non-uniform complexity of images, making it difficult to balance global perceptual quality with local texture consistency and to achieve efficient allocation of coding resources. To address this issue, this paper proposes the Map-guided Masked Realistic Image Diffusion Codec (MRIDC), which aims to optimize the trade-off between local distortion and global perceptual quality in extreme-low bitrate compression. MRIDC integrates a vector-quantized image encoder with a diffusion-based decoder. At the encoding stage, the Map-guided Latent Masking (MLM) module enables adaptive resource allocation based on image complexity. At the decoding stage, the Bidirectional Prediction Controllable Generation (BPCG) module completes masked latent variables and reconstructs images. Experimental results demonstrate that MRIDC achieves state-of-the-art (SOTA) perceptual compression quality at extreme-low bitrates, effectively preserving feature consistency in key regions, advancing the perceptual rate-distortion performance curve, and establishing a new benchmark for balancing compression efficiency and visual fidelity.

ICLR2026

[ICLR2026] NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models

less than 1 minute read

Published:

Personal Notes

One open problem in current video generation: given a prompt such as "a person runs, then jumps, then walks, then runs again", which contains multiple TNAs (temporal narrative atoms), how should the model sensibly align video time with the sequence of events? A question worth thinking about. (Some related work exists, e.g., SwitchCraft, but a more promising direction seems to be building this into an autoregressive pipeline.)

Image Compression

[CVPR2025]Decouple Distortion from Perception: Region Adaptive Diffusion for Extreme-low Bitrate Perception Image Compression

6 minute read

Published:

Abstract

Generative image compression leverages the generative capabilities of diffusion models to achieve excellent perceptual fidelity at extreme-low bitrates. However, existing methods overlook the non-uniform complexity of images, making it difficult to balance global perceptual quality with local texture consistency and to achieve efficient allocation of coding resources. To address this issue, this paper proposes the Map-guided Masked Realistic Image Diffusion Codec (MRIDC), which aims to optimize the trade-off between local distortion and global perceptual quality in extreme-low bitrate compression. MRIDC integrates a vector-quantized image encoder with a diffusion-based decoder. At the encoding stage, the Map-guided Latent Masking (MLM) module enables adaptive resource allocation based on image complexity. At the decoding stage, the Bidirectional Prediction Controllable Generation (BPCG) module completes masked latent variables and reconstructs images. Experimental results demonstrate that MRIDC achieves state-of-the-art (SOTA) perceptual compression quality at extreme-low bitrates, effectively preserving feature consistency in key regions, advancing the perceptual rate-distortion performance curve, and establishing a new benchmark for balancing compression efficiency and visual fidelity.

LoRA-MoE

[AAAI2026]Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

less than 1 minute read

Published:

Personal Notes

This project took a long time, and the workload was substantial. A brief account of how it evolved: the initial goal was simply a unified model that could generate several visual effects in one video. We then found that some effects are hard to make compatible with each other, and that some are instance-level (e.g., an object vanishing or exploding) while others are frame-level (e.g., falling snow or a "world of flowers" effect). Since, at the time, no published work addressed controllable, cooperative multi-effect composition, we kept going down this path. We naturally started from ControlNet, but it introduces considerable extra computation; we later found EasyControl and felt that its attention-level mask formulation was the better fit, although in the end the quality difference between the two was, in my view, fairly small.

Perceptual Quality

[CVPR2025]Decouple Distortion from Perception: Region Adaptive Diffusion for Extreme-low Bitrate Perception Image Compression

6 minute read

Published:

Abstract

Generative image compression leverages the generative capabilities of diffusion models to achieve excellent perceptual fidelity at extreme-low bitrates. However, existing methods overlook the non-uniform complexity of images, making it difficult to balance global perceptual quality with local texture consistency and to achieve efficient allocation of coding resources. To address this issue, this paper proposes the Map-guided Masked Realistic Image Diffusion Codec (MRIDC), which aims to optimize the trade-off between local distortion and global perceptual quality in extreme-low bitrate compression. MRIDC integrates a vector-quantized image encoder with a diffusion-based decoder. At the encoding stage, the Map-guided Latent Masking (MLM) module enables adaptive resource allocation based on image complexity. At the decoding stage, the Bidirectional Prediction Controllable Generation (BPCG) module completes masked latent variables and reconstructs images. Experimental results demonstrate that MRIDC achieves state-of-the-art (SOTA) perceptual compression quality at extreme-low bitrates, effectively preserving feature consistency in key regions, advancing the perceptual rate-distortion performance curve, and establishing a new benchmark for balancing compression efficiency and visual fidelity.

Region Adaptive

[CVPR2025]Decouple Distortion from Perception: Region Adaptive Diffusion for Extreme-low Bitrate Perception Image Compression

6 minute read

Published:

Abstract

Generative image compression leverages the generative capabilities of diffusion models to achieve excellent perceptual fidelity at extreme-low bitrates. However, existing methods overlook the non-uniform complexity of images, making it difficult to balance global perceptual quality with local texture consistency and to achieve efficient allocation of coding resources. To address this issue, this paper proposes the Map-guided Masked Realistic Image Diffusion Codec (MRIDC), which aims to optimize the trade-off between local distortion and global perceptual quality in extreme-low bitrate compression. MRIDC integrates a vector-quantized image encoder with a diffusion-based decoder. At the encoding stage, the Map-guided Latent Masking (MLM) module enables adaptive resource allocation based on image complexity. At the decoding stage, the Bidirectional Prediction Controllable Generation (BPCG) module completes masked latent variables and reconstructs images. Experimental results demonstrate that MRIDC achieves state-of-the-art (SOTA) perceptual compression quality at extreme-low bitrates, effectively preserving feature consistency in key regions, advancing the perceptual rate-distortion performance curve, and establishing a new benchmark for balancing compression efficiency and visual fidelity.

Unified Model

[CVPR]UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying

less than 1 minute read

Published:

Personal Notes

After we came across FlowEdit and reproduced it on models such as Wan, the results were surprisingly good. One idea was to build motion-transfer tasks on top of FlowEdit, with some success, but FlowEdit clearly cannot handle edits involving large shape changes, e.g., turning an elephant into a puppy, or removing an object. If, however, we had a DiT that generates semantic tokens rather than pixel-level tokens, these problems would seem tractable; examples include the CLIP-token-generating DiT in BLIP-3 and works such as RAE and ScaleRAE. A further advantage of semantic-level tokens is that they can plug seamlessly into an understanding expert that automatically judges editing strength.

VFX

[AAAI2026]Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

less than 1 minute read

Published:

Personal Notes

This project took a long time, and the workload was substantial. A brief account of how it evolved: the initial goal was simply a unified model that could generate several visual effects in one video. We then found that some effects are hard to make compatible with each other, and that some are instance-level (e.g., an object vanishing or exploding) while others are frame-level (e.g., falling snow or a "world of flowers" effect). Since, at the time, no published work addressed controllable, cooperative multi-effect composition, we kept going down this path. We naturally started from ControlNet, but it introduces considerable extra computation; we later found EasyControl and felt that its attention-level mask formulation was the better fit, although in the end the quality difference between the two was, in my view, fairly small.

VLN

[CVPR2026]AstraNav-Memory: Contexts Compression for Long Memory

less than 1 minute read

Published:

Abstract

Lifelong embodied navigation requires an agent to accumulate, retain, and exploit spatial-semantic experience across tasks, so that it can explore new environments efficiently and reach goals quickly in familiar ones. Existing object-centric memory frameworks are interpretable, but they depend on detection and reconstruction pipelines, which limits robustness and scalability. This paper proposes AstraNav-Memory, an image-centric memory framework that couples an efficient visual-context compression module end-to-end with a Qwen2.5-VL-based navigation policy to realize long-horizon implicit memory. The framework builds a visual tokenizer from a ViT backbone with frozen DINOv3 features plus a lightweight PixelUnshuffle+Conv block, supporting configurable compression ratios: at 16x compression, each image is encoded into roughly 30 tokens, extending the effective context capacity from tens of images to hundreds. Experiments on the GOAT-Bench and HM3D-OVON benchmarks show state-of-the-art navigation performance, with improved exploration efficiency in unfamiliar environments and shorter paths in familiar ones. Ablations further show that a moderate compression ratio strikes the best balance between efficiency and accuracy. The study confirms that a compressed image-centric memory can serve as a practical, scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-like efficiency.

computer vision

diffusion model

embodied navigation

foresight control

multi-effects composition

spatial control

vision-language policy

visual effects generation

world model

Context Compression

[CVPR2026]AstraNav-Memory: Contexts Compression for Long Memory

less than 1 minute read

Published:

Abstract

Lifelong embodied navigation requires an agent to accumulate, retain, and exploit spatial-semantic experience across tasks, so that it can explore new environments efficiently and reach goals quickly in familiar ones. Existing object-centric memory frameworks are interpretable, but they depend on detection and reconstruction pipelines, which limits robustness and scalability. This paper proposes AstraNav-Memory, an image-centric memory framework that couples an efficient visual-context compression module end-to-end with a Qwen2.5-VL-based navigation policy to realize long-horizon implicit memory. The framework builds a visual tokenizer from a ViT backbone with frozen DINOv3 features plus a lightweight PixelUnshuffle+Conv block, supporting configurable compression ratios: at 16x compression, each image is encoded into roughly 30 tokens, extending the effective context capacity from tens of images to hundreds. Experiments on the GOAT-Bench and HM3D-OVON benchmarks show state-of-the-art navigation performance, with improved exploration efficiency in unfamiliar environments and shorter paths in familiar ones. Ablations further show that a moderate compression ratio strikes the best balance between efficiency and accuracy. The study confirms that a compressed image-centric memory can serve as a practical, scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-like efficiency.

World Model

AstraNav-World: World Model for Foresight Control and Consistency

less than 1 minute read

Published:

Personal Notes

Another piece of work with an enormous workload. What pleasantly surprised me is that, during training, LoRA alone was enough to quickly pick up the planning features from the VLA and to unify visual prediction and action prediction efficiently. Structurally, SkyReels-v4 also uses an approach similar to our MMFCA; I feel our architecture was fairly ahead of its time, and it contains many small details worth describing in depth later.

Embodied Navigation

AstraNav-World: World Model for Foresight Control and Consistency

less than 1 minute read

Published:

Personal Notes

Another piece of work with an enormous workload. What pleasantly surprised me is that, during training, LoRA alone was enough to quickly pick up the planning features from the VLA and to unify visual prediction and action prediction efficiently. Structurally, SkyReels-v4 also uses an approach similar to our MMFCA; I feel our architecture was fairly ahead of its time, and it contains many small details worth describing in depth later.

Foresight Control

AstraNav-World: World Model for Foresight Control and Consistency

less than 1 minute read

Published:

Personal Notes

Another piece of work with an enormous workload. What pleasantly surprised me is that, during training, LoRA alone was enough to quickly pick up the planning features from the VLA and to unify visual prediction and action prediction efficiently. Structurally, SkyReels-v4 also uses an approach similar to our MMFCA; I feel our architecture was fairly ahead of its time, and it contains many small details worth describing in depth later.

Region Adaptive

Image Compression

Multimodal Large Models

[ICLR2026] NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models

less than 1 minute read

Published:

Personal Notes

One open problem in current video generation: given a prompt such as "a person runs, then jumps, then walks, then runs again", which contains multiple TNAs (temporal narrative atoms), how should the model sensibly align video time with the sequence of events? A question worth thinking about. (Some related work exists, e.g., SwitchCraft, but a more promising direction seems to be building this into an autoregressive pipeline.)

Multi-Effect Composition

[AAAI2026]Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

less than 1 minute read

Published:

Personal Notes

This project took a long time, and the workload was substantial. A brief account of how it evolved: the initial goal was simply a unified model that could generate several visual effects in one video. We then found that some effects are hard to make compatible with each other, and that some are instance-level (e.g., an object vanishing or exploding) while others are frame-level (e.g., falling snow or a "world of flowers" effect). Since, at the time, no published work addressed controllable, cooperative multi-effect composition, we kept going down this path. We naturally started from ControlNet, but it introduces considerable extra computation; we later found EasyControl and felt that its attention-level mask formulation was the better fit, although in the end the quality difference between the two was, in my view, fairly small.

Perceptual Quality

Diffusion Model

AstraNav-World: World Model for Foresight Control and Consistency

less than 1 minute read

Published:

Personal Notes

Another piece of work with an enormous workload. What pleasantly surprised me is that, during training, LoRA alone was enough to quickly pick up the planning features from the VLA and to unify visual prediction and action prediction efficiently. Structurally, SkyReels-v4 also uses an approach similar to our MMFCA; I feel our architecture was fairly ahead of its time, and it contains many small details worth describing in depth later.

Diffusion Generation

Training-free Image Editing

[CVPR]UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying

less than 1 minute read

Published:

Personal Notes

After we came across FlowEdit and reproduced it on models such as Wan, the results were surprisingly good. One idea was to build motion-transfer tasks on top of FlowEdit, with some success, but FlowEdit clearly cannot handle edits involving large shape changes, e.g., turning an elephant into a puppy, or removing an object. If, however, we had a DiT that generates semantic tokens rather than pixel-level tokens, these problems would seem tractable; examples include the CLIP-token-generating DiT in BLIP-3 and works such as RAE and ScaleRAE. A further advantage of semantic-level tokens is that they can plug seamlessly into an understanding expert that automatically judges editing strength.

Extreme-low Bitrate

Visual Tokenizer

[CVPR2026]AstraNav-Memory: Contexts Compression for Long Memory

less than 1 minute read

Published:

Abstract

Lifelong embodied navigation requires an agent to accumulate, retain, and exploit spatial-semantic experience across tasks, so that it can explore new environments efficiently and reach goals quickly in familiar ones. Existing object-centric memory frameworks are interpretable, but they depend on detection and reconstruction pipelines, which limits robustness and scalability. This paper proposes AstraNav-Memory, an image-centric memory framework that couples an efficient visual-context compression module end-to-end with a Qwen2.5-VL-based navigation policy to realize long-horizon implicit memory. The framework builds a visual tokenizer from a ViT backbone with frozen DINOv3 features plus a lightweight PixelUnshuffle+Conv block, supporting configurable compression ratios: at 16x compression, each image is encoded into roughly 30 tokens, extending the effective context capacity from tens of images to hundreds. Experiments on the GOAT-Bench and HM3D-OVON benchmarks show state-of-the-art navigation performance, with improved exploration efficiency in unfamiliar environments and shorter paths in familiar ones. Ablations further show that a moderate compression ratio strikes the best balance between efficiency and accuracy. The study confirms that a compressed image-centric memory can serve as a practical, scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-like efficiency.

Vision-Language Policy

AstraNav-World: World Model for Foresight Control and Consistency

less than 1 minute read

Published:

Personal Notes

Another piece of work with an enormous workload. What pleasantly surprised me is that, during training, LoRA alone was enough to quickly pick up the planning features from the VLA and to unify visual prediction and action prediction efficiently. Structurally, SkyReels-v4 also uses an approach similar to our MMFCA; I feel our architecture was fairly ahead of its time, and it contains many small details worth describing in depth later.

Long-term Memory

[CVPR2026]AstraNav-Memory: Contexts Compression for Long Memory

less than 1 minute read

Published:

Abstract

Lifelong embodied navigation requires an agent to accumulate, retain, and exploit spatial-semantic experience across tasks, so that it can explore new environments efficiently and reach goals quickly in familiar ones. Existing object-centric memory frameworks are interpretable, but they depend on detection and reconstruction pipelines, which limits robustness and scalability. This paper proposes AstraNav-Memory, an image-centric memory framework that couples an efficient visual-context compression module end-to-end with a Qwen2.5-VL-based navigation policy to realize long-horizon implicit memory. The framework builds a visual tokenizer from a ViT backbone with frozen DINOv3 features plus a lightweight PixelUnshuffle+Conv block, supporting configurable compression ratios: at 16x compression, each image is encoded into roughly 30 tokens, extending the effective context capacity from tens of images to hundreds. Experiments on the GOAT-Bench and HM3D-OVON benchmarks show state-of-the-art navigation performance, with improved exploration efficiency in unfamiliar environments and shorter paths in familiar ones. Ablations further show that a moderate compression ratio strikes the best balance between efficiency and accuracy. The study confirms that a compressed image-centric memory can serve as a practical, scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-like efficiency.

Long Video Generation

[ICLR2026] NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models

less than 1 minute read

Published:

Personal Notes

One open problem in current video generation: given a prompt such as "a person runs, then jumps, then walks, then runs again", which contains multiple TNAs (temporal narrative atoms), how should the model sensibly align video time with the sequence of events? A question worth thinking about. (Some related work exists, e.g., SwitchCraft, but a more promising direction seems to be building this into an autoregressive pipeline.)