Posts by Tags

AAAI2026

[AAAI2026]Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

less than 1 minute read

Published:

Personal Notes

This project took a long time, and the workload was substantial. A brief account of how it evolved: the initial goal was simply a unified model that could generate several visual effects in one video. We then found that some effects are hard to make compatible with each other, and that some are instance-level (e.g., an object vanishing or exploding) while others are frame-level (e.g., falling snow or a "world of flowers" effect). Since, at the time, no published work addressed controllable, cooperative multi-effect composition, we kept going down this path. We naturally started from ControlNet, but it introduces considerable extra computation; we later found EasyControl and felt that its attention-level mask formulation was the better fit, although in the end the quality difference between the two was, in my view, fairly small.
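The attention-level masking idea can be illustrated with a toy sketch (an illustrative assumption, not the paper's or EasyControl's actual implementation): each effect's condition tokens exchange attention only with image tokens inside that effect's spatial region, so several effects can be conditioned jointly without extra ControlNet branches. Token counts and region layouts below are made up for the example.

```python
import numpy as np

def effect_attention_mask(n_img, regions):
    """Build a boolean attention mask over [image tokens | condition tokens].
    regions: one boolean (n_img,) array per effect; for simplicity each effect
    contributes a single condition token appended after the image tokens."""
    n_cond = len(regions)
    n = n_img + n_cond
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_img, :n_img] = True          # image tokens attend to each other freely
    for j, region in enumerate(regions):
        col = n_img + j
        mask[:n_img, col] = region       # image -> condition only inside the region
        mask[col, :n_img] = region       # condition token reads only its region
        mask[col, col] = True            # condition token attends to itself
    return mask

n_img = 8
regions = [np.array([1, 1, 1, 1, 0, 0, 0, 0], bool),   # effect 1: left half
           np.array([0, 0, 0, 0, 1, 1, 1, 1], bool)]   # effect 2: right half
m = effect_attention_mask(n_img, regions)
print(m.shape, m[0, 8], m[5, 8])  # (10, 10) True False
```

The mask would be applied additively (as `-inf` on `False` entries) to the attention logits, which is where the compute saving over a full ControlNet branch comes from.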

BenchMark

[ICLR2026] NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models

less than 1 minute read

Published:

Personal Notes

One open problem in current video generation: given a prompt such as "a person runs, then jumps, then walks, then runs again", which contains multiple TNAs (temporal narrative atoms), how should the model sensibly align video time with the sequence of events? A question worth thinking about. (Some related work exists, e.g., SwitchCraft, but a more promising direction seems to be building this into an autoregressive pipeline.)

CVPR2025

[CVPR2025]Decouple Distortion from Perception: Region Adaptive Diffusion for Extreme-low Bitrate Perception Image Compression

6 minute read

Published:

Abstract

Generative image compression leverages the generative capabilities of diffusion models to achieve excellent perceptual fidelity at extreme-low bitrates. However, existing methods overlook the non-uniform complexity of images, making it difficult to balance global perceptual quality with local texture consistency and to achieve efficient allocation of coding resources. To address this issue, this paper proposes the Map-guided Masked Realistic Image Diffusion Codec (MRIDC), which aims to optimize the trade-off between local distortion and global perceptual quality in extreme-low bitrate compression. MRIDC integrates a vector-quantized image encoder with a diffusion-based decoder. At the encoding stage, the Map-guided Latent Masking (MLM) module enables adaptive resource allocation based on image complexity. At the decoding stage, the Bidirectional Prediction Controllable Generation (BPCG) module completes masked latent variables and reconstructs images. Experimental results demonstrate that MRIDC achieves state-of-the-art (SOTA) perceptual compression quality at extreme-low bitrates, effectively preserving feature consistency in key regions, advancing the perceptual rate-distortion performance curve, and establishing a new benchmark for balancing compression efficiency and visual fidelity.
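As a rough illustration of what a map-guided latent mask might do (a hypothetical sketch, not MRIDC's actual MLM module): keep and entropy-code only the latent positions with the highest complexity scores, leaving the masked positions for the diffusion decoder to complete. The keep ratio and complexity score below are assumptions for the example.

```python
import numpy as np

def map_guided_mask(latents, complexity, keep_ratio=0.25):
    """latents: (N, D) latent vectors; complexity: (N,) per-position score.
    Returns the coded subset and the boolean keep-mask; non-kept positions
    are left for the decoder-side generative model to inpaint."""
    n = latents.shape[0]
    k = max(1, int(n * keep_ratio))
    keep = np.argsort(complexity)[-k:]   # indices of the k most complex positions
    mask = np.zeros(n, dtype=bool)
    mask[keep] = True
    return latents[mask], mask           # only these positions are entropy-coded

rng = np.random.default_rng(1)
latents = rng.standard_normal((64, 8))
complexity = np.abs(latents).mean(axis=1)   # stand-in complexity score
coded, mask = map_guided_mask(latents, complexity, keep_ratio=0.25)
print(coded.shape, int(mask.sum()))  # (16, 8) 16
```

Lowering `keep_ratio` trades bitrate for reconstruction burden on the diffusion decoder, which is exactly the local-distortion vs. global-perception trade-off the abstract describes.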

CVPR2026

[CVPR2026]AstraNav-Memory: Contexts Compression for Long Memory

less than 1 minute read

Published:

Abstract

Lifelong embodied navigation requires an agent to accumulate, retain, and exploit spatial-semantic experience across tasks, so that it can explore new environments efficiently and reach goals quickly in familiar ones. Existing object-centric memory frameworks are interpretable, but they depend on detection and reconstruction pipelines, which limits robustness and scalability. This paper proposes AstraNav-Memory, an image-centric memory framework that couples an efficient visual-context compression module end-to-end with a Qwen2.5-VL-based navigation policy to realize long-horizon implicit memory. The framework builds a visual tokenizer from a ViT backbone with frozen DINOv3 features plus a lightweight PixelUnshuffle+Conv block, supporting configurable compression ratios: at 16x compression, each image is encoded into roughly 30 tokens, extending the effective context capacity from tens of images to hundreds. Experiments on the GOAT-Bench and HM3D-OVON benchmarks show state-of-the-art navigation performance, with improved exploration efficiency in unfamiliar environments and shorter paths in familiar ones. Ablations further show that a moderate compression ratio strikes the best balance between efficiency and accuracy. The study confirms that a compressed image-centric memory can serve as a practical, scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-like efficiency.
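The token-count arithmetic behind the PixelUnshuffle+Conv tokenizer can be sketched in plain NumPy (a minimal illustration; the grid size, feature dimension, and projection weights are assumptions, not the paper's configuration): folding each 4x4 spatial neighbourhood of patch features into the channel dimension reduces the token count by 4^2 = 16x, after which a 1x1-conv-style projection restores the feature width.

```python
import numpy as np

def pixel_unshuffle(x, r):
    """Fold each r x r spatial block into channels: (C, H, W) -> (C*r*r, H/r, W/r)."""
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)       # (C, r, r, H/r, W/r)
    return x.reshape(c * r * r, h // r, w // r)

def compress_tokens(feats, grid, r, w_proj):
    """feats: (N, C) ViT patch tokens with N = grid*grid; returns (N/r^2, D) tokens.
    w_proj plays the role of a 1x1 conv mixing the folded channels."""
    n, c = feats.shape
    fmap = feats.T.reshape(c, grid, grid)        # back to a spatial feature map
    folded = pixel_unshuffle(fmap, r)            # (C*r*r, grid/r, grid/r)
    tokens = folded.reshape(c * r * r, -1).T     # (N/r^2, C*r*r)
    return tokens @ w_proj                       # (N/r^2, D)

rng = np.random.default_rng(0)
feats = rng.standard_normal((784, 1024))         # 28x28 grid of DINOv3-style tokens
w = rng.standard_normal((1024 * 16, 1024)) * 0.01  # hypothetical projection weights
out = compress_tokens(feats, grid=28, r=4, w_proj=w)
print(out.shape)  # (49, 1024): 784 tokens -> 49, a 16x reduction
```

With a smaller feature grid the same 16x fold lands near the ~30 tokens per image quoted above, which is what stretches the usable context from tens of frames to hundreds.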

[CVPR]UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying

less than 1 minute read

Published:

Personal Notes

After we came across FlowEdit and reproduced it on models such as Wan, the results were surprisingly good. One idea was to build motion-transfer tasks on top of FlowEdit, with some success, but FlowEdit clearly cannot handle edits involving large shape changes, e.g., turning an elephant into a puppy, or removing an object. If, however, we had a DiT that generates semantic tokens rather than pixel-level tokens, these problems would seem tractable; examples include the CLIP-token-generating DiT in BLIP-3 and works such as RAE and ScaleRAE. A further advantage of semantic-level tokens is that they can plug seamlessly into an understanding expert that automatically judges editing strength.

Diffusion Model

[CVPR2025]Decouple Distortion from Perception: Region Adaptive Diffusion for Extreme-low Bitrate Perception Image Compression

6 minute read

Published:

Abstract

Generative image compression leverages the generative capabilities of diffusion models to achieve excellent perceptual fidelity at extreme-low bitrates. However, existing methods overlook the non-uniform complexity of images, making it difficult to balance global perceptual quality with local texture consistency and to achieve efficient allocation of coding resources. To address this issue, this paper proposes the Map-guided Masked Realistic Image Diffusion Codec (MRIDC), which aims to optimize the trade-off between local distortion and global perceptual quality in extreme-low bitrate compression. MRIDC integrates a vector-quantized image encoder with a diffusion-based decoder. At the encoding stage, the Map-guided Latent Masking (MLM) module enables adaptive resource allocation based on image complexity. At the decoding stage, the Bidirectional Prediction Controllable Generation (BPCG) module completes masked latent variables and reconstructs images. Experimental results demonstrate that MRIDC achieves state-of-the-art (SOTA) perceptual compression quality at extreme-low bitrates, effectively preserving feature consistency in key regions, advancing the perceptual rate-distortion performance curve, and establishing a new benchmark for balancing compression efficiency and visual fidelity.

Extreme-low Bitrate

[CVPR2025]Decouple Distortion from Perception: Region Adaptive Diffusion for Extreme-low Bitrate Perception Image Compression

6 minute read

Published:

Abstract

Generative image compression leverages the generative capabilities of diffusion models to achieve excellent perceptual fidelity at extreme-low bitrates. However, existing methods overlook the non-uniform complexity of images, making it difficult to balance global perceptual quality with local texture consistency and to achieve efficient allocation of coding resources. To address this issue, this paper proposes the Map-guided Masked Realistic Image Diffusion Codec (MRIDC), which aims to optimize the trade-off between local distortion and global perceptual quality in extreme-low bitrate compression. MRIDC integrates a vector-quantized image encoder with a diffusion-based decoder. At the encoding stage, the Map-guided Latent Masking (MLM) module enables adaptive resource allocation based on image complexity. At the decoding stage, the Bidirectional Prediction Controllable Generation (BPCG) module completes masked latent variables and reconstructs images. Experimental results demonstrate that MRIDC achieves state-of-the-art (SOTA) perceptual compression quality at extreme-low bitrates, effectively preserving feature consistency in key regions, advancing the perceptual rate-distortion performance curve, and establishing a new benchmark for balancing compression efficiency and visual fidelity.

ICLR2026

[ICLR2026] NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models

less than 1 minute read

Published:

Personal Notes

One open problem in current video generation: given a prompt such as "a person runs, then jumps, then walks, then runs again", which contains multiple TNAs (temporal narrative atoms), how should the model sensibly align video time with the sequence of events? A question worth thinking about. (Some related work exists, e.g., SwitchCraft, but a more promising direction seems to be building this into an autoregressive pipeline.)

Image Compression

[CVPR2025]Decouple Distortion from Perception: Region Adaptive Diffusion for Extreme-low Bitrate Perception Image Compression

6 minute read

Published:

Abstract

Generative image compression leverages the generative capabilities of diffusion models to achieve excellent perceptual fidelity at extreme-low bitrates. However, existing methods overlook the non-uniform complexity of images, making it difficult to balance global perceptual quality with local texture consistency and to achieve efficient allocation of coding resources. To address this issue, this paper proposes the Map-guided Masked Realistic Image Diffusion Codec (MRIDC), which aims to optimize the trade-off between local distortion and global perceptual quality in extreme-low bitrate compression. MRIDC integrates a vector-quantized image encoder with a diffusion-based decoder. At the encoding stage, the Map-guided Latent Masking (MLM) module enables adaptive resource allocation based on image complexity. At the decoding stage, the Bidirectional Prediction Controllable Generation (BPCG) module completes masked latent variables and reconstructs images. Experimental results demonstrate that MRIDC achieves state-of-the-art (SOTA) perceptual compression quality at extreme-low bitrates, effectively preserving feature consistency in key regions, advancing the perceptual rate-distortion performance curve, and establishing a new benchmark for balancing compression efficiency and visual fidelity.

LoRA-MoE

[AAAI2026]Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

less than 1 minute read

Published:

Personal Notes

This project took a long time, and the workload was substantial. A brief account of how it evolved: the initial goal was simply a unified model that could generate several visual effects in one video. We then found that some effects are hard to make compatible with each other, and that some are instance-level (e.g., an object vanishing or exploding) while others are frame-level (e.g., falling snow or a "world of flowers" effect). Since, at the time, no published work addressed controllable, cooperative multi-effect composition, we kept going down this path. We naturally started from ControlNet, but it introduces considerable extra computation; we later found EasyControl and felt that its attention-level mask formulation was the better fit, although in the end the quality difference between the two was, in my view, fairly small.

Perceptual Quality

[CVPR2025]Decouple Distortion from Perception: Region Adaptive Diffusion for Extreme-low Bitrate Perception Image Compression

6 minute read

Published:

Abstract

Generative image compression leverages the generative capabilities of diffusion models to achieve excellent perceptual fidelity at extreme-low bitrates. However, existing methods overlook the non-uniform complexity of images, making it difficult to balance global perceptual quality with local texture consistency and to achieve efficient allocation of coding resources. To address this issue, this paper proposes the Map-guided Masked Realistic Image Diffusion Codec (MRIDC), which aims to optimize the trade-off between local distortion and global perceptual quality in extreme-low bitrate compression. MRIDC integrates a vector-quantized image encoder with a diffusion-based decoder. At the encoding stage, the Map-guided Latent Masking (MLM) module enables adaptive resource allocation based on image complexity. At the decoding stage, the Bidirectional Prediction Controllable Generation (BPCG) module completes masked latent variables and reconstructs images. Experimental results demonstrate that MRIDC achieves state-of-the-art (SOTA) perceptual compression quality at extreme-low bitrates, effectively preserving feature consistency in key regions, advancing the perceptual rate-distortion performance curve, and establishing a new benchmark for balancing compression efficiency and visual fidelity.

Region Adaptive

[CVPR2025]Decouple Distortion from Perception: Region Adaptive Diffusion for Extreme-low Bitrate Perception Image Compression

6 minute read

Published:

Abstract

Generative image compression leverages the generative capabilities of diffusion models to achieve excellent perceptual fidelity at extreme-low bitrates. However, existing methods overlook the non-uniform complexity of images, making it difficult to balance global perceptual quality with local texture consistency and to achieve efficient allocation of coding resources. To address this issue, this paper proposes the Map-guided Masked Realistic Image Diffusion Codec (MRIDC), which aims to optimize the trade-off between local distortion and global perceptual quality in extreme-low bitrate compression. MRIDC integrates a vector-quantized image encoder with a diffusion-based decoder. At the encoding stage, the Map-guided Latent Masking (MLM) module enables adaptive resource allocation based on image complexity. At the decoding stage, the Bidirectional Prediction Controllable Generation (BPCG) module completes masked latent variables and reconstructs images. Experimental results demonstrate that MRIDC achieves state-of-the-art (SOTA) perceptual compression quality at extreme-low bitrates, effectively preserving feature consistency in key regions, advancing the perceptual rate-distortion performance curve, and establishing a new benchmark for balancing compression efficiency and visual fidelity.

Unified Model

[CVPR]UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying

less than 1 minute read

Published:

Personal Notes

After we came across FlowEdit and reproduced it on models such as Wan, the results were surprisingly good. One idea was to build motion-transfer tasks on top of FlowEdit, with some success, but FlowEdit clearly cannot handle edits involving large shape changes, e.g., turning an elephant into a puppy, or removing an object. If, however, we had a DiT that generates semantic tokens rather than pixel-level tokens, these problems would seem tractable; examples include the CLIP-token-generating DiT in BLIP-3 and works such as RAE and ScaleRAE. A further advantage of semantic-level tokens is that they can plug seamlessly into an understanding expert that automatically judges editing strength.

VFX

[AAAI2026]Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

less than 1 minute read

Published:

Personal Notes

This project took a long time, and the workload was substantial. A brief account of how it evolved: the initial goal was simply a unified model that could generate several visual effects in one video. We then found that some effects are hard to make compatible with each other, and that some are instance-level (e.g., an object vanishing or exploding) while others are frame-level (e.g., falling snow or a "world of flowers" effect). Since, at the time, no published work addressed controllable, cooperative multi-effect composition, we kept going down this path. We naturally started from ControlNet, but it introduces considerable extra computation; we later found EasyControl and felt that its attention-level mask formulation was the better fit, although in the end the quality difference between the two was, in my view, fairly small.

VLN

[CVPR2026]AstraNav-Memory: Contexts Compression for Long Memory

less than 1 minute read

Published:

Abstract

Lifelong embodied navigation requires an agent to accumulate, retain, and exploit spatial-semantic experience across tasks, so that it can explore new environments efficiently and reach goals quickly in familiar ones. Existing object-centric memory frameworks are interpretable, but they depend on detection and reconstruction pipelines, which limits robustness and scalability. This paper proposes AstraNav-Memory, an image-centric memory framework that couples an efficient visual-context compression module end-to-end with a Qwen2.5-VL-based navigation policy to realize long-horizon implicit memory. The framework builds a visual tokenizer from a ViT backbone with frozen DINOv3 features plus a lightweight PixelUnshuffle+Conv block, supporting configurable compression ratios: at 16x compression, each image is encoded into roughly 30 tokens, extending the effective context capacity from tens of images to hundreds. Experiments on the GOAT-Bench and HM3D-OVON benchmarks show state-of-the-art navigation performance, with improved exploration efficiency in unfamiliar environments and shorter paths in familiar ones. Ablations further show that a moderate compression ratio strikes the best balance between efficiency and accuracy. The study confirms that a compressed image-centric memory can serve as a practical, scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-like efficiency.

computer vision

diffusion model

embodied navigation

foresight control

multi-effects composition

spatial control

vision-language policy

visual effects generation

world model

Context Compression

[CVPR2026]AstraNav-Memory: Contexts Compression for Long Memory

less than 1 minute read

Published:

Abstract

Lifelong embodied navigation requires an agent to accumulate, retain, and exploit spatial-semantic experience across tasks, so that it can explore new environments efficiently and reach goals quickly in familiar ones. Existing object-centric memory frameworks are interpretable, but they depend on detection and reconstruction pipelines, which limits robustness and scalability. This paper proposes AstraNav-Memory, an image-centric memory framework that couples an efficient visual-context compression module end-to-end with a Qwen2.5-VL-based navigation policy to realize long-horizon implicit memory. The framework builds a visual tokenizer from a ViT backbone with frozen DINOv3 features plus a lightweight PixelUnshuffle+Conv block, supporting configurable compression ratios: at 16x compression, each image is encoded into roughly 30 tokens, extending the effective context capacity from tens of images to hundreds. Experiments on the GOAT-Bench and HM3D-OVON benchmarks show state-of-the-art navigation performance, with improved exploration efficiency in unfamiliar environments and shorter paths in familiar ones. Ablations further show that a moderate compression ratio strikes the best balance between efficiency and accuracy. The study confirms that a compressed image-centric memory can serve as a practical, scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-like efficiency.

World Model

AstraNav-World: World Model for Foresight Control and Consistency

less than 1 minute read

Published:

Personal Notes

Another piece of work with an enormous workload. What pleasantly surprised me is that, during training, LoRA alone was enough to quickly pick up the planning features from the VLA and to unify visual prediction and action prediction efficiently. Structurally, SkyReels-v4 also uses an approach similar to our MMFCA; I feel our architecture was fairly ahead of its time, and it contains many small details worth describing in depth later.

Embodied Navigation

AstraNav-World: World Model for Foresight Control and Consistency

less than 1 minute read

Published:

Personal Notes

Another piece of work with an enormous workload. What pleasantly surprised me is that, during training, LoRA alone was enough to quickly pick up the planning features from the VLA and to unify visual prediction and action prediction efficiently. Structurally, SkyReels-v4 also uses an approach similar to our MMFCA; I feel our architecture was fairly ahead of its time, and it contains many small details worth describing in depth later.

Foresight Control

AstraNav-World: World Model for Foresight Control and Consistency

less than 1 minute read

Published:

Personal Notes

Another piece of work with an enormous workload. What pleasantly surprised me is that, during training, LoRA alone was enough to quickly pick up the planning features from the VLA and to unify visual prediction and action prediction efficiently. Structurally, SkyReels-v4 also uses an approach similar to our MMFCA; I feel our architecture was fairly ahead of its time, and it contains many small details worth describing in depth later.

Region Adaptive

Image Compression

Multimodal Large Models

[ICLR2026] NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models

less than 1 minute read

Published:

Personal Notes

One open problem in current video generation: given a prompt such as "a person runs, then jumps, then walks, then runs again", which contains multiple TNAs (temporal narrative atoms), how should the model sensibly align video time with the sequence of events? A question worth thinking about. (Some related work exists, e.g., SwitchCraft, but a more promising direction seems to be building this into an autoregressive pipeline.)

Multi-Effect Composition

[AAAI2026]Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

less than 1 minute read

Published:

Personal Notes

This project took a long time, and the workload was substantial. A brief account of how it evolved: the initial goal was simply a unified model that could generate several visual effects in one video. We then found that some effects are hard to make compatible with each other, and that some are instance-level (e.g., an object vanishing or exploding) while others are frame-level (e.g., falling snow or a "world of flowers" effect). Since, at the time, no published work addressed controllable, cooperative multi-effect composition, we kept going down this path. We naturally started from ControlNet, but it introduces considerable extra computation; we later found EasyControl and felt that its attention-level mask formulation was the better fit, although in the end the quality difference between the two was, in my view, fairly small.

Perceptual Quality

Diffusion Model

AstraNav-World: World Model for Foresight Control and Consistency

less than 1 minute read

Published:

Personal Notes

Another piece of work with an enormous workload. What pleasantly surprised me is that, during training, LoRA alone was enough to quickly pick up the planning features from the VLA and to unify visual prediction and action prediction efficiently. Structurally, SkyReels-v4 also uses an approach similar to our MMFCA; I feel our architecture was fairly ahead of its time, and it contains many small details worth describing in depth later.

Diffusion Generation

Training-free Image Editing

[CVPR]UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying

less than 1 minute read

Published:

Personal Notes

After we came across FlowEdit and reproduced it on models such as Wan, the results were surprisingly good. One idea was to build motion-transfer tasks on top of FlowEdit, with some success, but FlowEdit clearly cannot handle edits involving large shape changes, e.g., turning an elephant into a puppy, or removing an object. If, however, we had a DiT that generates semantic tokens rather than pixel-level tokens, these problems would seem tractable; examples include the CLIP-token-generating DiT in BLIP-3 and works such as RAE and ScaleRAE. A further advantage of semantic-level tokens is that they can plug seamlessly into an understanding expert that automatically judges editing strength.

Extreme-low Bitrate

Visual Tokenizer

[CVPR2026]AstraNav-Memory: Contexts Compression for Long Memory

less than 1 minute read

Published:

Abstract

Lifelong embodied navigation requires an agent to accumulate, retain, and exploit spatial-semantic experience across tasks, so that it can explore new environments efficiently and reach goals quickly in familiar ones. Existing object-centric memory frameworks are interpretable, but they depend on detection and reconstruction pipelines, which limits robustness and scalability. This paper proposes AstraNav-Memory, an image-centric memory framework that couples an efficient visual-context compression module end-to-end with a Qwen2.5-VL-based navigation policy to realize long-horizon implicit memory. The framework builds a visual tokenizer from a ViT backbone with frozen DINOv3 features plus a lightweight PixelUnshuffle+Conv block, supporting configurable compression ratios: at 16x compression, each image is encoded into roughly 30 tokens, extending the effective context capacity from tens of images to hundreds. Experiments on the GOAT-Bench and HM3D-OVON benchmarks show state-of-the-art navigation performance, with improved exploration efficiency in unfamiliar environments and shorter paths in familiar ones. Ablations further show that a moderate compression ratio strikes the best balance between efficiency and accuracy. The study confirms that a compressed image-centric memory can serve as a practical, scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-like efficiency.

Vision-Language Policy

AstraNav-World: World Model for Foresight Control and Consistency

less than 1 minute read

Published:

Personal Notes

Another piece of work with an enormous workload. What pleasantly surprised me is that, during training, LoRA alone was enough to quickly pick up the planning features from the VLA and to unify visual prediction and action prediction efficiently. Structurally, SkyReels-v4 also uses an approach similar to our MMFCA; I feel our architecture was fairly ahead of its time, and it contains many small details worth describing in depth later.

Long-term Memory

[CVPR2026]AstraNav-Memory: Contexts Compression for Long Memory

less than 1 minute read

Published:

Abstract

Lifelong embodied navigation requires an agent to accumulate, retain, and exploit spatial-semantic experience across tasks, so that it can explore new environments efficiently and reach goals quickly in familiar ones. Existing object-centric memory frameworks are interpretable, but they depend on detection and reconstruction pipelines, which limits robustness and scalability. This paper proposes AstraNav-Memory, an image-centric memory framework that couples an efficient visual-context compression module end-to-end with a Qwen2.5-VL-based navigation policy to realize long-horizon implicit memory. The framework builds a visual tokenizer from a ViT backbone with frozen DINOv3 features plus a lightweight PixelUnshuffle+Conv block, supporting configurable compression ratios: at 16x compression, each image is encoded into roughly 30 tokens, extending the effective context capacity from tens of images to hundreds. Experiments on the GOAT-Bench and HM3D-OVON benchmarks show state-of-the-art navigation performance, with improved exploration efficiency in unfamiliar environments and shorter paths in familiar ones. Ablations further show that a moderate compression ratio strikes the best balance between efficiency and accuracy. The study confirms that a compressed image-centric memory can serve as a practical, scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-like efficiency.

Long Video Generation

[ICLR2026] NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models

less than 1 minute read

Published:

Personal Notes

One open problem in current video generation: given a prompt such as "a person runs, then jumps, then walks, then runs again", which contains multiple TNAs (temporal narrative atoms), how should the model sensibly align video time with the sequence of events? A question worth thinking about. (Some related work exists, e.g., SwitchCraft, but a more promising direction seems to be building this into an autoregressive pipeline.)