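Today, the Colossal-AI team released Open-Sora 1.0, a Sora-like text-to-video model. The project is fully open source and is described as the first Sora-like video generation project built on the technical approach behind Sora.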

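According to the team, the open-sourced model can generate 2~5 second 512x512 videos after only 3 days of training, with cost reduced by 46% and sequence length extended to nearly one million. Because of limited compute and training volume, the videos shown so far are short, but judging from the footage, the visual quality is very high.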

Video generated by the Open-Sora 1.0 model

A bustling city street at night, filled with the glow of car headlights and the ambient light of streetlights. The scene is a blur of motion, with cars speeding by and pedestrians navigating the crosswalks. The cityscape is a mix of towering buildings and illuminated signs, creating a vibrant and dynamic atmosphere. The perspective of the video is from a high angle, providing a bird's eye view of the street and its surroundings. The overall style of the video is dynamic and energetic, capturing the essence of urban life at night.


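When OpenAI released its Sora text-to-video demo, the striking quality of the videos stunned the AI community, and researchers have been studying its technical principles ever since. The general conclusion is that Sora's key technology builds on DiT (Diffusion Transformer), the architecture (or approach) introduced by William Peebles and Saining Xie and published in 2023. Other techniques include turning visual data into patches, a video compression network, spacetime latent patches, and scaling transformers for video generation.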
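To make the idea of spacetime patches concrete, here is a minimal NumPy sketch that cuts a latent video into non-overlapping blocks spanning time, height, and width, and flattens each block into one token. The function name `spacetime_patchify`, the patch sizes, and the `(T, H, W, C)` layout are assumptions chosen for illustration; neither Sora's nor Open-Sora's actual patching code is reproduced here.

```python
import numpy as np

def spacetime_patchify(latent, pt, ph, pw):
    """Split a (T, H, W, C) latent video into flattened spacetime patches.

    Returns an array of shape (num_patches, pt * ph * pw * C).
    """
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dims must divide evenly"
    # Break each axis into (number of patches, patch size).
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Move the three patch-index axes to the front, patch contents to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per patch: this is the token sequence a transformer would consume.
    return x.reshape(-1, pt * ph * pw * C)

# Hypothetical latent: 4 frames of 8x8 with 3 channels.
latent = np.arange(4 * 8 * 8 * 3, dtype=np.float32).reshape(4, 8, 8, 3)
tokens = spacetime_patchify(latent, pt=2, ph=4, pw=4)
print(tokens.shape)  # (8, 96)
```

Each of the 8 tokens covers 2 frames and a 4x4 spatial window, so the sequence length grows with both video duration and resolution.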




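Open-Sora 1.0 also adopts the DiT (Diffusion Transformer) architecture. The Colossal-AI team extends it with a temporal attention layer, so the model handles spatial information and also captures relationships across time. The model is obtained by extending the open-source text-to-image model PixArt-α to video data. The overall architecture consists of the pretrained VAE from Stability AI's Stable Video, a T5 text encoder, and STDiT (Spatial Temporal Diffusion Transformer), a model that uses a spatial-temporal attention mechanism.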
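The following sketch illustrates the general idea of alternating spatial and temporal attention, assuming tokens arranged as `(T, S, D)` (frames, spatial tokens per frame, channels). It is a deliberately simplified illustration: the attention here has a single head and no learned projections, and the text cross-attention and feed-forward layers of a real transformer block are omitted. The function names are hypothetical and are not taken from the Open-Sora code.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention over the token axis (second to last), no learned weights."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def spatial_temporal_block(tokens):
    """tokens: (T, S, D). Applies spatial attention, then temporal attention."""
    # Spatial attention: each frame attends over its own S tokens.
    x = tokens + self_attention(tokens)
    # Temporal attention: each spatial position attends over the T frames.
    xt = np.swapaxes(x, 0, 1)        # (S, T, D)
    xt = xt + self_attention(xt)
    return np.swapaxes(xt, 0, 1)     # back to (T, S, D)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 6, 8))   # 4 frames, 6 tokens per frame, 8 channels
out = spatial_temporal_block(tokens)
print(out.shape)  # (4, 6, 8)
```

Splitting attention this way means each attention call sees either S or T tokens rather than all T * S at once, which is one common motivation for factorized spatial-temporal designs.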


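The training and inference pipeline is as follows. During training, the encoder of a pretrained Variational Autoencoder (VAE) first compresses the video data; the STDiT diffusion model is then trained in the compressed latent space together with text embeddings. At inference time, Gaussian noise is randomly sampled in the VAE's latent space and fed into STDiT together with the prompt embedding to obtain denoised features, which are finally passed to the VAE decoder to produce the video.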
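The inference flow described above can be sketched as a short loop. Everything model-specific below is a placeholder: `toy_denoiser` stands in for STDiT, `toy_decoder` stands in for the VAE decoder, the prompt embedding is a constant array, and the update rule is a toy step rather than a real diffusion sampler. The sketch only shows the order of operations: sample noise, denoise with the prompt embedding, decode.

```python
import numpy as np

def generate_video(denoise_fn, decode_fn, prompt_emb, latent_shape, num_steps=10, seed=0):
    """Sample noise in latent space, denoise it step by step, then decode."""
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(latent_shape)       # Gaussian noise in latent space
    for step in range(num_steps, 0, -1):
        t = step / num_steps                         # goes from 1.0 down to 1/num_steps
        noise_pred = denoise_fn(latent, t, prompt_emb)
        latent = latent - noise_pred / num_steps     # toy update rule, not a real sampler
    return decode_fn(latent)

def toy_denoiser(latent, t, prompt_emb):
    # Hypothetical stand-in for STDiT: nudges the latent toward the prompt's mean value.
    return latent - prompt_emb.mean()

def toy_decoder(latent):
    # Hypothetical stand-in for the VAE decoder: 8x nearest-neighbour spatial upsampling.
    return np.repeat(np.repeat(latent, 8, axis=1), 8, axis=2)

prompt_emb = np.ones((16, 32))                       # placeholder for a text-encoder output
video = generate_video(toy_denoiser, toy_decoder, prompt_emb, latent_shape=(4, 8, 8, 4))
print(video.shape)  # (4, 64, 64, 4)
```

Passing the denoiser and decoder as callables keeps the loop independent of any particular model, so real components could be substituted without changing the control flow.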


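(PixArt-α is an open-source text-to-image model released by Huawei Noah's Ark Lab, Dalian University of Technology, the University of Hong Kong, and the Hong Kong University of Science and Technology: https://github.com/PixArt-alpha/PixArt-alpha)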

(The T5 text encoder comes from Google's open-source Text-to-Text Transfer Transformer: https://github.com/google-research/text-to-text-transfer-transformer)


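To generate high-quality video, the Open-Sora 1.0 training recipe is divided into three stages:

1. Large-scale image pretraining: drawing on the abundant image data available online and on mature text-to-image models, this stage lays a solid foundation for video pretraining.

2. Large-scale video pretraining: building on image pretraining, a temporal attention module is added and the model is trained on a large amount of video data at a low resolution of 256x256, strengthening its understanding of temporal structure in video.

3. High-quality video fine-tuning: the final stage fine-tunes on longer, higher-resolution video data to further improve generation quality.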




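To make it easier for developers to get started with Open-Sora 1.0, the Colossal-AI team provides a complete set of data preprocessing scripts. The scripts automatically download public video datasets, split long videos into short clips based on shot continuity, and use an open-source large language model to generate precise prompts. These tools greatly lower the barrier to entry.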
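As an illustration of splitting a long video by shot continuity, the sketch below marks a cut wherever the mean absolute difference between consecutive frames exceeds a threshold. This is an assumed, simplified heuristic and not the team's actual script; the function name `split_into_shots`, the threshold, and the minimum clip length are hypothetical values chosen for the example.

```python
import numpy as np

def split_into_shots(frames, threshold=30.0, min_len=2):
    """frames: (N, H, W) grayscale array. Returns (start, end) index pairs, end exclusive."""
    # Convert to float first so uint8 subtraction cannot wrap around.
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    # A large jump between frame i and i + 1 starts a new shot at i + 1.
    cuts = [i + 1 for i, d in enumerate(diffs) if d > threshold]
    bounds = [0] + cuts + [len(frames)]
    # Drop clips that are too short to be useful.
    return [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if e - s >= min_len]

# Synthetic video: 5 dark frames followed by 4 bright frames.
dark = np.full((5, 4, 4), 10, dtype=np.uint8)
bright = np.full((4, 4, 4), 200, dtype=np.uint8)
shots = split_into_shots(np.concatenate([dark, bright]))
print(shots)  # [(0, 5), (5, 9)]
```

Each returned pair can then be used to slice the original frames into a short clip, which would be captioned in a later step.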

Video/text pairs generated automatically by the data preprocessing scripts

Open-Sora 1.0 model showcase

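Although Open-Sora 1.0 used only 406K training samples, it can already generate some impressive videos. The team acknowledges, however, that the model still has room to improve on human figures and complex scenes.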

A serene night scene in a forested area. The first frame shows a tranquil lake reflecting the star-filled sky above. The second frame reveals a beautiful sunset, casting a warm glow over the landscape. The third frame showcases the night sky, filled with stars and a vibrant Milky Way galaxy. The video is a time-lapse, capturing the transition from day to night, with the lake and forest serving as a constant backdrop. The style of the video is naturalistic, emphasizing the beauty of the night sky and the peacefulness of the forest.

The vibrant beauty of a sunflower field. The sunflowers, with their bright yellow petals and dark brown centers, are in full bloom, creating a stunning contrast against the green leaves and stems. The sunflowers are arranged in neat rows, creating a sense of order and symmetry. The sun is shining brightly, casting a warm glow on the flowers and highlighting their intricate details. The video is shot from a low angle, looking up at the sunflowers, which adds a sense of grandeur and awe to the scene. The sunflowers are the main focus of the video, with no other objects or people present. The video is a celebration of nature's beauty and the simple joy of a sunny day in the countryside.

The video captures the majestic beauty of a waterfall cascading down a cliff into a serene lake. The waterfall, with its powerful flow, is the central focus of the video. The surrounding landscape is lush and green, with trees and foliage adding to the natural beauty of the scene. The camera angle provides a bird's eye view of the waterfall, allowing viewers to appreciate the full height and grandeur of the waterfall. The video is a stunning representation of nature's power and beauty.

A soaring drone footage captures the majestic beauty of a coastal cliff, its red and yellow stratified rock faces rich in color and against the vibrant turquoise of the sea. Seabirds can be seen taking flight around the cliff's precipices. As the drone slowly moves from different angles, the changing sunlight casts shifting shadows that highlight the rugged textures of the cliff and the surrounding calm sea. The water gently laps at the rock base and the greenery that clings to the top of the cliff, and the scene gives a sense of peaceful isolation at the fringes of the ocean. The video captures the essence of pristine natural beauty untouched by human structures.

A herd of deer is seen running through a snowy forest. The deer are brown and white, and they are moving quickly through the snow-covered trees and bushes. The snow is falling heavily, creating a white blanket over the landscape. The deer are running in a group, with some leading the way and others following behind. The forest is dense with trees and bushes, and the snow is deep, making it difficult for the deer to move quickly. The scene is full of action and movement, as the deer try to escape the falling snow.


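The Colossal-AI team focuses mainly on providing an integrated system for distributed model training. Its main use cases include accelerating AlphaFold protein structure prediction; Colossal-AI Llama-2, a polished private 13B model built for $5,000; and enhanced MoE parallelism that improves the training efficiency of open-source MoE models by 9x.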


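Chief Strategy Officer: Professor James Demmel. He is a Distinguished Professor at the University of California, Berkeley, and a member of the National Academy of Sciences, the National Academy of Engineering, and the American Academy of Arts and Sciences.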



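From the model architecture to the training recipe to the data preprocessing techniques, Open-Sora 1.0 lays out a clear path for Sora-like video generation. What remains for replicating Sora is compute, compute, and more compute.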







