• 2024-06-23

He developed the Vidu text-to-video large model, and plans to go on to realize a practical, universal multimodal large model.

Simply input a passage of text, and a high-definition video is generated instantly. This is the unique appeal that text-to-video large models, represented by Sora, have shown the world. (Editor's note: Sora is a text-to-video large model launched by the American AI research company OpenAI in February 2024, capable of generating realistic videos up to 60 seconds long from brief text input.)

A little over two months after the release of Sora, on April 27, 2024, the Chinese-developed video large model Vidu was unveiled, bringing another new achievement to the field of video generation.

This model supports one-click generation of high-definition video content up to 16 seconds long at up to 1080P resolution. Bao Fan, co-founder and CTO of Shengshu Technology, is its principal inventor.


With the development of the text-to-video large model Vidu, Bao Fan was selected for MIT Technology Review's 2023 "Innovators Under 35" China list.

With its one-click generation of 16-second high-definition content, this text-to-video large model is expected to be applied in film and television, content production, and other fields. Bao Fan and his team developed Vidu on top of a diffusion model with U-ViT as its core architecture.

This model leverages the scalability and long-sequence modeling capabilities of the Transformer, breaking through the limit on the duration of generated video. It can not only output a 16-second 1080P video in a single generation, but can also generate single-frame images by treating them as one-frame videos.
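
The article does not describe Vidu's internal implementation, but the claim above (one model covering 16-second clips and single-frame images alike) is easiest to picture if a video is flattened into one long sequence of spatio-temporal patch tokens that a Transformer can process regardless of length. The following is a minimal, illustrative sketch of that idea only; the patch sizes, shapes, and function name are assumptions, not details from Vidu.

```python
import torch

def video_to_tokens(video, patch_t=1, patch_h=16, patch_w=16):
    """Flatten a video of shape (T, C, H, W) into spatio-temporal patch tokens.
    A single image is simply the T == 1 case, so images and videos of any
    length share one token representation. Illustrative sketch only."""
    T, C, H, W = video.shape
    assert T % patch_t == 0 and H % patch_h == 0 and W % patch_w == 0
    # Split each axis into (number of patches, patch size).
    x = video.reshape(T // patch_t, patch_t, C,
                      H // patch_h, patch_h,
                      W // patch_w, patch_w)
    # Bring the patch-grid axes to the front, then flatten each patch into a vector.
    x = x.permute(0, 3, 5, 1, 2, 4, 6)
    return x.reshape(-1, patch_t * C * patch_h * patch_w)   # (num_tokens, token_dim)

clip  = torch.randn(16, 3, 256, 256)    # a 16-frame clip
image = torch.randn(1, 3, 256, 256)     # a single frame, handled identically
print(video_to_tokens(clip).shape)      # longer sequence for a longer video
print(video_to_tokens(image).shape)     # shorter sequence for an image
```

Under this view, generating a single-frame image is simply generating a very short token sequence, which is consistent with the capability described above.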

In addition, Vidu's output has good dynamics and coherence: it can produce videos that look like real footage, and it can also create imaginative content.

Specifically:

First, it can generate videos of different lengths. Second, the generated videos possess strong three-dimensional consistency.

Third, it can create videos containing transitions in a single generation, and these transitions can connect two different scenes in an engaging way.

In fact, these capabilities are only part of Vidu's generative repertoire. It can also generate videos containing cuts, camera-motion videos with zooming and panning, and videos with lighting effects that enhance the atmosphere of a scene.

In the process of verifying the model's effectiveness, the team compared it with Sora, then the most capable text-to-video model, and found that Vidu demonstrated comparable performance in terms of the duration, coherence, and dynamism of the generated videos.

It is evident that Vidu has potential applications in multiple scenarios. For example:

1. Film and Television Industry.

Multi-camera shooting is a common method used in the film or television production process. If Vidu could be applied in this process, it would be possible to shoot with only one camera, while the video from other camera positions is automatically inferred by this large model. This could bring significant efficiency improvements to the originally complex film and television production process.

2. Content Production.

Vidu can help users produce the content they want anytime, anywhere, and provide them with personalized emotional experiences. For example, with Vidu's support, users can always see video content that suits their taste, or immerse themselves in fresh scenery at any moment.

When it comes to the entire development process of Vidu, Bao Fan describes it as "similar to the feeling of building a rocket".

"Unlike the process of doing research and publishing papers in academia, it is solving a large-scale project management problem. To achieve the goal of developing a video generation model, we must overcome many problems at various levels, including algorithms, data, and engineering," he said.

Therefore, during development, Bao Fan spent a lot of time every day thinking about how to compress these problems at the various levels.

"For example, two things can be combined into one thing to do, or after doing one thing, it is unnecessary to do another thing," he explained.And because they did not accumulate enough experience at the beginning, they had to face all kinds of uncertainties and needed to spend a lot of time on trial and error work.

"Due to the huge uncertainty, I was in a state of high pressure during that time, and every night at the company, I relied on eating instant noodles to relieve stress," said Bao Fan.

Without hesitation, he embarked on the path of entrepreneurship and committed himself to realizing a practical universal multimodal large model.

Like most students, Bao Fan also followed the standard path and completed compulsory education and the college entrance examination.

"Perhaps a slightly different point is that I formed the habit of thinking about things from the basic principles earlier," said Bao Fan.In his view, the reasons behind this may be reflected in multiple aspects.

Firstly, he believes that his brain capacity is limited, and if he does not compress knowledge into dense fundamental principles, it is difficult to remember.

Secondly, the family education he received also played a very important role.

"When I was very young, my father often told me some tricky math problems. Although they can be solved with simple addition, subtraction, multiplication, and division, it is easy to make mistakes if you do not start thinking from the basic principles," said Bao Fan.

In 2014, he was admitted to the School of Life Sciences at Tsinghua University for undergraduate studies, and two years later he transferred to the Department of Computer Science and Technology. After obtaining a bachelor's degree in computer science in 2019, he continued on to a doctoral degree at his alma mater, under the guidance of Academician Zhang Pan and Professor Zhu Jun. During this period, he focused on diffusion models as his research direction and produced a number of internationally influential results in this field, the most representative being Analytic-DPM, U-ViT, and UniDiffuser.

"Before the third year of my doctorate, my research interest was concentrated on theory, and I did a lot of theoretical research on energy models, fractional matching, learning theory, and diffusion models," said Bao Fan.

Among these, to accelerate inference in diffusion models, he designed Analytic-DPM[2], an inference framework that requires no additional training. The related paper won an Outstanding Paper Award at ICLR 2022, a top machine learning conference, and the proposed method was also applied as a core technique in the ultra-large-scale text-to-image generation system DALL·E 2 released by OpenAI.
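
To make the "no additional training" point concrete: as I understand the Analytic-DPM paper, the optimal variance of each reverse step has a closed form that only requires a Monte Carlo estimate of the expected squared norm of a pretrained noise-prediction network's output. The sketch below illustrates that idea; the exact formula follows my reading of the paper and should be checked against it, and `eps_model` and `sample_xn` are hypothetical helpers, not part of any released codebase.

```python
import torch

def analytic_dpm_variances(eps_model, sample_xn, alpha_bar, num_mc=64):
    """Training-free estimate of per-step reverse variances, in the spirit of
    Analytic-DPM. `alpha_bar` holds the cumulative products bar_alpha_0..bar_alpha_N
    with bar_alpha_0 = 1; `eps_model(x_n, n)` is a pretrained noise predictor and
    `sample_xn(n, m)` draws m forward-process samples at step n (both hypothetical).
    The closed form below follows my reading of the paper."""
    N = len(alpha_bar) - 1
    beta_bar = 1.0 - alpha_bar                          # bar_beta_n = 1 - bar_alpha_n
    sigma2 = torch.zeros(N + 1)
    for n in range(1, N + 1):
        alpha_n = alpha_bar[n] / alpha_bar[n - 1]
        beta_n = 1.0 - alpha_n
        lam2 = beta_n * beta_bar[n - 1] / beta_bar[n]   # DDPM posterior variance
        # Monte Carlo estimate of E[ ||eps_theta(x_n, n)||^2 / d ] over q_n(x_n);
        # this is the only quantity to estimate -- no extra training is needed.
        eps = eps_model(sample_xn(n, num_mc), n)
        gamma = (eps ** 2).mean()
        gap = (beta_bar[n] / alpha_n).sqrt() - (beta_bar[n - 1] - lam2).clamp(min=0).sqrt()
        sigma2[n] = lam2 + gap ** 2 * (1.0 - gamma)     # analytic optimal variance
    return sigma2
```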

"After the third year of my doctorate, my research direction converged on diffusion models and their applications. This is because I saw the prospects of diffusion models in generative AI. Therefore, I no longer only pursue the elegance of theory, but also pursue the elegance in engineering and practice," said Bao Fan.

Building on this, he has produced a series of results in network architecture, probabilistic modeling, and large-scale training, all aimed at a universal multimodal large model. In terms of network architecture, he proposed the aforementioned U-ViT, laying the foundation for the architecture of multimodal diffusion models.

In fact, before this architecture was proposed, the field of video generation typically used diffusion models built around the U-Net architecture, which could support text-to-video large models only for relatively short durations (mostly around 4 seconds).

The bottleneck of the U-Net architecture, however, is that once the model's parameter count and data volume reach a certain scale, further increases no longer bring significant performance improvements.

The Transformer architecture is different: the larger the parameter count and data volume of models built on it, the better the final performance that can be achieved.

Therefore, Bao Fan and his collaborators developed the U-ViT[3] architecture, which combines diffusion with the Transformer, giving the diffusion model scalability and the ability to handle multimodal data. In terms of probabilistic modeling, he developed UniDiffuser, a multimodal diffusion model based on the U-ViT architecture, and completed a large-scale verification of U-ViT's scalability.
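
The publicly described idea behind U-ViT can be summarized as "treat everything as tokens": the diffusion timestep, the condition, and the noisy image patches all enter a single Transformer, with U-Net-style long skip connections between shallow and deep blocks. The sketch below shows only that structure; the layer sizes, depth, and class name are illustrative assumptions, not the actual implementation of U-ViT, UniDiffuser, or Vidu.

```python
import torch
from torch import nn

class UViTSketch(nn.Module):
    """Minimal sketch of the U-ViT idea: time, condition, and noisy-patch
    embeddings are all tokens of one Transformer, connected by U-Net-style
    long skips. Sizes and names are illustrative assumptions."""

    def __init__(self, dim=512, depth=12, heads=8):
        super().__init__()
        assert depth % 2 == 0
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.in_blocks = nn.ModuleList(layer() for _ in range(depth // 2))
        self.out_blocks = nn.ModuleList(layer() for _ in range(depth // 2))
        # Each long skip concatenates a shallow feature with a deep one, then projects back.
        self.skip_proj = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(depth // 2))

    def forward(self, patch_tokens, time_token, cond_tokens):
        # All inputs become tokens of a single sequence.
        x = torch.cat([time_token, cond_tokens, patch_tokens], dim=1)
        skips = []
        for blk in self.in_blocks:
            x = blk(x)
            skips.append(x)
        for blk, proj in zip(self.out_blocks, self.skip_proj):
            x = proj(torch.cat([x, skips.pop()], dim=-1))  # long skip connection
            x = blk(x)
        return x  # noise/score prediction is read off the patch tokens downstream
```

Because the backbone is a plain Transformer over tokens, adding parameters and data follows the usual Transformer scaling recipe, which is the scalability property described above.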

"When we saw the effects of the UniDiffuser model, which can be basically on par with the Stable Diffusion model released by the open-source generative AI company Stability AI, we have already made the judgment that the fusion of Diffusion and Transformer architecture is expected to have great potential in the future," said Bao Fan.

In March 2023, Bao Fan officially embarked on the entrepreneurial path and co-founded a multimodal large model company called Shengshu Technology.

When talking about his reason for becoming an entrepreneur, he said: "At that time, I had to choose between entrepreneurship and academia, and my goal has always been to create a large model that can bring profound changes to human society. The shortest path to that goal is entrepreneurship, so I set out on it without hesitation."

The aforementioned Vidu is not only the fruit of his research and development after the company was founded, but also a comprehensive summary of all his previous work in the field of diffusion models. As for why he decided to develop such a video large model right at the company's founding, Bao Fan had his own considerations.

"From a technical point of view, I think the video model itself is a significant breakthrough in the field of AI and even for all of humanity. In terms of commercialization, the film and television, animation and other industries currently have a large market, so video generation itself has a large commercial value," he said.

At present and going forward, his research goal is to achieve a practical, general multimodal large model: one that understands inputs of various modalities in a unified way and can flexibly complete all kinds of controllable generation tasks.

"We have initially achieved some general controllability. For example, most video-related tasks, including video stylization, video editing, and repair, can be completed within a single model," said Bao Fan.

Of course, he also pointed out that the tasks the model can complete now are far from covering all controllable generation tasks. To make the model more versatile, it also needs to be able to handle input material of various modalities, including text, images, videos, and 3D.

 

"If the model can understand the input of various modalities well, then it is not far from the general controllability," said Bao Fan.

 

At present, he is promoting the realization of this goal.
