- Published on
JoyAI-Echo Long Audio-Video Generation Framework: Dialogue-Based Editing, Claims "to Enter the Global First Tier"
- Authors

- Name
- aimode.news
- @aimode_news
IT Home reported on June 3 that the JoyAI-Echo long audio-video generation framework was released today. It is billed as directly addressing three long-standing industry pain points: character drift, inconsistent voices, and slow generation. It also implements "conversational editing", so changing a single shot no longer requires regenerating the whole video.
According to JD.com, the release of JoyAI-Echo marks the team's entry into the global first tier of long video generation.
According to the announcement, JoyAI-Echo builds a dedicated memory bank into the framework that continuously stores and retrieves each character's visual features and voice information during multi-scene generation. The reported results show that within a five-minute video, character identity, visual appearance, and voice timbre stay highly consistent, avoiding the awkward situation where a character turns into someone else partway through.
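The memory-bank idea can be illustrated with a toy sketch. This is not JoyAI-Echo's implementation, which the article does not describe; the class names, fields, and the `generate_scene` placeholder are all hypothetical, and plain lists of floats stand in for real visual and voice embeddings. It only shows the general pattern of storing a character's features once and conditioning every later scene on the same record.

```python
from dataclasses import dataclass, field


@dataclass
class CharacterMemory:
    """Hypothetical per-character record; field names are illustrative only."""
    visual: list[float]  # stand-in for a visual identity embedding
    voice: list[float]   # stand-in for a voice/timbre embedding


@dataclass
class MemoryBank:
    """Toy store that is written once and read by every later scene."""
    entries: dict[str, CharacterMemory] = field(default_factory=dict)

    def remember(self, name: str, visual: list[float], voice: list[float]) -> None:
        # Keep the first stored identity so later scenes cannot overwrite it.
        self.entries.setdefault(name, CharacterMemory(visual, voice))

    def recall(self, name: str) -> CharacterMemory:
        return self.entries[name]


def generate_scene(bank: MemoryBank, scene: str, cast: list[str]) -> dict:
    # Placeholder for a generator call (assumption): it only records which
    # stored features each scene would be conditioned on.
    return {"scene": scene, "conditioning": {n: bank.recall(n) for n in cast}}


bank = MemoryBank()
bank.remember("Alice", visual=[0.1, 0.9], voice=[0.4, 0.2])
shots = [generate_scene(bank, s, ["Alice"]) for s in ("kitchen", "street")]

# Both scenes reference the very same stored record for the character.
same_identity = shots[0]["conditioning"]["Alice"] is shots[1]["conditioning"]["Alice"]
```

Because every scene reads from one shared record rather than re-deriving the character from the prompt, identity cannot drift between scenes in this toy setup; how the real system retrieves and injects those features is not stated in the source.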
The team proposed a memory-driven post-training pipeline that combines SFT, cross-modal RLHF, and Distribution Matching Distillation (DMD). It is said to improve generation quality while also speeding up inference, with DMD alone bringing roughly a 7.5x speedup. JoyAI-Echo also adds an intelligent "director assistant", the Director Agent, which accepts requirements in natural language and automatically breaks them down into script, characters, scenes, and shots.
In addition, JoyAI-Echo includes a dedicated real-time super-resolution module that produces high-resolution video and fine-grained audio through one-step super-resolution. It supports two resolution upgrade tiers: 736 x 1280 → 1152 x 1920 and 736 x 1280 → 1472 x 2560.
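To make the two upgrade tiers concrete, the short sketch below works out the per-axis scale factors and the increase in pixel count from the resolutions quoted above. The helper names are illustrative and not part of any JoyAI-Echo API; only the three resolutions come from the article.

```python
from fractions import Fraction

BASE = (736, 1280)                        # width, height quoted in the article
TARGETS = [(1152, 1920), (1472, 2560)]    # the two upgrade tiers


def scale_factors(src, dst):
    """Exact width and height scale factors from src to dst."""
    return Fraction(dst[0], src[0]), Fraction(dst[1], src[1])


def pixel_ratio(src, dst):
    """How many times more pixels dst has than src."""
    return Fraction(dst[0] * dst[1], src[0] * src[1])


for target in TARGETS:
    sw, sh = scale_factors(BASE, target)
    print(f"{BASE} -> {target}: width x{float(sw):.3f}, "
          f"height x{float(sh):.3f}, pixels x{float(pixel_ratio(BASE, target)):.3f}")
```

The second tier is an exact 2x upscale on both axes, or four times the pixels. The first tier is not perfectly uniform: width scales by 36/23 (about 1.565) and height by exactly 1.5, so the aspect ratio shifts slightly from 0.575 to 0.6.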
IT Home has attached the project page and the GitHub code repository below: