How HappyHorse 1.0 stacks up on benchmark ranking, audio-visual sync, and speed.
| Feature | HappyHorse 1.0 | Sora 2 | Veo 3.1 |
|---|---|---|---|
| Artificial Analysis Video Arena ranking (T2V & I2V) | #1 Elo | Top 5 | Top 5 |
| Native joint video + audio (single forward pass) | Yes — built in | Limited | Yes |
| Lip-sync languages | 7 (EN/ZH/Cantonese/JA/KO/DE/FR) | EN focus | Limited |
| 1080p generation time (single H100) | ~38 seconds | ~2–3 min | ~1–2 min |
| Reference-to-video (R2V) | Yes — dedicated endpoint | Limited | Yes |
| Free trial | Yes — starter credits | Limited | Paid |
HappyHorse 1.0 is Alibaba's flagship AI video model, launched on fal in April 2026 and ranked #1 on the Artificial Analysis Video Arena leaderboard for both Text-to-Video and Image-to-Video. Unlike pipelines that bolt audio onto video after the fact, HappyHorse uses a unified 40-layer self-attention Transformer that generates video and audio jointly in a single forward pass — no separate audio post-processing, no cross-attention modules. The result is natively synchronized output: lip-sync, footsteps, ambient sound, and dialogue all emerge from one prompt. Built for product promos, social content, and multi-shot sequences with consistent character identity.
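To make the "single forward pass" claim concrete, here is a minimal conceptual sketch in PyTorch: video and audio latents are concatenated into one token sequence and run through shared self-attention layers, so the two modalities synchronize inside the model rather than in a post-processing step. Every dimension, token count, and the reduced layer count are placeholders for illustration, not HappyHorse's actual architecture.

```python
# Conceptual sketch only: joint video + audio generation in one forward
# pass through shared self-attention. All sizes are placeholders; the doc
# cites 40 layers, reduced to 4 here so the demo runs quickly.
import torch
import torch.nn as nn

D_MODEL, N_LAYERS = 512, 4           # doc cites 40 layers; 4 for the demo
N_VIDEO, N_AUDIO = 256, 64           # hypothetical token counts per clip

layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=N_LAYERS)

video_tokens = torch.randn(1, N_VIDEO, D_MODEL)  # stand-in video latents
audio_tokens = torch.randn(1, N_AUDIO, D_MODEL)  # stand-in audio latents

# One sequence, one pass: self-attention mixes the two modalities directly,
# so no separate audio model or cross-attention module is bolted on.
joint = torch.cat([video_tokens, audio_tokens], dim=1)
out = backbone(joint)

video_out, audio_out = out[:, :N_VIDEO], out[:, N_VIDEO:]
print(video_out.shape, audio_out.shape)  # (1, 256, 512) and (1, 64, 512)
```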
Six capabilities that put HappyHorse at #1 on the public video arena.
A single 40-layer Transformer produces frames and audio together. Lip movement, footsteps, and ambient sound emerge synchronized — no post-dub, no manual alignment.
Native lip-sync for English, Mandarin, Cantonese, Japanese, Korean, German, and French. Write dialogue in any of them and HappyHorse mouths it correctly.
Text-to-video, image-to-video, reference-to-video, and video edit — all from the same model, exposed as four dedicated endpoints on Zopia's pipeline (see the call sketch after this list).
Upload up to 9 reference images of a character or product; HappyHorse holds visual identity across multi-shot sequences without retraining.
~38 seconds for a 1080p clip on a single H100. Iterate at the speed of conversation — critical for ad creative testing.
Top of the Artificial Analysis Video Arena leaderboard for both Text-to-Video and Image-to-Video as of April 2026 — the most rigorous public head-to-head video arena.
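For developers, the four capabilities above map to four endpoints. The sketch below shows what calling one could look like over plain HTTP; the base URL, route names, field names, and auth header are assumptions for illustration, so check the model page for the real schema.

```python
# Hypothetical routes and fields for illustration; the real Zopia/fal
# endpoints, parameters, and auth scheme may differ.
import os
import requests

BASE = "https://api.example.com/happyhorse-1.0"   # placeholder base URL
HEADERS = {"Authorization": f"Key {os.environ['API_KEY']}"}

ENDPOINTS = {
    "t2v":  f"{BASE}/text-to-video",
    "i2v":  f"{BASE}/image-to-video",
    "r2v":  f"{BASE}/reference-to-video",
    "edit": f"{BASE}/video-edit",
}

# Text-to-video: one prompt drives both the frames and the audio track.
resp = requests.post(
    ENDPOINTS["t2v"],
    headers=HEADERS,
    json={"prompt": "A barista greets the camera, espresso hiss, slow push-in"},
    timeout=300,
)
resp.raise_for_status()
print(resp.json())  # typically a job id or a URL for the finished clip
```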
From a blank canvas to a synced clip in three steps.
Type a prompt, upload an image (I2V), or upload up to 9 references for reference-to-video. Mix and match — HappyHorse picks the right endpoint automatically.
Describe what you want to hear (dialogue in any of 7 languages, footsteps, ambient) and what you want to see (camera movement, action, lighting). Sound and visuals generate together.
Pick aspect ratio (16:9 / 9:16 / 1:1 / 4:3 / 3:4), duration (3–15s), and resolution (720p / 1080p). 1080p generates in roughly 38 seconds — run variants side by side.
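As a sketch of step 3, the payload below spells out those options with hypothetical field names (the real request schema may differ). At roughly 38 seconds per 1080p clip, sweeping every aspect ratio for side-by-side variants is cheap.

```python
# Hypothetical field names; the option ranges come from the steps above.
base = {
    "prompt": "Product hero shot on marble, soft ambient hum, slow orbit",
    "duration": 8,           # seconds, 3-15
    "resolution": "1080p",   # 720p or 1080p; ~38 s per clip on one H100
}

# One payload per aspect ratio: submit each to the generation endpoint
# and compare the variants side by side.
variants = [{**base, "aspect_ratio": ar}
            for ar in ("16:9", "9:16", "1:1", "4:3", "3:4")]
for v in variants:
    print(v)
```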
HappyHorse takes audio cues seriously — be explicit about what should be heard. Best structure: subject + dialogue + ambient sound + camera + scene + style. Example: "A barista in a Tokyo cafe + says 'おはようございます' ('good morning') to camera + soft espresso machine hiss in background + slow push-in + warm morning light + cinematic." For lip-sync, write the literal line you want spoken in quotes — HappyHorse handles all 7 languages natively. For reference-to-video, upload the character or product image and describe the scene; HappyHorse holds identity automatically. Avoid mood-only prompts like "make it cinematic"; always anchor with subject + action + sound.
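If you build prompts programmatically, a small helper can enforce that structure. This is a hypothetical convenience function, not part of any SDK; the "+" joiner simply mirrors the convention in the example above.

```python
# Hypothetical helper enforcing the subject + dialogue + ambient + camera
# + scene + style structure; the "+" joiner is a convention, not API syntax.
def build_prompt(subject: str, dialogue: str, ambient: str,
                 camera: str, scene: str, style: str) -> str:
    parts = [subject, f"says '{dialogue}' to camera", ambient, camera, scene, style]
    return " + ".join(parts)

prompt = build_prompt(
    subject="A barista in a Tokyo cafe",
    dialogue="おはようございます",   # the literal line to be lip-synced
    ambient="soft espresso machine hiss in background",
    camera="slow push-in",
    scene="warm morning light",
    style="cinematic",
)
print(prompt)
```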
From a single prompt to a synced 1080p clip — start generating in seconds.
Generate for Free
Everything you need to ship a synced clip — at a glance.
Same one-prompt experience, different specialties.