Real-Time Video Generation On A Laptop
Much has been written about how gaming requires rendering to happen locally on a user’s machine. Failed experiments like Google’s Stadia show that even if the monetization model is right, the latency inherent in connecting to a remote machine is still a huge hurdle to overcome for video workloads. It is then somewhat surprising that the latest batch of ML companies focusing on live video generation aren’t rushing to make this tech accessible on consumer devices.
As mentioned in the last blog, we need videos to be generated both locally and swiftly for our goals. That means we went out and did the required work ourselves. Our stack now runs a 2B-parameter video diffusion model at 480p and 6 FPS on an RTX 4080 (laptop edition) in under 8GB of VRAM. The same setup achieves 18 FPS on a desktop RTX 5090.
Secret Sauce - W4A4 inference
It is by now evident that autoregressive video generation with latent diffusion models (LDMs) leaves the GPU workload memory-bound. In that regime, inference time is dominated by moving data in and out of the compute cores rather than by the arithmetic itself, so stronger quantization translates directly into speed: compressing weights and activations 4x should yield up to a 4x faster inference.
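The bandwidth argument above can be made concrete with a back-of-envelope model. The functions and the 2B parameter count below are illustrative (the parameter count comes from this post; everything else is a hypothetical sketch of the "time scales with bytes moved" reasoning, not our actual profiler):

```python
# Back-of-envelope model of a memory-bound layer: if runtime is dominated
# by bytes moved, the ideal speedup equals the compression ratio.

def ideal_speedup(bits_baseline: float, bits_quantized: float) -> float:
    """Upper bound on speedup for a purely bandwidth-bound workload."""
    return bits_baseline / bits_quantized

def weight_traffic_gb(n_params: float, bits: float) -> float:
    """Weight traffic per full forward pass, in GB."""
    return n_params * bits / 8 / 1e9

n_params = 2e9  # 2B-parameter model, as in this post
print(weight_traffic_gb(n_params, 16))  # BF16: 4.0 GB of weights per pass
print(weight_traffic_gb(n_params, 4))   # INT4: 1.0 GB of weights per pass
print(ideal_speedup(16, 4))             # ideal bound: 4.0x
```

Real pipelines fall short of the ideal bound because not every kernel is bandwidth-bound, which is why measured end-to-end speedups land below 4x.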
The current paradigm in practice is that 8-bit (weights and activations kept at 8 bits, or W8A8 for short) inference is the limit for video workloads. Projects like QVGen showcase that W4A4 can be achieved through quantization-aware training, but this fundamentally changes the model outputs.
We have verified that W4A4 post-training quantization is possible with negligible deviation from the base BF16 model.
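To illustrate the basic mechanics, here is a minimal sketch of symmetric per-channel INT4 post-training quantization with round-to-nearest. This is a toy round-trip, not the recipe behind the result above (which this post does not detail); all names are illustrative:

```python
import numpy as np

def quantize_int4(w: np.ndarray, axis: int = 0):
    """Quantize weights to signed INT4 range [-8, 7] with per-channel scales."""
    max_abs = np.max(np.abs(w), axis=axis, keepdims=True)
    scale = max_abs / 7.0  # map the largest magnitude in each channel to 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct an approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(64, 64)).astype(np.float32)
q, s = quantize_int4(w)
err = np.abs(dequantize_int4(q, s) - w).max()  # bounded by half a scale step
```

Production PTQ schemes add calibration data, outlier handling, and activation quantization on top of this skeleton; the point here is only that each value round-trips with an error of at most half a quantization step.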
This result is not too surprising for those who follow the Sana Video line of work, where a reported 2.3x speedup comes from quantizing most of the model to NVFP4. However, this requires either a server-grade GPU or one of the RTX 5xxx series to work.
We observe an overall pipeline speedup of 2.14x using INT4 operations, allowing our model to run on all NVIDIA GPUs made in the last 8 years.
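Part of why INT4 helps on older hardware is pure storage: two 4-bit values fit in one byte, halving memory footprint and traffic versus INT8. The nibble-packing layout below is a common, illustrative scheme (our actual kernel layout may differ):

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack pairs of signed INT4 values ([-8, 7]) into uint8, low nibble first."""
    q = q.astype(np.int8).reshape(-1, 2)
    lo = (q[:, 0] & 0x0F).astype(np.uint8)
    hi = (q[:, 1] & 0x0F).astype(np.uint8) << 4
    return lo | hi

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Invert pack_int4, sign-extending each nibble back to int8."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    both = np.stack([lo, hi], axis=1).reshape(-1)
    return np.where(both >= 8, both - 16, both).astype(np.int8)

vals = np.array([-8, 7, 0, -1], dtype=np.int8)
packed = pack_int4(vals)  # 2 bytes instead of 4
```

At 4 bits, the 2B parameters of the model occupy roughly 1 GB, which is how the whole pipeline fits comfortably under the 8GB VRAM budget mentioned above.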
So the takeaway here is that real-time generation for video workloads is now accessible to most PC gamers.
Results & Outlook
The W4A4 and BF16 outputs start out with only minor differences and stay closely consistent through their 30-second run. We expect another 2x on FPS in the coming months through further architectural improvements, so that our engine supports a consistent 16 FPS. If you'd like to join us on that journey, have a look at our jobs page for an appropriate position.
Many thanks to Kirill Rodriguez Blanter for help in producing these results.