Guide To OVI – The Consumer Counterpart of VEO 3

When Google released their VEO 2 model about 6 months ago, I was eager to try it out. So I did what most people would do: I wrote a script to generate videos through the API. As usual, nothing worked perfectly from the start, and I had to run a lot of tests and tweak the code in between. I did all this without a thought about the potential cost, and by the time the script was complete and working I had racked up about $100 in API charges. I’m not sure what the pricing was back then, but today VEO 2 costs $0.35 per 5-8 second video.

Looking at VEO 2 today, and comparing it to other models, it’s not really that impressive. And it’s only been 6 months since it was released.

Moving on to VEO 3, I must admit that I have only run a few tests. The price for VEO 3 is currently $0.40 per 4-8 second video, and even though I have some free access to it, I haven’t used it much. Not only because of the price: I actually prefer open source models most of the time. And not only because open source is free, but because of the flexibility of the models. The thing that makes VEO 3 stand out compared with other models is that it offers integrated speech and ambient sound in its videos.

The best video model we have in open source today is, in my opinion, WAN 2.2. I think it’s pretty amazing that we have an open source model that can compete with closed source ones. The full models require some pretty expensive hardware, but you can run them either on a cloud platform or, on lower tier hardware, locally as quantized models. I have seen people with as little as 8 GB VRAM managing to get videos out of the various WAN models.

I wanted to create an image-to-video clip with each of the above models to get a fair comparison, but VEO 3 doesn’t support that yet. The videos made with VEO 2 and WAN 2.2 both have the same starting frame though, and all 3 videos use the same prompt.

OVI is a new open source model that offers the same headline feature as VEO 3, i.e. integrated sound. At the time I’m writing this, it’s still not available in ComfyUI, which leaves only the option to run it from the command line (or in Gradio). To run the full model (23.3 GB) it’s recommended to have at least 80 GB VRAM, which I believe is pretty uncommon among average users. A smaller (11.7 GB) fp8 quantized version is available though, and to run that locally, 24 GB VRAM is recommended.

On my system I have 16 GB VRAM and 64 GB system RAM, and this is what it looked like when I ran the fp8 model.

Max GPU usage

I was able to use the model, but it was very slow, and installing everything took quite some time (mostly because the instructions were limited). This is the route I took to install it all.

The text below is mostly me ranting about all the steps I had to take, and how some of the info was wrong. If you want to jump directly to the corrected (by me) installation, click here: Install

  • A guy on YouTube claimed to have ported the model to ComfyUI, so I started by downloading his custom nodes. It didn’t work. And since he didn’t show any finished results, I don’t believe it worked for him either.

  • Realised I needed more disk space to download everything needed, so I wrote a script to find and move media from one drive to another.

  • Started following the installation guide on Hugging Face, which oddly enough tells you to use PyTorch 2.5.1.

  • I figured that maybe there was some good reason for using such an old torch, so I created an ovi-venv in which I ran pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
  • Noticed that Flash Attention was also required, something that I have written about before, and that can potentially be a big time sink when installing on Windows. Since the torch version was old, I was hoping to find a pre-compiled whl for torch 2.5.1 that also worked with my CUDA version.

  • I couldn’t find what I needed so I started to compile my own whl. And after I started, I actually did find a pre-compiled whl that worked on my system.

  • Installed Flash Attention for torch 2.5.1, and cloned the Ovi GitHub repository: git clone https://github.com/character-ai/Ovi

  • Installed the requirements, and got an error message. It turned out that something needed torch 2.7.1, so pip uninstalled my torch 2.5.1. And now everything built against 2.5.1, such as Flash Attention, wasn’t going to work.

  • Uninstalled Flash Attention and all of torch, torchvision and torchaudio. Re-installed torch, this time 2.7.1, and installed a Flash Attention build that worked with that version. Excited to get started, I ran the script. Error: you need to install Triton.

  • Went on to install Triton (which thankfully isn’t that hard). Finally! Ran the script. Error: you have not changed the settings in config.yaml.
  • Well of course, I have to tell the model what I want it to do. Fixed the config.yaml file. Ran the inference script. Error! Several checkpoints had the wrong name and/or were in the wrong location.
  • Changed the names and locations of the incorrect checkpoints. Error! Some of the files were in the .safetensors format, which was not accepted (usually it’s the other way around). Searched around the internet for the correct files in the correct format, without much luck. So I wrote a script to convert the .safetensors files into .pth files.
  • Finally the script was running, the model started and I was ready to try it out! I decided to use the text-to-video method, wrote my prompt and started to generate a video. Error! Cannot access local variable video_latent_h.
  • Found out that I should have used --use_image_gen for text-to-video prompting. Added the argument and started the script again. Error! image_gen is not supported in fp8! It turns out that the script loads Flux Dev KREA to generate the initial image, and my computer can’t handle that model this way.
  • So I switched to image-to-video instead. The script ran, and 25 iterations took 35 minutes. But once the actual iterations were done, it took an additional ~50 minutes to VAE decode (3 different VAEs for video, speech and ambient sound).
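The .safetensors conversion mentioned above can be sketched roughly like this. This is a minimal sketch, not the exact script I used: the ckpts folder name and the helper names are my own assumptions, and it requires torch and safetensors to be installed (pip install torch safetensors).

```python
from pathlib import Path

def pth_path(p: Path) -> Path:
    """Map model.safetensors -> model.pth next to the original file."""
    return p.with_suffix(".pth")

def convert(src: Path) -> Path:
    # Imports kept local so the path helper works without torch installed.
    import torch
    from safetensors.torch import load_file
    state_dict = load_file(str(src))   # read tensors from the .safetensors file
    dst = pth_path(src)
    torch.save(state_dict, dst)        # write a plain PyTorch checkpoint
    return dst

if __name__ == "__main__":
    root = Path("ckpts")  # assumed checkpoint folder; adjust to your layout
    if root.is_dir():
        for f in root.rglob("*.safetensors"):
            print(f"converting {f} -> {convert(f)}")
```

Note that this only repackages the tensors; it does not rename any keys, so it helps only when the loader complains about the file format rather than the contents.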

And here’s the first result, after fighting all the systems and dependencies for hours.

Installing And Using OVI – The Easier Way

Because I’m on Windows, my installation guide will be for Windows systems.

Start by opening a command window in the folder you want to install everything, and in the command window type:
git clone https://github.com/character-ai/Ovi.git

Next go in to the newly cloned folder:
cd Ovi

Create a virtual environment:
python -m venv ovi-venv

Activate the environment:
.\ovi-venv\scripts\activate

Install correct pytorch:
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128

NOTE: The cu128 at the end represents your CUDA version. Mine is 12.8, so the index URL I need ends in cu128. If you have another CUDA version installed, you need the matching index URL.
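To find your CUDA version, run nvidia-smi and look at the "CUDA Version" field in the top-right corner of its output. Mapping that version string to the wheel-index suffix is mechanical; here is a tiny sketch (the helper name is my own, and note that PyTorch only publishes wheel indexes for a handful of CUDA versions, so pick the closest one that actually exists):

```python
# Map a CUDA version string (as reported by `nvidia-smi`) to the
# suffix used in the PyTorch wheel index URL, e.g. "12.8" -> "cu128".
def cuda_index_suffix(version: str) -> str:
    major, minor = version.split(".")[:2]
    return f"cu{major}{minor}"

# Append the suffix to https://download.pytorch.org/whl/
print(cuda_index_suffix("12.8"))   # -> cu128
print(cuda_index_suffix("12.1"))   # -> cu121
```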

Install the rest of the dependencies:
pip install -r requirements.txt

Install Flash attention:
If you don’t know how, I have written a guide you can follow here: Flash Attention Guide

Create a folder structure suitable for Ovi. This is where I started (I have all checkpoints in various subfolders in the ckpt folder).

folder structure

This is how you want to build your folder structure:

folder structure

You can put them in another structure if you want, but you would have to fight a lot with various configurations.

Download Checkpoints

Change your inference_fusion.yaml that’s located in the inference folder.

inference

It should look something like this (maybe slightly different depending on folder structure and what other settings you are using).

settings.yaml

Now that you have everything the way it should be, open a command window in your install folder, go into the repo folder and activate your venv:
cd Ovi
.\ovi-venv\scripts\activate

To use the model in Gradio type: python gradio_app.py --cpu_offload --fp8
To use it on the command line type: python inference.py --config-file ovi/configs/inference/inference_fusion.yaml

If you have set everything up correctly, the iterations will soon start, and unless you have at least 24 GB VRAM, expect at least 1 hour before your first 5 second video is done.

Want more of these in-depth guides, as well as other deep dives into AI? Then sign up for my newsletter and get the latest updates directly in your inbox.

Become a Patreon to get access to exclusive content.

Published in AI, AI Video, English, Python, Tech