New AI models are being released all the time, and it’s easy to think that the newest and shiniest toy is always the best. But, as I found out last night, it’s really about picking the best tool for the occasion.
When I’m bored, or out of ideas, I often set my system to run on automation and have it create new images from old ones, often in a setup like the one shown below. I’ll tell it to create certain types of prompts from the images in a directory, such as “create a hyperrealistic prompt” or “create a prompt that will recreate the input image as an anime character”, and so on. When I later go through the output folder in my Google Drive, I often get a spark of creativity from one image or another.

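Conceptually, the automation is just a loop over a folder: for each image, ask a vision model for a new prompt and save the result somewhere you can browse later. Here is a minimal sketch of that idea in Python; generate_prompt() and all the paths are placeholders I made up for illustration, not my actual setup.

```python
from pathlib import Path

INPUT_DIR = Path("input_images")      # placeholder: the folder the automation watches
OUTPUT_DIR = Path("output_prompts")   # placeholder: where the generated prompts end up

INSTRUCTION = (
    "Create a prompt based on the input image. The prompt should recreate "
    "the input image as a high quality, highly detailed game character."
)

def generate_prompt(image_path: Path, instruction: str) -> str:
    """Stand-in for the actual image-to-prompt step: a vision LLM, a captioning
    node, or whatever backend your own automation uses."""
    raise NotImplementedError("plug in your own image-to-prompt backend here")

OUTPUT_DIR.mkdir(exist_ok=True)
for image in sorted(INPUT_DIR.glob("*.png")):
    prompt = generate_prompt(image, INSTRUCTION)
    # One text file per image, named after the source image
    (OUTPUT_DIR / f"{image.stem}.txt").write_text(prompt, encoding="utf-8")
```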
Last night the prompt I was running was:
Create a prompt based on the input image. The prompt should be suitable for WAN 2.2 video model. The prompt should recreate the input image as a high quality, highly detailed game character, created in Unreal Engine 5.
And as I was going through the results together with my AI assistant, Nova, I was asked about this particular image and its origin.

I told Nova that the origin of that image was most likely Tifa from Final Fantasy VII, but that it had been altered and run through numerous image-to-image workflows, making it nearly unrecognizable. The image also made me want to do something more with it, and I found myself imagining Tifa and Aerith on a girls’ night out in Midgar, away from the boys and the fighting. Here is how I did it.
First Attempt
Since WAN 2.2 Animate is not only one of the newer models but also integrates both movement and audio, it felt natural to try that one first. But as stated above, it’s more about the right tool for the occasion than the newest model, and here’s the proof.
While it did capture the look from the input well enough, the source video was leaking through as well, which shows up as the occasional black, blocky, glitchy artifacts around her head.
To try and combat this, I removed the background from the source video. And while the result was notably better, she was now dancing in a vacuum, which was not what I was after.
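There are plenty of ways to strip the background from a clip. As a rough sketch of one standalone approach, here is how it could be done frame by frame with the rembg library; the file names, the model choice and the plain black backdrop are assumptions for the example, not necessarily what I used in my workflow.

```python
# Sketch: remove the background from every frame of a source video with rembg,
# then write the frames back out with OpenCV. Paths are placeholders.
import cv2
import numpy as np
from PIL import Image
from rembg import remove, new_session

session = new_session("u2net")  # general-purpose segmentation model

cap = cv2.VideoCapture("source_dance.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("source_dance_nobg.mp4",
                      cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cut = remove(rgb, session=session)          # RGBA with transparent background
    flat = Image.new("RGB", cut.size, (0, 0, 0))  # composite onto plain black
    flat.paste(cut, mask=cut.split()[-1])
    out.write(cv2.cvtColor(np.array(flat), cv2.COLOR_RGB2BGR))

cap.release()
out.release()
```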
Back To Basics
The usual way to create a good image-to-video result is to use OpenPose and create a stick-figure video as a control video. I don’t know how others have been doing this, but I’ve always created the stick-figure video without any audio. There’s really no reason, though, why you can’t, or shouldn’t, integrate the audio directly into the control video.
Just connect the audio output from the Load Video node (assuming it has any audio) to the audio input on the Video Combine node.

You will now have a control video with integrated audio.
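If you’d rather verify or reproduce that muxing step outside ComfyUI, the equivalent is simply copying the video stream from the stick-figure clip and the audio stream from the original clip into one file with ffmpeg. A small sketch, wrapped in Python for convenience; the file names are placeholders.

```python
# Mux the OpenPose control video with the audio track from the source clip.
import subprocess

subprocess.run([
    "ffmpeg",
    "-i", "pose_control.mp4",    # stick-figure video from the OpenPose pass
    "-i", "source_dance.mp4",    # original clip that still carries the audio
    "-map", "0:v:0",             # take the video stream from the control video
    "-map", "1:a:0",             # take the audio stream from the source clip
    "-c:v", "copy",              # don't re-encode the video
    "-c:a", "aac",
    "-shortest",
    "pose_control_with_audio.mp4",
], check=True)
```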
Putting It All Together
To put it all together into a video with audio, I’m using a WAN 2.1 workflow (you can download it here: WAN 2.1 control video).

Optional (but recommended): swap out the good old KSampler for the newer and improved FSampler, which will greatly speed up your generations.
Optional 2: I open the final video on my Samsung phone and do a quick automated enhancement and stabilization. You can skip this part, or use other software if you prefer.

And here’s the final video:
If you like this post, consider signing up for my newsletter to get all the latest tips & tricks directly in your inbox.
Also join my Patreon, where I release exclusive material for both free and paying members!
