
Let’s Talk! Multitalk, InfiniteTalk & WAN-S2V

The world of AI video has been a whirlwind lately, especially when it comes to making characters talk.

In just the last two months, we’ve been hit with a flood of new tools—MultiTalk, InfiniteTalk, and WAN-S2V—each promising to be the ultimate solution.

But which one actually works? To cut through the hype, I’ve spent the past few weeks in the digital trenches, putting each of these models through the gauntlet to find their true strengths, weaknesses, and hidden ‘gotchas’.

This is a no-nonsense, battle-tested guide to help you decide which tool is right for your project.

The Models

The Contenders: 1. MultiTalk (MeiGen)

MultiTalk was one of the first models in this new generation to make real waves, and for good reason. It’s a lightweight and incredibly fast engine that serves a very specific, but important, purpose in a production pipeline.

Strengths (The Magic)

  • Blazing Fast & Resource Friendly: In my tests, MultiTalk is the speed champion for short clips. It’s significantly faster than the other models and is incredibly light on VRAM. This makes it the perfect choice for anyone working on a local machine with a consumer-grade GPU (12-16GB). You can iterate and test ideas rapidly without waiting forever for a render.
  • Excellent Identity Coherence: It does a fantastic job of preserving the likeness of the source character. It treats the process more like applying digital makeup to an existing face rather than trying to rebuild it from scratch, which means your character won’t suddenly look like a distant cousin in the final output.
  • Simple Workflow: The basic setup for MultiTalk is relatively straightforward. It doesn’t require a dozen obscure helper nodes to get a decent result, making it very approachable for users who are new to video workflows.

Independent research like this is self-funded. If this guide saved you hours of troubleshooting, consider fueling the lab.

Support the Project

Weaknesses (The “Gotchas”)

  • The Audio Upload Quirk (CRITICAL): This is the most important discovery from my testing. MultiTalk’s performance is drastically different depending on how you provide the audio. If you feed it an audio file using a path, the sound in the final video is often muffled, and the lip-sync is poor.
    • For the best results, you MUST use the direct audio_upload widget in the node itself. This results in crystal-clear audio and far more accurate lip-sync. This single trick is the difference between a failed render and a great one.
  • Poor Performance on Long Clips: MultiTalk is a sprinter, not a marathon runner. The model has a hard cap at 15 seconds of video, so it’s simply not built for long-form narration (a quick pre-flight check for this limit is sketched after this list).
  • The “Digital Puppet” Effect: While it preserves the face well, it can sometimes lack subtle facial emotion. The mouth moves perfectly, but the surrounding muscles (cheeks, eyes) can remain a bit static, occasionally giving the character a slightly stiff, “ventriloquist dummy” look.
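If you want to guard against that 15-second ceiling before committing to a render, a tiny pre-flight check on the driving audio is enough. Below is a minimal sketch using Python’s standard wave module; the file name is a placeholder, and I’m assuming the cap is governed by the length of the audio you feed in.

```python
# Minimal pre-flight check for MultiTalk's 15-second cap (assumption: the cap
# is driven by the length of the uploaded audio). Works on uncompressed WAV files.
import wave

MAX_SECONDS = 15.0  # hard cap reported above

def audio_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

duration = audio_seconds("dialogue.wav")  # placeholder file name
if duration > MAX_SECONDS:
    print(f"Clip is {duration:.1f} s, which is over the MultiTalk cap; trim it or switch to InfiniteTalk.")
```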

Best Use Case: The Verdict

MultiTalk is the undisputed champion of short-form, rapid-turnaround content.

It is the perfect tool for creating social media clips, memes, reaction videos, or for quickly testing a line of dialogue before committing to a slower, more resource-intensive render with another model. If you’re working locally on a mid-range GPU, this should be your go-to engine for clips under 15 seconds.

To see what MultiTalk is capable of at its best, here is a short clip I generated. The lip-sync is fantastic, and the character’s identity is perfectly preserved.

However, due to a critical mistake in my workflow (uploading a 2-minute audio clip and then cropping it down to 6 seconds), this 6-second clip took over an hour to render.

In contrast, this second clip, which has a slightly higher overall resolution (409,000 pixels compared to 399,360 pixels), took only 13 minutes to generate. This shows how important it is to be as precise as possible about both the length of the audio and the dimensions of the image you feed in.
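In practice, that means preparing the inputs before they ever touch the workflow: cut the audio down to exactly the length you intend to render, and know the pixel count of your source image. Here is a rough sketch of how that could be scripted; it assumes ffmpeg is on your PATH and Pillow is installed, and the file names and durations are placeholders.

```python
# Sketch: pre-trim the audio and check the pixel budget of the source image
# before loading them into the workflow. ffmpeg and Pillow are assumed to be installed.
import subprocess
from PIL import Image

def trim_audio(src: str, dst: str, seconds: float) -> None:
    """Cut the audio to exactly the length of video you intend to render."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-t", str(seconds), "-c", "copy", dst],
        check=True,
    )

def pixel_budget(image_path: str) -> int:
    """Total pixel count of the source image, i.e. the kind of number quoted above."""
    width, height = Image.open(image_path).size
    return width * height

trim_audio("narration_full.wav", "narration_6s.wav", 6.0)   # placeholder names
print(pixel_budget("portrait.png"), "pixels")                # placeholder name
```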

The Contenders: 2. InfiniteTalk (MeiGen)

If MultiTalk is the short-form sprinter, InfiniteTalk is designed to be the distance runner. It belongs to the same family, but its architecture is built for long-form narrative and continuous, multi-segment speech, aiming to eliminate the need for manual clipping and stitching.

Strengths (The Magic)

  • The Marathon Runner: This is its killer feature. InfiniteTalk is built to create longer videos, without the hard 15-second limit of its sibling.
  • Rock-Solid Identity Coherence: The likeness to the source character is absolutely flawless; it’s even more stable than MultiTalk. The model doesn’t seem to improvise at all, resulting in a crystal-clear, high-fidelity output that never deviates from the source face.
  • High-Fidelity Lip-Sync: The actual mouth movement and synchronization are excellent. The phoneme shapes are accurate and the timing is spot-on, which is critical for long-form content where errors would be more noticeable and jarring.

Weaknesses (The “Gotchas”)

  • The Source Material: For MultiTalk you upload an image and a separate audio clip, and the model combines them into the final video with that audio. InfiniteTalk, however, requires a full video with embedded audio alongside your image, and produces a video of your character driven by that audio. It’s much easier to find a usable audio clip than a proper video with audio (a simple workaround is sketched below).
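If you only have an image and an audio clip, one possible workaround is to mux the audio onto a looped still frame and use that as the driving video. The sketch below shows the idea with ffmpeg called from Python; the file names and frame rate are assumptions, and how well a static driving video behaves will depend on the specific InfiniteTalk workflow.

```python
# Sketch: build a video with embedded audio from a still image and an audio clip,
# so it can be fed to workflows that expect a video input. Requires ffmpeg on PATH.
import subprocess

def image_plus_audio_to_video(image: str, audio: str, output: str, fps: int = 25) -> None:
    """Loop the still image for the full duration of the audio and mux them together."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-loop", "1", "-i", image,     # repeat the single frame
            "-i", audio,                   # the dialogue/narration track
            "-c:v", "libx264", "-tune", "stillimage",
            "-c:a", "aac",
            "-r", str(fps),
            "-pix_fmt", "yuv420p",
            "-shortest",                   # stop when the audio ends
            output,
        ],
        check=True,
    )

image_plus_audio_to_video("portrait.png", "dialogue.wav", "driver_clip.mp4")  # placeholders
```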

Best Use Case: The Verdict

InfiniteTalk is the quintessential tool for longer videos, such as this one.


This 27-second video clip took ~20 minutes to render. And while a ~20-minute render time for a 27-second clip is a serious commitment, it also needs to be put in perspective: it works out to less than 1 minute of rendering time per second of output video, while the 6-second MultiTalk clip took just over 2 minutes of rendering time per second of video.
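To make the comparison concrete, here is the same arithmetic as a few lines of Python, using the timings from my own runs above (these are anecdotal numbers, not formal benchmarks).

```python
# Render cost per second of output video, based on the timings reported in this post.
clips = {
    "MultiTalk (6 s clip)": (13 * 60, 6),       # 13 minutes of rendering for 6 s of video
    "InfiniteTalk (27 s clip)": (20 * 60, 27),  # ~20 minutes of rendering for 27 s of video
}
for name, (render_seconds, output_seconds) in clips.items():
    rate = render_seconds / output_seconds / 60
    print(f"{name}: {rate:.2f} minutes of rendering per second of output")
```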

The Contenders: 3. WAN-S2V (Alibaba)

While I have seen people claiming to create beautiful videos using this model, I have not managed that at all, and I cannot in good conscience recommend this model in its current state. It’s hard to say whether the low quality stems from the model itself, the port to ComfyUI, or the specific nodes and workflow.

However, I believe that the default settings provided in a workflow should produce at least a decent result, and I have tested two different workflows, both taken from the official ComfyUI port on Hugging Face. Neither produced reasonably good output with its default settings, and even after trying various other settings the quality simply isn’t good enough.

Whatever the reason for these issues, I hope they get fixed so it’s possible to run some real tests with the model.

The results.

Second result.

Also consider signing up for my newsletter, where you will receive the latest AI news directly in your inbox.

Published in AI, AI Video, ComfyUI, English, Tech