A couple of weeks ago I wrote a detailed guide on how to update your ComfyUI system to CUDA 13.1 and utilize the full force of NVIDIA Blackwell GPUs. In that post I stated that the quality of an image created using NVFP4 is equal to that of an image created using BF16.
This was a theoretical statement, based on the math NVIDIA released about a month ago.
Because of this, some readers called the post dishonest. I can understand that reaction given the wording, but dishonesty was never my intention. So I wrote a second post going through the differences in quality and speed between various quantization methods, stepping down from full precision (FP32) to half precision (FP16), quarter precision (FP8), GGUF, and finally NVFP4.
It later hit me that while NVIDIA writes about Blackwell GPUs being effective for both FP8 and NVFP4, ‘regular’ FP8 doesn’t support Micro-Scaling (MX). To see the actual Blackwell benefit, you need an MXFP8 model. The problem? Almost no one is using that terminology yet, and most ‘FP8’ models you find are the legacy versions.
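The practical difference is where the scale factors live. Per the OCP Microscaling spec, MX formats give every block of 32 values its own power-of-two scale, while legacy FP8 typically uses a single scale for the whole tensor (or channel). Here is a deliberately crude NumPy sketch of that idea — integer rounding stands in for real FP8 mantissa rounding, and this is my own illustration, not NVIDIA's implementation:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3
BLOCK = 32            # MX block size per the OCP Microscaling spec

def quantize_per_tensor(x):
    """Legacy-style FP8: one scale shared by the entire tensor."""
    scale = np.max(np.abs(x)) / FP8_E4M3_MAX
    return np.clip(np.round(x / scale), -FP8_E4M3_MAX, FP8_E4M3_MAX) * scale

def quantize_mx_blocks(x):
    """MX-style: each block of 32 values gets its own power-of-two scale."""
    out = np.empty_like(x)
    for i in range(0, len(x), BLOCK):
        block = x[i:i + BLOCK]
        # MX scales (E8M0) are powers of two
        scale = 2.0 ** np.ceil(np.log2(np.max(np.abs(block)) / FP8_E4M3_MAX))
        out[i:i + BLOCK] = np.clip(np.round(block / scale),
                                   -FP8_E4M3_MAX, FP8_E4M3_MAX) * scale
    return out

# A tensor with one large outlier: per-tensor scaling crushes every small
# value, block scaling only affects the outlier's own block.
np.random.seed(0)
x = np.concatenate([np.random.randn(32) * 0.01, [100.0],
                    np.random.randn(31) * 0.01])
err_tensor = np.mean(np.abs(x - quantize_per_tensor(x)))
err_block = np.mean(np.abs(x - quantize_mx_blocks(x)))
print("per-tensor error:", err_tensor, " block error:", err_block)
```

With the outlier present, the per-block version loses noticeably less precision, which is exactly why MX-capable hardware matters for weight distributions with heavy tails.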
I even attempted to bake my own MXFP8 models using the official NVIDIA Model Optimizer (TRT-MO). However, the process is incredibly resource-intensive, and I quickly found the physical limits of my current hardware. For those with high-end workstations who want to try the conversion process themselves, it’s a powerful toolkit—but be prepared for a serious workout for your GPU and System RAM.
I have run some tests using the only MXFP8 model I found, which is Z-Image Turbo, and in this post I will compare it to the BF16 and NVFP4 models.
Measuring Quality Degradation
To get past subjective ‘opinion’ and look at the raw technical reality, I’ve used a consistent methodology throughout this post.
For every comparison, I run a pixel-level ‘Difference Map.’
- The Baseline: You’ll see a raw difference image. If it’s pure black, the two images are identical at the pixel level.
- The Microscope: To show you what’s happening at the bit level, I’ve included a second image with a 500% gamma boost. This makes the microscopic ‘rounding noise’ visible.
If the gamma-boosted image shows only low-level ‘grain,’ the quantization is high-fidelity. If it shows structural outlines or major shifts in color, the model is struggling to interpret the data. This way, you don’t have to take my word for it: you can see exactly how much ‘math’ is being lost in translation.
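If you want to reproduce this at home, the technique boils down to a few lines of NumPy. This is a sketch of the general method, not my exact ComfyUI node graph, and I’m modeling the ‘500% boost’ as a simple 5x amplification with clipping:

```python
import numpy as np

def difference_map(img_a, img_b, boost=5.0):
    """Per-pixel absolute difference of two uint8 RGB arrays.

    Returns the raw map (pure black where the images match) and a
    boosted copy that makes faint rounding noise visible to the eye.
    """
    diff = np.abs(img_a.astype(np.int16) - img_b.astype(np.int16)).astype(np.uint8)
    boosted = np.clip(diff.astype(np.float32) * boost, 0, 255).astype(np.uint8)
    return diff, boosted

# Synthetic demo: a base image plus tiny simulated quantization grain.
np.random.seed(1)
a = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
noise = np.random.randint(0, 3, a.shape, dtype=np.uint8)
b = np.clip(a.astype(np.int16) + noise, 0, 255).astype(np.uint8)

diff, boosted = difference_map(a, b)
print("raw max:", diff.max(), " boosted max:", boosted.max())
```

To run it on real renders, load each PNG with `PIL.Image.open(...).convert("RGB")`, pass the arrays in, and save the results with `PIL.Image.fromarray(...)`.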
Independent research like this is self-funded. If this guide saved you hours of troubleshooting, consider fueling the lab.
Support the Project
Comparing BF16 with MXFP8 and NVFP4
System:
NVIDIA RTX 5060Ti 16GB VRAM
64GB DDR4 System RAM
1TB Kingston M.2 PCIe 4.0
Python 3.13.11
CUDA 13.1
PyTorch 2.12.0.dev20260307+cu130
Models used:
- Z-Image Turbo BF16 – inference time 16.93 sec at 12 steps
- Z-Image Turbo MXFP8 – inference time 8.96 sec at 12 steps
- Z-Image Turbo NVFP4 – inference time 7.62 sec at 12 steps
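For context, the timings above translate into the following speedups — this is plain arithmetic on the numbers already listed, nothing newly measured:

```python
# Measured inference times (seconds per 12-step run) from the list above
bf16, mxfp8, nvfp4 = 16.93, 8.96, 7.62

print(f"MXFP8 speedup vs BF16: {bf16 / mxfp8:.2f}x")   # ~1.89x
print(f"NVFP4 speedup vs BF16: {bf16 / nvfp4:.2f}x")   # ~2.22x
print(f"NVFP4 saves {mxfp8 - nvfp4:.2f} s per image over MXFP8")
```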
Below is the full matrix of the various tests done.

The Verdict: Choosing Your Priority
Looking at the data above, the “degradation” myth is officially busted, but the choice between formats isn’t just about speed.
- MXFP8 (The Fidelity Standard): For a mere 1.3-second “tax” over NVFP4, MXFP8 stays remarkably closer to the 16-bit BF16 original. As seen in the bottom-left difference map, it has the lowest “error budget,” making it the ideal choice for professional-grade work where you want the 16-bit look at 8-bit speeds.
- NVFP4 (The Performance King): At 7.62 seconds, it is the fastest of the three. The forensic map shows it drifts the furthest from the original math, yet it also reveals a fascinating Blackwell paradox: in areas of high-frequency detail like the “Cabana” neon sign, the 4-bit version actually produces sharper, more legible text than the 16-bit original.

Final Thoughts
If you have a Blackwell GPU and you are still running models in BF16, you are leaving over 50% of your performance on the table for visual differences that are invisible to the naked eye.
For the ultimate balance, MXFP8 is the winner. It provides near-perfect fidelity and a ~1.9x speedup. But if you’re doing high-volume generation or video work where every second counts, NVFP4 is a miraculous piece of hardware-native engineering that might even clean up your text in the process.
Much still comes down to personal preference and situational use, but as models keep getting larger, we will inevitably lean more heavily on quantization – if we want to keep running them locally.
If you haven’t already signed up for my newsletter, you can do that by filling in the form below. That way you will never miss these types of posts or other useful guides.
