After my last post on how to install CUDA 13.1, optimize for performance and use NVFP4, several people have asked about quality degradation when using NVFP4. This seems to come from the idea that every step of quantization results in a noticeable degradation of quality. That has, for the most part, been true, but NVFP4 quantization works differently than previous quantization methods.
Note: If you are specifically looking for benchmarks comparing BF16 and MXFP8 on Blackwell hardware, you can find that deep-dive here: https://zanno.se/blackwell-mxfp8-nvfp4/
Some people called my last post dishonest for not providing visual comparisons between FP16/8 and NVFP4. There are a few reasons why I didn’t do that, the main one being that the post was not about NVFP4 – it was about CUDA 13.1 and the possible optimizations, of which NVFP4 is one of several. Another reason is that I simply can’t fit the full FP16 models on my PC when it comes to newer models such as Flux 2 or even Qwen Image.
In this post I will do my best to show, and explain, the differences.
Please note that all images and videos in this post are created with default settings, using the templates available in ComfyUI. This means that I have not spent time tweaking settings, using LoRas or in any other way actively tried to get the best results. Instead the focus is on the differences between various quantization methods. To achieve this I have used the same settings, prompt and seed for every image and video.
Special note regarding SDXL.
Because SDXL uses a different type of CLIP than the rest of the models, the prompt has to be crafted in a different way, hence the results look quite different from those of the other models. However, the same prompt, settings and seed are used to compare SDXL FP32 and SDXL FP16.
Measuring Quality Degradation
To get past subjective ‘opinion’ and look at the raw technical reality, I’ve used a consistent methodology throughout this post.
For every comparison, I run a pixel-level ‘Difference Map.’
- The Baseline: You’ll see a black image. If it’s pure black, the two generations are, to the naked eye, identical.
- The Microscope: To show you what’s happening at the bit-level, I’ve included a second image with a 500% gamma boost. This makes the microscopic ’rounding noise’ visible.
If the gamma-boosted image shows only low-level ‘grain,’ it means the quantization is high-fidelity. If it shows structural outlines or major shifts in color, it means the model is struggling to interpret the data. This way, you don’t have to take my word for it—you can see exactly how much ‘math’ is being lost in translation.
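To make the methodology concrete, here is a minimal sketch of how such a difference map can be computed. The post does not specify the exact curve used, so I’m assuming the ‘500% gamma boost’ corresponds to a gamma exponent of 1/5; pixel values are treated as flat 8-bit lists for simplicity rather than real image files:

```python
def difference_map(img_a, img_b, gamma_boost=5.0):
    """Per-pixel absolute difference between two images, plus a
    gamma-boosted version that makes tiny rounding noise visible.

    img_a, img_b: flat lists of 8-bit pixel values (0-255).
    gamma_boost:  assumed interpretation of the '500% gamma boost'
                  (exponent 1/gamma_boost on the normalized diff).
    """
    diff = [abs(a - b) for a, b in zip(img_a, img_b)]
    boosted = [round(255 * (d / 255) ** (1.0 / gamma_boost)) for d in diff]
    return diff, boosted


# Two nearly identical 'images': the raw diff is invisible (0-2 out of 255),
# but the gamma boost amplifies the nonzero pixels into plainly visible grey.
diff, boosted = difference_map([10, 10, 12, 200], [10, 10, 10, 200])
print(diff)     # raw difference: mostly zeros
print(boosted)  # boosted: the lone diff of 2 becomes clearly visible
```

Identical pixels stay pure black under any boost, which is why a fully black baseline image is such a strong result: the gamma curve only amplifies differences that actually exist.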
This is how it’s made.


Throughout this post I will only show the baseline image compared to the gamma image.
If you want to hear the full 45-minute discussion on the theory behind these benchmarks, you can listen to the deep-dive episode on the Creepybits Podcast here: Quantization vs Quality Degradation
SDXL FP32 vs FP16
Before we dive into the experimental formats like NVFP4, we need to establish a baseline. You’ll often see models distributed in both FP32 (Full Precision) and FP16 (Half Precision).
Mathematically, FP32 is the gold standard used for training models. It is the ‘source of truth’—the format where every nuance, curve, and weight is calculated with the highest possible level of numerical fidelity.
However, for inference (the actual generation of images), our experiments reveal a different reality. When comparing images generated at FP32 against those generated at FP16, the differences are, for all practical purposes, non-existent.
My pixel-perfect comparison tests show that the visual ‘error’ between these two is well within the realm of standard mathematical rounding. In short: FP32 is essential for the creation of a model, but FP16 is more than enough for using one. Anything above FP16 is simply burning VRAM for no visible gain.
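To give a sense of how small that rounding error actually is, here is a quick stdlib-only check that round-trips a value through IEEE-754 half precision (Python’s `struct` supports the half-float `'e'` format). This is an illustration of FP16’s precision limit, not a reproduction of any model’s weights:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE-754 half precision (FP16)."""
    return struct.unpack('e', struct.pack('e', x))[0]

# A representative weight-like value; FP16 keeps ~3 significant decimal digits.
w = 0.123456789
w16 = to_fp16(w)
rel_err = abs(w16 - w) / abs(w)
print(f"{w} -> {w16}  (relative error {rel_err:.2e})")
```

For normal-range values the relative rounding error is bounded by 2⁻¹¹ (about 0.05%), which is why casting a model from FP32 to FP16 produces differences that sit in the ‘rounding noise’ regime rather than anything structural.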




Flux.1 Dev FP16 vs FP8 vs GGUF vs NVFP4
In this post, we’re going beyond the marketing hype to measure the ‘Quality vs. Performance’ trade-off in the Blackwell era. We aren’t just looking at how fast a model runs; we’re using pixel-level difference maps to see what actually happens to the image quality when we squeeze complex models like FLUX.1 Dev into different containers.
We will compare our ‘Gold Standard’ (FP16) against three popular quantization methods: the standard FP8, the widely used GGUF Q5_K_M, and the new, Blackwell-native NVFP4. The reference inference times are all based on 20 steps with the default settings of the Flux.1 dev template in ComfyUI.
- Flux.1 dev FP16 (22GB): 48 seconds
- Flux.1 dev FP8 (11GB): 35 seconds
- Flux.1 dev GGUF Q5_K_M (7.6GB): 53 seconds
- Flux.1 dev NVFP4 (8.5GB): 16 seconds
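To put those numbers in perspective, here is a small script that turns the timings above into relative speedups and model-size ratios. It is pure arithmetic on the figures listed, nothing more:

```python
# Benchmark figures from the list above (Flux.1 dev, 20 steps, ComfyUI defaults).
formats = {
    "FP16":        {"size_gb": 22.0, "seconds": 48},
    "FP8":         {"size_gb": 11.0, "seconds": 35},
    "GGUF Q5_K_M": {"size_gb": 7.6,  "seconds": 53},
    "NVFP4":       {"size_gb": 8.5,  "seconds": 16},
}

baseline = formats["FP16"]
for name, f in formats.items():
    speedup = baseline["seconds"] / f["seconds"]   # >1 means faster than FP16
    size = f["size_gb"] / baseline["size_gb"]      # fraction of FP16 model size
    print(f"{name:12s} {speedup:4.2f}x speed, {size:4.0%} of FP16 size")
```

Note that GGUF Q5_K_M is actually *slower* than FP16 here (0.91x) despite being a third of the size: it trades compute for memory via dequantization on the fly, whereas NVFP4 runs natively on Blackwell and lands at a flat 3x speedup.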












The ‘Performance vs. Fidelity’ Trade-off: My experiments show that no quantization method—including Blackwell-native NVFP4—is perfectly lossless compared to FP16. While NVFP4 offers an unparalleled speed advantage, it still introduces structural shifts in high-contrast areas (like neon lighting). The choice for the user is no longer ‘which one is perfect,’ but ‘which trade-off fits my workflow?’
Beyond just math and speed, I noticed a qualitative difference as well. FLUX.1 models are known for a specific, often heavy-handed jawline rendering—the so-called ‘Flux chin.’ Interestingly, in my testing, the NVFP4 format renders a softer, more natural facial structure, effectively bypassing the ‘Flux chin’ artifact that persists in the FP16 and FP8/GGUF versions. It suggests that NVFP4 isn’t just a faster way to run the model; it is, in some ways, a more aesthetically pleasing one.
WAN 2.2 GGUF Q4 K_M vs WAN 2.2 NVFP4
Since I’m unable to blend the videos to compare the differences, I will just put the results side-by-side. My personal opinion is that there are no visible quality differences between the videos, and the only difference I can spot is a wisp of smoke that is present in the GGUF video. Whether that is good or bad is up to each user to decide for themselves.
One other nice thing if you have a Blackwell GPU is that you also have access to the NVIDIA RTX Video Upscaler. This lets you upscale a 6-second 0.4-megapixel video to 1080p in 6.4 seconds, and the result is quite satisfying.
The side-by-side video says more than any benchmark log ever could. We aren’t just looking at faster render times; we are seeing a model that is natively interacting with the hardware to produce a higher-fidelity image, without the ‘translation tax’ of traditional quantization. NVFP4 isn’t just a new format—it’s a new standard for Blackwell performance.
Want a 5-minute technical briefing of these findings? Watch the video summary on YouTube: Bypassing the ‘Translation Tax’ for Local AI
If you haven’t already signed up for my newsletter, you can do that by filling in the form below. That way you will never miss these types of posts or other useful guides.
