If you spend any amount of time running local AI models, you’ve probably seen the acronym “CUDA” thrown around in error logs, forum posts, and installation guides. We all know we need it, but what actually is it?
In the simplest terms, CUDA (Compute Unified Device Architecture) is the ultimate translator between your software and your hardware.
Out of the box, your CPU is a generalist—it’s great at doing a few complex tasks very quickly, one at a time. Your NVIDIA GPU, on the other hand, is loaded with thousands of tiny cores designed to do simple math simultaneously. However, standard software (like Python) doesn’t naturally know how to speak to those thousands of GPU cores.
Without CUDA, your insanely expensive graphics card is basically just a really powerful space heater that happens to be good at rendering video games.
CUDA is NVIDIA’s proprietary programming interface that bridges this gap. It allows frameworks like PyTorch (and by extension, ComfyUI) to bypass the CPU and send heavy, parallel workloads—like the billions of tensor calculations required to generate an image or run a massive local LLM—directly to the GPU.
But here is the catch: as NVIDIA releases new GPU architectures (like the RTX 5xxx Blackwell cards), they also update CUDA to include brand new, highly optimized instruction sets. If you are running an older version of CUDA on a brand new card, you are essentially driving a sports car in second gear.
Today, we are going to fix that. NVIDIA just dropped the CUDA 13.1 toolkit, bringing massive optimizations for modern AI inference. In this guide, I’ll walk you through exactly how to upgrade your system to natively support it, banish those “version mismatch” errors, and squeeze every last drop of performance out of your ComfyUI setup.
Why upgrade to CUDA 13.x?
Since you found my website, you probably use AI either for work or for fun, and you have most likely heard of quantization formats such as fp8 and int4. Using CUDA 13.x together with an NVIDIA Blackwell GPU gives you access to NVIDIA’s own quantization format, NVFP4, which makes models approximately 3 times faster and 1.8 times smaller than fp8 quantization.
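The 1.8x size figure can be sanity-checked with back-of-envelope math. Assuming NVFP4 stores 4-bit values in 16-element blocks, each block carrying one 8-bit scale factor (the micro-block scheme NVIDIA describes; the model size below is purely illustrative), the effective bits per weight work out like this:

```python
# Back-of-envelope memory estimate: FP8 vs NVFP4 weights.
# Assumption: NVFP4 = 4 data bits per weight plus one 8-bit scale
# shared by each 16-element block. Model size is illustrative.

def bits_per_weight_fp8() -> float:
    return 8.0

def bits_per_weight_nvfp4(block_size: int = 16, scale_bits: int = 8) -> float:
    # 4 data bits per weight plus the amortized per-block scale
    return 4.0 + scale_bits / block_size

def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

if __name__ == "__main__":
    n = 12e9  # a hypothetical ~12B-parameter diffusion model
    fp8 = model_size_gb(n, bits_per_weight_fp8())
    nvfp4 = model_size_gb(n, bits_per_weight_nvfp4())
    print(f"FP8:   {fp8:.2f} GB")
    print(f"NVFP4: {nvfp4:.2f} GB")
    print(f"ratio: {fp8 / nvfp4:.2f}x smaller")
```

With an 8-bit scale per 16 weights, NVFP4 lands at an effective 4.5 bits per weight, and 8 / 4.5 ≈ 1.78, matching the "1.8 times smaller" claim.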
Another benefit is the ability to use Flash Attention 4, written in CuTeDSL and optimized for the Blackwell and Hopper architectures. Flash Attention 4 is roughly 2 times faster than Flash Attention 3, and 2.7 times faster than Triton.
Finally, your GPU can handle tile-based GPU programming through Julia, making it much easier to develop high-performance CUDA kernels without needing cuTile Python as a bridge.
Is the upgrade worth it?
Before diving into the tutorial, I ran benchmark tests on my RTX 50-series Blackwell GPU. Simply by matching the CUDA 13.1 Toolkit and PyTorch 2.10 to the hardware, I saw staggering increases in inference speed: up to a 70% reduction in generation time.
I generated the exact same images, using the exact same settings and models, both before and after updating my system using the steps in this guide.
I also tested Z-image Turbo and Qwen Image with NVIDIA’s native 4-bit quantization (NVFP4). By unlocking this on my Blackwell GPU, the speed gains became truly transformative.
- Z-image Turbo bf16 (12 steps):
  - Before: 43 seconds ➔ After: 26 seconds (~40% faster) ➔ NVFP4: 13 seconds (~70% faster)
- Qwen Image fp8 (20 steps):
  - Before: 198 seconds ➔ After: 124 seconds (~37% faster) ➔ NVFP4: 77 seconds (~61% faster)
- Flux 1 Dev bf16 (20 steps):
  - Before: 56 seconds ➔ After: 46 seconds (~18% faster)
The most impressive part? Despite the extreme compression, the image quality remains indistinguishable from the BF16 originals. You are getting 16-bit quality at 4-bit speeds.
Note: The above statement is theoretical, based on the math released by NVIDIA. For visual comparisons, I have written about full, half and quarter precision as well as GGUF and NVFP4 here: https://zanno.se/quantization-and-quality-degradation/
For a comparison in quality between BF16 and MXFP8 please use this link instead: https://zanno.se/blackwell-mxfp8-nvfp4/
Here’s what you need to follow this guide
- The latest NVIDIA drivers for your GPU (Important!)
- NVIDIA RTX 5xxx GPU
- Python 3.13 (ComfyUI or other installation)
- Minimum of 10 GB free hard drive space
Installing and upgrading
Drivers
The very first thing you have to do is download and install the latest drivers for your NVIDIA card (at minimum you need driver version 580 or newer). You can download the driver from NVIDIA’s desktop app if you are using it, or you can find it at NVIDIA drivers. I personally prefer the Studio drivers, but if you need the Game Ready drivers for some reason, those will work as well.
Check Python version
If your Python version is lower than 3.13 you will have to update. If you are unsure of your Python version, open PowerShell inside your python_embeded folder and run ./python.exe --version.
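If you prefer checking from a script, a minimal sketch (run it with the embedded interpreter; the filename is arbitrary) could look like this:

```python
# Sanity check: does this interpreter meet the 3.13 minimum?
# Save as e.g. check_python.py and run with python_embeded/python.exe
import sys

MINIMUM = (3, 13)

def meets_minimum(version: tuple, minimum: tuple = MINIMUM) -> bool:
    # tuple comparison: (3, 12) < (3, 13) < (3, 14)
    return version >= minimum

if __name__ == "__main__":
    current = sys.version_info[:2]
    verdict = "OK" if meets_minimum(current) else "too old, 3.13+ required"
    print(f"Python {current[0]}.{current[1]}: {verdict}")
```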

If you, like me, have Python 3.12.x, then this is what you need to do to update.
To update your Python, you will have to completely re-install ComfyUI, which comes with Python 3.13 embedded. You can download ComfyUI portable from GitHub and install it by following their instructions.
Before you do a completely new installation, I highly recommend that you move your whole model folder, located in your ComfyUI directory, somewhere outside of the current ComfyUI installation. This way you keep all your models and LoRAs and can use them in the new installation, which will save you a lot of time and headache. You might also want to note which custom nodes you have installed, so you can re-install them later. If you don’t, you can simply install the nodes you need as you need them.
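The backup step can be sketched in a few lines of Python if you want it scripted. The paths and function name here are illustrative, not part of ComfyUI; adjust them to your own layout:

```python
# Illustrative sketch: move the models folder out of the old install
# before deleting it. Paths are examples only; adjust to your setup.
import shutil
from pathlib import Path

def backup_models(comfy_dir: Path, backup_dir: Path) -> Path:
    """Move <comfy_dir>/models to <backup_dir>/models, return the new path."""
    src = comfy_dir / "models"
    dest = backup_dir / "models"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    return dest

# Hypothetical usage:
# backup_models(Path("C:/ComfyUI_windows_portable/ComfyUI"),
#               Path("C:/comfy_backup"))
```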
Do not copy/move your custom nodes folder with the purpose of re-using it in your new ComfyUI installation. While it might sound like a smart idea that will save you time, you will end up with a lot of dependency issues when ComfyUI tries to install all the custom nodes at the same time.
When you have made sure you have moved or copied your model folder to a safe location, delete the whole ComfyUI installation folder. Unpack the new ComfyUI you just downloaded to where you want the new installation and run the appropriate start script (i.e. run_nvidia_gpu.bat). ComfyUI will now install and update your dependencies.
The good news, if you had to re-install ComfyUI, is that you will now have the correct CUDA and PyTorch installed, as they come bundled with the latest portable ComfyUI. You can check and confirm your current versions using these commands.
python_embeded/python.exe -m pip show torch
python_embeded/python.exe --version
nvcc --version

If you just re-installed ComfyUI and you can confirm your PyTorch, CUDA and Python versions with the commands above, you are done. Before starting up your new ComfyUI, move the model folder you previously saved back into your new ComfyUI installation folder.
CUDA Toolkit
Next you will need to get the NVIDIA CUDA Toolkit. You can either download the full installation file (approximately 2.5 GB) or use the network installer; pick whichever you like. Download the toolkit from here: CUDA Toolkit 13.1
Once you have downloaded the installation file, simply install it like any other program. You can use the quick installation option. When the toolkit is installed, restart your computer.
After restarting your computer, check your CUDA version by opening PowerShell and typing:
nvcc --version
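If you want to check the toolkit version from a script rather than by eye, a small helper sketch could parse the output. The parsing assumes nvcc’s usual "release X.Y" wording; the function name is my own:

```python
# Sketch: extract the CUDA release number from `nvcc --version` output.
# Assumes nvcc's usual "... release 13.1, ..." wording.
import re
import shutil
import subprocess

def cuda_release(nvcc_output: str) -> "str | None":
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None

if __name__ == "__main__":
    if shutil.which("nvcc"):
        out = subprocess.run(["nvcc", "--version"],
                             capture_output=True, text=True).stdout
        print("CUDA release:", cuda_release(out))
    else:
        print("nvcc not found on PATH; is the toolkit installed?")
```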
Install PyTorch
Before you install the new PyTorch, I recommend that you uninstall the old packages first. To do so, open PowerShell in your ComfyUI installation folder and run the command:
python_embeded/python.exe -m pip uninstall torch torchvision torchaudio
To install PyTorch 2.10 for CUDA 13.0, run the following command in the same powershell window:
python_embeded/python.exe -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
NOTE! If you want to be able to use Flash Attention 4, you have to install the nightly version of PyTorch. FlexAttention has been part of PyTorch since version 2.5, and the nightly build now also has a FlashAttention-4 backend. PyTorch automatically generates CuTeDSL score/mask modification functions and uses them to JIT-instantiate FlashAttention-4 for custom attention variants.
To install PyTorch 2.10 nightly build for CUDA 13, run the following command in powershell:
python_embeded/python.exe -m pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu130
Confirm you have the correct installation by running the command:
python_embeded/python.exe -m pip show torch
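In the `pip show` output, the version string itself tells you which CUDA line the wheel was built for, thanks to the "+cuXYZ" naming convention used by the PyTorch wheel index (e.g. "2.10.0+cu130" means CUDA 13.0). A hedged little parser, if you want to automate the check (the function name is my own):

```python
# Sketch: read the CUDA build out of a torch wheel's version string.
# Relies on the "+cuXYZ" convention of the PyTorch wheel index,
# where the last digit is the minor version (cu130 -> CUDA 13.0).
def cuda_build(torch_version: str) -> "str | None":
    if "+cu" not in torch_version:
        return None  # CPU-only or differently tagged build
    tag = torch_version.split("+cu", 1)[1]
    return f"{tag[:-1]}.{tag[-1]}"

print(cuda_build("2.10.0+cu130"))  # 13.0
```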
Independent research like this is self-funded. If this guide saved you hours of troubleshooting, consider fueling the lab.
Support the Project

Comfy Kitchen
By default, Comfy Kitchen is installed to work on all systems that currently work with ComfyUI. Since you have an NVIDIA Blackwell card, you will want to uninstall the default version and install the version that uses cuBLAS.
Uninstall the default Comfy Kitchen by running this command:
python_embeded/python.exe -m pip uninstall comfy-kitchen
Install the version of Comfy Kitchen that is specifically optimized for your Blackwell GPU by running this command:
python_embeded/python.exe -m pip install comfy-kitchen[cublas]
Sage Attention 2++ and 3
In some situations Sage Attention might still be useful. However, Sage Attention 2++ and 3 were removed from PyPI due to a bug. The bug has since been fixed, but the packages have not yet been restored on PyPI. This means that you either have to compile the code yourself, or see if you can find a .whl that works for your system.
Prepare your GPU for tile-based computation
While tile-based programming isn’t standard in ComfyUI at the time of writing, I am confident that we will see a lot of it in the future. To prepare your system for this, you will need to install a few packages.
Install cuTile
python_embeded/python.exe -m pip install cuda-tile
Install CUDA 13.x ready Cupy
python_embeded/python.exe -m pip install cupy-cuda13x
Make sure that Pytest and Numpy are installed
python_embeded/python.exe -m pip install pytest numpy
To verify that everything works, save the following script as VectorAdd_quickstart.py:

import cupy as cp
import numpy as np
import cuda.tile as ct

@ct.kernel
def vector_add(a, b, c, tile_size: ct.Constant[int]):
    # Get the 1D block id
    pid = ct.bid(0)
    # Load input tiles
    a_tile = ct.load(a, index=(pid,), shape=(tile_size,))
    b_tile = ct.load(b, index=(pid,), shape=(tile_size,))
    # Perform elementwise addition
    result = a_tile + b_tile
    # Store the result tile
    ct.store(c, index=(pid,), tile=result)

def test():
    # Create input data
    vector_size = 2**12
    tile_size = 2**4
    grid = (ct.cdiv(vector_size, tile_size), 1, 1)
    a = cp.random.uniform(-1, 1, vector_size)
    b = cp.random.uniform(-1, 1, vector_size)
    c = cp.zeros_like(a)
    # Launch kernel
    ct.launch(cp.cuda.get_current_stream(),
              grid,  # 1D grid of processors
              vector_add,
              (a, b, c, tile_size))
    # Copy to host only to compare
    a_np = cp.asnumpy(a)
    b_np = cp.asnumpy(b)
    c_np = cp.asnumpy(c)
    # Verify results
    expected = a_np + b_np
    np.testing.assert_array_almost_equal(c_np, expected)
    print("✓ vector_add_example passed!")

if __name__ == "__main__":
    test()
You can also download the code here: VectorAdd_quickstart.py
Unzip VectorAdd_quickstart.zip to your ComfyUI installation folder and run the following command:
python_embeded/python.exe VectorAdd_quickstart.py
If everything is installed and working correctly this message will show in your console window:
✓ vector_add_example passed!
A final word
By following this guide you have not only updated your system to work with the latest technology and software for performance and efficiency, but have also prepared it for what will become standard in the not-too-distant future.
A lot of things that until very recently were annoying and complicated to install, often causing dependency issues or only being available for specific platforms or operating systems, are now integrated for convenience.
For example:
- Triton is now integrated in Comfy Kitchen and doesn’t need to be installed independently
- Flash Attention 4 is integrated in the PyTorch nightly build, and we no longer have to compile our own code to make it work on Windows.
- CUDA 13.x not only works with cuTile Python, but also makes tile-based programming available from Julia, which could make complex kernel development for ComfyUI easier and more efficient in the future.
