China has made impressive strides in working around the limits of NVIDIA’s trimmed-down AI accelerators. DeepSeek has introduced new software that it claims delivers up to eight times the effective TFLOPS from NVIDIA’s Hopper H800 accelerators.
DeepSeek’s FlashMLA Promises to Maximize China’s AI Performance Using NVIDIA’s Modified Hopper GPUs
Instead of relying solely on external sources to improve their hardware capabilities, Chinese companies, particularly DeepSeek, are turning to clever software solutions to leverage existing equipment. DeepSeek’s latest innovations are making waves, as they claim to have extracted substantial performance from NVIDIA’s scaled-back Hopper H800 GPUs. They’ve achieved this by optimizing memory usage and resource allocation across various inference requests.
For context, DeepSeek is hosting an “Open Source Week,” releasing a series of tools and technologies publicly on GitHub. The event kicked off with the reveal of FlashMLA, a decoding kernel tailored for NVIDIA’s Hopper GPUs, and the improvements it brings are worth digging into.
DeepSeek reports achieving 580 TFLOPS for BF16 matrix multiplication on the Hopper H800, a figure it claims is roughly eight times typical achieved throughput. With careful memory management, FlashMLA also delivers up to 3,000 GB/s of memory bandwidth, approaching the H800’s theoretical peak. What’s remarkable is that all of these gains come from software tweaks rather than any physical hardware upgrades.
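The reason a kernel gets quoted both a TFLOPS number and a GB/s number is that different workloads hit different ceilings: a kernel is either compute-bound or memory-bound depending on how much math it does per byte it moves. The back-of-envelope sketch below illustrates that roofline-style reasoning; the peak figures are approximate, publicly cited Hopper numbers and are my assumption, not values from DeepSeek’s release.

```python
# Rough roofline check: is a kernel compute-bound or memory-bound?
# Peak figures are approximate Hopper-class numbers (assumed here).
PEAK_BF16_FLOPS = 990e12   # approx. dense BF16 peak, FLOP/s
PEAK_BANDWIDTH = 3.35e12   # approx. HBM3 bandwidth, bytes/s

def bound(flops: float, bytes_moved: float) -> str:
    """Compare a kernel's arithmetic intensity (FLOPs per byte)
    to the machine balance of the GPU."""
    intensity = flops / bytes_moved
    machine_balance = PEAK_BF16_FLOPS / PEAK_BANDWIDTH  # ~295 FLOPs/byte
    return "compute-bound" if intensity > machine_balance else "memory-bound"

# Decode-time attention reads the whole KV cache but does little math
# per byte, so it typically lands on the memory-bound side.
print(bound(flops=2e9, bytes_moved=1e9))   # prints "memory-bound"
```

This is why a decoding kernel is judged by the GB/s it sustains, while large matrix multiplications are judged by TFLOPS.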
The core of FlashMLA’s performance leap is its “low-rank key-value compression.” In simpler terms, this technique factors large key-value caches into smaller, more manageable pieces, speeding up processing and cutting memory use by a claimed 40% to 60%. The other key feature is a block-based paging system that allocates memory in fixed-size blocks according to task demands, letting models handle variable-length sequences far more efficiently and boosting overall performance.
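To make the low-rank idea concrete, here is a minimal NumPy sketch of compressing a key/value matrix with a truncated SVD: instead of caching the full matrix, only two thin factors are stored. This is an illustration of low-rank compression in general, not DeepSeek’s actual MLA kernel; the sizes and rank are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, head_dim, rank = 512, 128, 16  # illustrative sizes, not FlashMLA's

# A stand-in for one head's key/value cache.
kv = rng.standard_normal((seq_len, head_dim))

# Truncated SVD keeps only the top-`rank` singular directions.
u, s, vt = np.linalg.svd(kv, full_matrices=False)
kv_approx = (u[:, :rank] * s[:rank]) @ vt[:rank]  # low-rank reconstruction

# Storage comparison: two thin factors vs. the full matrix.
full_floats = kv.size
compressed_floats = u[:, :rank].size + rank + vt[:rank].size
print(f"stored floats: {compressed_floats} vs {full_floats}")
```

For this shape the factors take roughly a sixth of the storage of the full matrix; the trade-off is approximation error, which is why the rank must be chosen so the model’s accuracy survives.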
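The block-based paging idea can likewise be sketched in a few lines: sequences are backed by fixed-size physical blocks handed out from a shared pool, so a growing sequence only claims memory one block at a time. FlashMLA’s repository reportedly uses 64-token blocks, which the sketch borrows; the `PagedKVCache` class itself is hypothetical and not DeepSeek’s API.

```python
BLOCK_SIZE = 64  # tokens per block; FlashMLA reportedly uses 64

class PagedKVCache:
    """Toy paged KV-cache allocator (illustrative, not DeepSeek's code)."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append(self, seq_id: int, num_tokens: int) -> None:
        """Grow a sequence; allocate new blocks only on boundary crossings."""
        length = self.lengths.get(seq_id, 0) + num_tokens
        table = self.tables.setdefault(seq_id, [])
        needed = -(-length // BLOCK_SIZE)    # ceiling division
        while len(table) < needed:
            table.append(self.free.pop())
        self.lengths[seq_id] = length

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
cache.append(seq_id=0, num_tokens=100)  # 100 tokens -> 2 blocks of 64
print(len(cache.tables[0]))             # prints 2
```

Because no sequence reserves more than it currently needs, many variable-length requests can share one memory pool with little fragmentation, which is what makes batched inference over mixed-length prompts efficient.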
What DeepSeek has achieved with FlashMLA illustrates that AI computing doesn’t hinge on any single element, hardware included. At present the tool targets Hopper GPUs, and it will be interesting to see what results it can deliver on the H100.