The Case for GPU-Accelerated Data Analytics

For analytics workloads that fit in fast memory, the hardware case for GPU is strengthening — but the story is more nuanced than raw compute numbers suggest.

TL;DR

AI agents are changing the analytics workload — agentic speculation is exploding demand for structured analytic compute.
CPU analytics has had a great run, but for in-memory workloads the gap is shifting to GPU — CPU-centered databases powered enterprise analytics for decades, but for workloads that fit in fast memory, GPU bandwidth has crossed into a clear and durable lead. Compute and cost advantages, once decisive, are now compressing as GPU prices rise faster than per-chip gains.
GPU-accelerated databases are rising in research and industry — Conferences have seen a wave of GPU database papers since 2020, and GPU acceleration is reaching production tools. Yet building correct, full-featured GPU query engines remains a formidable engineering challenge.
NVIDIA has built a moat with RAPIDS AI and libcudf — virtually every GPU-accelerated analytic system today is built on libcudf, making it the critical layer to understand in this space.

The Analytics Workload Is Changing Fast — Enter AI Agents

In 2025, LLM-powered AI agents started proving their value, and their adoption has been rapidly spreading across enterprises, particularly for data analytics and insights extraction. The 2025 State of AI in Enterprise report shows that companies are now moving from piloting the technology to actually deploying it in production, noting that “many companies focused on experimenting last year [2025] have crossed the threshold into operational AI systems.”

Databricks is at the forefront of adopting LLMs and agent technology, so it is worth following how the company is preparing for the coming explosion in agent adoption. With its latest Lakebase architecture, Databricks is positioning itself for both the OLTP and OLAP workloads that agents need to fully automate data exploration and productionization pipelines.

This architecture eliminates much of the cost, complexity, and lock-in that have defined databases for decades, and it is especially powerful for modern AI and agent-driven workloads, where developers want to launch many instances, experiment freely, and pay only for what they use.

Their latest Genie product is their version of the AI agents that will carry out this work, driven solely by high-level natural language commands tied to business needs.

Genie Code can autonomously carry out complex tasks such as building pipelines, debugging failures, shipping dashboards, and maintaining production systems.

Together, these advances will help bring AI agents to market and simplify much of the data science workflow. But Databricks believes a much bigger wave is ahead, one where agents are unleashed to search for insights by trying many different paths. They call this agentic speculation, “a high-throughput process of exploration and solution formulation for the given task,” which Databricks engineers envision will require redesigning data systems to be agent-first¹.

Overall, as agentic workloads become more and more prevalent, the sheer scale and inefficiencies of agentic speculation will become the bottleneck, and our data systems will need to evolve in response.

The impact on the analytics workload will be profound. Future systems will be designed almost exclusively with AI agents as first-class users, performing exploration, identifying insights, and productionizing their solutions. All of this will be done from raw structured and unstructured data.

Where will the engineering bottlenecks be? Agentic speculation will dramatically increase the velocity of both code generation and analytical queries, vastly increasing the effective memory bandwidth and working set memory requirements of the underlying systems. AI agents will also become more deeply integrated into the data infrastructure to provide intelligent exploration. Do we have the right software and hardware to support this movement? Today, we see massive investment in serving inference from GPUs, but not enough analytics workloads have been accelerated, and this is likely to become a major bottleneck.

The question I am posing is whether current CPU-centered data processing systems will be capable of handling the scale needed to support these new agentic workloads.

CPU-Centered Analytics: Decades of Dominance

In the past few decades, analytic query engines have been very successfully built around the CPU architecture, featuring a growing number of high-performance server cores (in the hundreds), deep cache hierarchies (in tens of MBs), and vectorized operations taking advantage of wider SIMD instruction sets (up to 512 bits wide).

To meet ever-larger volumes of data stored in object stores, these engines moved towards disaggregated architectures that enable elastic scaling of compute and storage. Coupled with open columnar data formats like Parquet and Arrow, this shift has fostered a wide ecosystem of query engines built on a composable data philosophy. It has been a remarkable run.

The milestones speak for themselves. Snowflake’s 2016 architecture pioneered separating compute from storage entirely, proving that cloud-native disaggregation could deliver elastic, multi-tenant analytics at scale. On the single-node analytical engine front, DuckDB brought embeddable, vectorized OLAP to the edge; ClickHouse pushed columnar execution to extreme throughput on commodity hardware; and Umbra/CedarDB pushed the boundary on single-node performance with JIT query compilation via LLVM and a hybrid row/columnar storage engine capable of handling both transactional and analytical workloads on a single system.

The composable data systems movement² has further decoupled execution from storage, built on two key standards: Apache Arrow as the universal in-memory columnar format enabling zero-copy data exchange between engines, and Substrait as a portable, cross-language query plan representation that lets a plan produced by one system be executed by another. On top of these, Velox (Meta) and Apache DataFusion provide reusable, modular physical execution engines that plug into larger systems rather than reinventing the wheel. This composability is now flowing upstream into the dominant distributed compute platforms — Gluten brings Velox-backed native execution into Apache Spark, Apache DataFusion Comet does the same using DataFusion as the native Rust backend, and Presto has adopted Velox as its native C++ evaluation engine — extending the CPU performance frontier by replacing JVM-based execution with optimized native kernels.

CPU vs GPU Hardware Trajectories: The In-Memory Gap Is Shifting in GPU’s Favor

Yet even as software pushes the CPU performance frontier further, the underlying hardware is hitting diminishing returns. AMD’s EPYC Turin, today’s server CPU bandwidth leader, peaks at ~576 GB/s per socket (+25% vs Genoa’s ~461 GB/s) and ~15 TFLOPS FP32 (+~40% vs Genoa’s ~11 TFLOPS), with max DRAM capacity flat at 6 TB across both generations. Intel’s Xeon 6 (Granite Rapids) reaches ~409 GB/s (+33% vs Sapphire Rapids’ ~307 GB/s) and ~10 TFLOPS FP32 (~2× vs Sapphire Rapids’ ~4.8 TFLOPS), with capacity likewise flat at 4 TB. These are meaningful but incremental gains, and capacity has effectively plateaued.

GPUs tell a different story. Driven by the insatiable demand for AI training and inference, NVIDIA’s flagship data-center superchips have advanced at a fundamentally different pace across just three generations: the GH200 (Grace Hopper, 2023), GB200 (Grace Blackwell, 2024), and VR200 (Vera Rubin, 2025). Memory bandwidth grew 9x from 4.9 TB/s to 44 TB/s, FP32 compute grew 3.9x from 67 to 260 TFLOPS, and total unified memory capacity grew 3.4x overall: 624 GB -> 864 GB (+38%) -> 2.1 TB (+143%). That puts VR200 at about 76x the bandwidth of a single EPYC Turin socket.
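The generational ratios quoted above can be re-derived from the per-chip figures in a few lines. This is only a sanity check on the arithmetic, using the numbers as stated in this post; the dictionary layout and the `ratio` helper are my own naming, not part of any library.

```python
# Figures as quoted in this post (bandwidth in GB/s, capacity in GB, FP32 in TFLOPS).
GPU = {
    "GH200": {"bw_gbs": 4_900, "cap_gb": 624, "fp32_tflops": 67},
    "GB200": {"cap_gb": 864},
    "VR200": {"bw_gbs": 44_000, "cap_gb": 2_100, "fp32_tflops": 260},
}
TURIN_BW_GBS = 576  # AMD EPYC Turin, per socket, as quoted in this post


def ratio(new: float, old: float) -> float:
    """Return new/old as a multiplicative factor."""
    return new / old


bw_growth = ratio(GPU["VR200"]["bw_gbs"], GPU["GH200"]["bw_gbs"])
fp32_growth = ratio(GPU["VR200"]["fp32_tflops"], GPU["GH200"]["fp32_tflops"])
cap_growth = ratio(GPU["VR200"]["cap_gb"], GPU["GH200"]["cap_gb"])
cap_step_1 = ratio(GPU["GB200"]["cap_gb"], GPU["GH200"]["cap_gb"])
cap_step_2 = ratio(GPU["VR200"]["cap_gb"], GPU["GB200"]["cap_gb"])
vs_turin = ratio(GPU["VR200"]["bw_gbs"], TURIN_BW_GBS)

print(f"bandwidth GH200 -> VR200: {bw_growth:.1f}x")
print(f"FP32 GH200 -> VR200:      {fp32_growth:.1f}x")
print(f"capacity GH200 -> VR200:  {cap_growth:.1f}x")
print(f"capacity steps:           +{(cap_step_1 - 1) * 100:.0f}%, +{(cap_step_2 - 1) * 100:.0f}%")
print(f"VR200 vs one Turin socket: {vs_turin:.0f}x bandwidth")
```

Keeping the raw figures in one place makes it easy to swap in updated specs when a new generation ships.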

But for in-memory analytics, workloads whose active dataset fits within the fast memory tier (HBM for GPUs, DRAM for CPUs), raw hardware scaling alone does not determine the winner. A three-generation study across three lenses (compute parity, $1M bare-metal budget, equal AWS hourly spend) reveals two distinct trends pulling in opposite directions:

GPU memory bandwidth is the most durable advantage and is not eroding. It crossed above parity against a compute-equivalent CPU cluster between the GH200 and GB200 generations, and holds steady at 3.5–5.4× at equal spend across all three generations. The inflection is generational and structural.
GPU compute, cost, and Perf/W advantages are real but compressing. At equal spend, the FP32 advantage peaked at the H100 generation (~9.5× over Genoa on AWS) and has since fallen to ~5.2× for B200 — not because GPU compute plateaued, but because GPU prices are rising faster than per-chip compute gains. The Perf/W lead is narrowing for the same reason.
The capacity gap is large but driven by price, not physics. CPU DRAM holds 51–91× more memory at compute parity, but that collapses to 8–11× at equal spend. The difference is that DDR costs a fraction of HBM per gigabyte — once you normalize by budget, you are buying far more DDR capacity than the raw chip-count comparison suggests. If HBM prices fall relative to DDR over time, a trend already underway, this ratio will compress further in the GPU’s favor.
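To make the "equal spend" lens concrete, here is a minimal sketch of how a budget-normalized bandwidth ratio can be computed. The `Part` class and both function names are hypothetical. The prices are the upper ends of the ranges quoted in the Assumptions note below, and pairing them with these particular chips is my own assumption, so the result should not be read as reproducing the study's inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    name: str
    unit_price_usd: float  # ASSUMED price per chip or socket
    bandwidth_gbs: float   # peak memory bandwidth per unit, GB/s


def units_for_budget(part: Part, budget_usd: float) -> int:
    """Whole units that fit in the budget (leftover money is ignored)."""
    return int(budget_usd // part.unit_price_usd)


def equal_spend_bandwidth_ratio(gpu: Part, cpu: Part, budget_usd: float) -> float:
    """Aggregate GPU bandwidth divided by aggregate CPU bandwidth at equal spend."""
    gpu_total = units_for_budget(gpu, budget_usd) * gpu.bandwidth_gbs
    cpu_total = units_for_budget(cpu, budget_usd) * cpu.bandwidth_gbs
    return gpu_total / cpu_total


# Illustrative pairing only: bandwidth figures come from this post, prices are
# the top of the quoted ranges and are matched to these chips by assumption.
vr200 = Part("VR200", unit_price_usd=188_000, bandwidth_gbs=44_000)
turin = Part("EPYC Turin", unit_price_usd=14_000, bandwidth_gbs=576)

ratio = equal_spend_bandwidth_ratio(vr200, turin, budget_usd=1_000_000)
print(f"GPU units: {units_for_budget(vr200, 1_000_000)}, "
      f"CPU sockets: {units_for_budget(turin, 1_000_000)}, "
      f"bandwidth ratio at equal spend: {ratio:.1f}x")
```

The same normalization applies to capacity, but that needs a memory price per gigabyte for both DDR and HBM, which this post does not list, so it is left out of the sketch.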

Assumptions: The directional conclusions above rest on specific cost and pricing inputs — rack-normalized GPU superchip prices ($39k–$188k per chip), AMD EPYC socket prices (~$8k–$14k), and AWS on-demand rates from April 2026. Cost figures are the most assumption-sensitive part of the analysis: GPU list prices vary by channel and contract, and cloud rates change frequently. The bandwidth and compute trends are hardware-spec-driven and more stable; the capacity and cost conclusions are pricing-driven and should be read as directional, not precise.

For a more detailed generation-by-generation comparison, see: GPU vs CPU for In-Memory Analytics: Bandwidth Holds as Compute and Cost Advantages Narrow Across Three Generations.

Coming next: The analysis above is scoped strictly to in-memory workloads. A follow-up post will delve into the big data case, where datasets exceed GPU HBM capacity, the HBM bandwidth advantage disappears at the PCIe or NVMe bottleneck, and CPU DRAM’s structural capacity advantage becomes decisive for analytics at scale.

GPU-Accelerated Databases Are Rising in Research and Industry

Unsurprisingly, the database research community has been paying close attention since 2020, with top conferences like SIGMOD and VLDB regularly accepting papers evaluating and building GPU-accelerated databases — both hybrid CPU-GPU and fully GPU-native. Recent highlights include:

Rethinking Analytical Processing in the GPU Era³ (CIDR 2026) — Sirius, a GPU plugin for DuckDB that rethinks analytical processing natively on the GPU.
Scaling GPU-Accelerated Databases beyond GPU Memory Size⁴ (VLDB 2025) — tackles the fundamental GPU memory capacity bottleneck with a hybrid CPU-GPU filtering strategy, achieving a 3.5× speedup over SQL Server at 1 TB scale on a single A100.
GPU Database Systems Characterization and Optimization⁵ (VLDB 2024) — systematically characterizes GPU database performance bottlenecks and proposes optimizations for modern workloads.
A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics⁶ (SIGMOD 2020) — proposes Crystal, a GPU query library, and shows that full query GPU speedup can exceed the memory bandwidth ratio (up to 25×) due to CPU vectorization limitations.

On the industry side, 2025 saw GPU acceleration reach mainstream data tools:

GPU execution landed in CPU-native engines like Velox⁷ and Polars⁸.
The RAPIDS Accelerator for Apache Spark⁹ enabled faster migration to GPU-accelerated distributed data engineering and analytics workloads.
Voltron published the design paper for Theseus¹⁰, their petabyte-scale GPU accelerated query engine.

Despite genuine progress, building correct and performant GPU implementations of the full relational algebra remains enormously difficult. Managing GPU memory limits, PCIe transfer bottlenecks, operator fusion, and full SQL coverage is a hard engineering problem with no easy shortcut.
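A back-of-the-envelope model shows why the transfer bottleneck matters so much. The sketch below treats a scan as purely bandwidth-bound and ignores compute, latency, and compression. The DRAM and HBM figures are the Turin and GH200 numbers from this post; the 64 GB/s host link is my assumption (roughly a PCIe Gen5 x16 class link in one direction), and the function names are hypothetical.

```python
def scan_seconds(data_gb: float, bandwidth_gbs: float) -> float:
    """Time to stream data_gb once at a sustained bandwidth, ignoring compute."""
    return data_gb / bandwidth_gbs


def gpu_scan_seconds(data_gb: float, hbm_gbs: float, link_gbs: float,
                     *, resident: bool, overlap: bool = False) -> float:
    """Scan time on a GPU, optionally paying for a host-to-device transfer.

    resident=True  : data already sits in HBM, only the HBM scan is paid.
    overlap=True   : transfer and scan are perfectly pipelined (best case).
    overlap=False  : transfer completes before the scan starts (worst case).
    """
    hbm_time = scan_seconds(data_gb, hbm_gbs)
    if resident:
        return hbm_time
    link_time = scan_seconds(data_gb, link_gbs)
    return max(hbm_time, link_time) if overlap else hbm_time + link_time


DATA_GB = 100.0
CPU_DRAM_GBS = 576.0   # EPYC Turin per-socket figure from this post
GPU_HBM_GBS = 4_900.0  # GH200 figure from this post
HOST_LINK_GBS = 64.0   # ASSUMPTION: PCIe Gen5 x16 class link, one direction

cpu_time = scan_seconds(DATA_GB, CPU_DRAM_GBS)
gpu_resident = gpu_scan_seconds(DATA_GB, GPU_HBM_GBS, HOST_LINK_GBS, resident=True)
gpu_streamed = gpu_scan_seconds(DATA_GB, GPU_HBM_GBS, HOST_LINK_GBS,
                                resident=False, overlap=True)

print(f"CPU DRAM scan:           {cpu_time:.3f} s")
print(f"GPU scan, HBM-resident:  {gpu_resident:.3f} s")
print(f"GPU scan, streamed in:   {gpu_streamed:.3f} s")
```

Under these assumptions the GPU wins comfortably when the data is already in HBM and loses to a single CPU socket once every byte has to cross the host link, even with perfect overlap. That is the gap that memory management and data placement in a GPU engine have to close.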

NVIDIA’s Moat: RAPIDS and libcudf

NVIDIA has seen this challenge coming for a while and has been systematically building a solution through its RAPIDS AI¹¹ ecosystem, first launched in 2018¹², well before the generative AI and LLM revolution had taken hold. At its core is a little-known C++ library, libcudf (and its sister libraries), a highly optimized, native GPU foundation that underpins virtually all GPU-accelerated analytic systems being built today.

It is the de facto single-node physical operator infrastructure in this space, and understanding it is the key to understanding how GPU databases actually work. And yet, despite its central role, in-depth technical coverage of libcudf’s internals is surprisingly scarce. Most available material stays at the user-facing API level, leaving critical questions about kernel design, memory management, and performance characteristics largely undocumented outside of the source code itself.

In future posts, I’ll dive deeper into the technical internals of libcudf and answer questions such as:

❓ How does libcudf translate relational operators into parallel GPU kernels?
❓ What is the tooling like to evaluate the library’s performance?
❓ How is libcudf used as a building block for larger distributed systems?
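As a small taste of the first question, one textbook way to express a relational filter as data-parallel steps is stream compaction: evaluate the predicate per element, take a prefix sum to assign output slots, then scatter the surviving rows. The sketch below runs those three steps sequentially in plain Python purely to show the structure. It is a generic pattern under my own naming, not a description of how libcudf implements its kernels.

```python
from itertools import accumulate
from typing import Callable, List, Sequence


def filter_by_compaction(values: Sequence[int],
                         predicate: Callable[[int], bool]) -> List[int]:
    """Filter values using the map / scan / scatter pattern."""
    # Step 1 (map): each element evaluates the predicate independently.
    flags = [1 if predicate(v) else 0 for v in values]
    # Step 2 (scan): an exclusive prefix sum gives each kept element its output slot.
    inclusive = list(accumulate(flags))
    slots = [total - flag for total, flag in zip(inclusive, flags)]
    # Step 3 (scatter): kept elements write to their slot, preserving input order.
    out: List[int] = [0] * (inclusive[-1] if inclusive else 0)
    for value, flag, slot in zip(values, flags, slots):
        if flag:
            out[slot] = value
    return out


kept = filter_by_compaction([5, 12, 7, 30, 1], lambda x: x > 6)
print(kept)
```

Each step has no dependency between elements other than the prefix sum, which is itself a well-studied parallel primitive. That is why this shape maps naturally onto thousands of GPU threads.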

We are at an inflection point. The hardware gap between CPUs and GPUs is no longer a niche concern for ML engineers — it is becoming structurally relevant for anyone building or operating data systems at scale. For in-memory analytics, the shift is already underway: GPU bandwidth has crossed into a durable lead and the remaining gaps in cost and capacity are narrowing, not widening. The harder question — how this plays out when datasets exceed HBM capacity and the bottleneck shifts to PCIe or storage — is the subject of a future post. The research momentum, the industry adoption, and NVIDIA’s deliberate infrastructure investment all point in the same direction: GPU-accelerated analytics is moving from experimental to essential. The open question is not whether it will happen, but how fast the ecosystem matures and how much of the existing CPU-centric stack it displaces versus complements.

Excited about the momentum of GPU-accelerated analytics? Have questions about the software or hardware stack? Let me know below! 👇

References

1. Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First. https://arxiv.org/pdf/2509.00997
2. The Composable Data Management System Manifesto. VLDB 2023. https://www.vldb.org/pvldb/vol16/p2679-pedreira.pdf
3. Rethinking Analytical Processing in the GPU Era. https://arxiv.org/pdf/2508.04701
4. Scaling GPU-Accelerated Databases beyond GPU Memory Size. VLDB 2025. https://vldb.org/pvldb/vol18/p4518-li.pdf
5. GPU Database Systems Characterization and Optimization. VLDB 2024. https://vldb.org/pvldb/vol17/p441-cao.pdf
6. A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics. SIGMOD 2020. https://arxiv.org/pdf/2003.01178
7. Accelerating Large-Scale Data Analytics with GPU-Native Velox and NVIDIA cuDF. https://developer.nvidia.com/blog/accelerating-large-scale-data-analytics-with-gpu-native-velox-and-nvidia-cudf/
8. RAPIDS Adds GPU Polars Streaming, a Unified GNN API, and Zero-Code ML Speedups. https://developer.nvidia.com/blog/rapids-adds-gpu-polars-streaming-a-unified-gnn-api-and-zero-code-ml-speedups/
9. RAPIDS Accelerator for Apache Spark. https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/apache-spark-3/
10. Theseus: A Distributed and Scalable GPU-Accelerated Query Processing Platform Optimized for Efficient Data Movement. https://arxiv.org/pdf/2508.05029
11. RAPIDS AI. https://rapids.ai/learn-more/
12. GPU-Accelerated Data Analytics & Machine Learning (RAPIDS AI Launch, 2018). https://developer.nvidia.com/blog/gpu-accelerated-analytics-rapids/