Cherif Jazra

On Pope Leo XIV’s Letter on Artificial Intelligence

2026-06-03T07:00:00+00:00

On May 15, 2026, Pope Leo XIV released an encyclical letter expressing his thoughts on Artificial Intelligence for all of us to reflect on. This couldn’t have come at a better time, because today more than ever, the fast development of AI technology has been challenging me to reflect more deeply and with more urgency on what are the most important things to focus my life on, as a parent, a citizen, an engineer. I would thus like to relay his letter and offer some of my thoughts, especially on the first half of it, and in the humble spirit that claims no full knowledge of the truth.

The letter is long but worth a read. In it, Pope Leo XIV challenges us to think about whether the promise of AI technology will be pursued under the delusion of human infallibility and self-sufficiency or in the spirit of an authentic and responsible stewardship that honors and preserves. Rather than humans sacrificing their dignity and being subservient to technology, it is technology that must be directed to serve human dignity. The pope thus spends a good portion of his letter explaining the main principles developed by the Catholic Church to honor and preserve the dignity of every human being, the principle of common good, of universal destination of goods, of subsidiarity, of solidarity, and of social justice. “The fundamental dignity of each person, therefore, is neither acquired nor earned, nor does it need to be justified [..] Every human person possesses an infinite dignity, inalienably grounded in his or her very being, which prevails in and beyond every circumstance, state, or situation the person may ever encounter” [53]. The same spirit pervades the sacred American creed written 250 years ago, that all “men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness”.

Today, the challenges facing humanity are immense because society is dominated by the modern technological way of being. In order to confront this reality, we must be clear-eyed about what technology truly is, not the sum of all created technical things but a reflection of our own modern way of revealing the world, our “tendency to let the logic of efficiency, control, and profit alone shape personal, social and economic decisions”[92](emphasis mine). A new historical reality has set in, one in which the dominant actors driving these technological breakthroughs are private corporations large enough to mediate almost all societal interactions and with enough financial resources to bend the democratic system in their favor. As Pope Leo XIV says, “In many cases within the digital context, control over platforms, infrastructure, data and computing power does not rest with States, but with major economic and technological actors. These entities effectively set the conditions for access, determine the rules of visibility and shape the very possibilities for participation. When such power is concentrated in the hands of a few, it tends to become opaque and evade public oversight, increasing the risk of distorted forms of development that give rise to new dependencies, exclusions, manipulations and inequalities.”[95]

The impending arrival of what is now called Artificial General Intelligence (AGI) technology has been dazzling the world, feeding its craving for unlimited superhuman capabilities on demand, but at the same time auguring an era that deeply questions the role of humans in society. Pope Leo XIV warns us not to confuse artificial intelligence with human intelligence and not to succumb to the hubris of the Tower of Babel. AI systems are very useful but they are not human. They “do not undergo experiences, do not possess a body, do not feel joy or pain, do not mature through relationships and do not know from within what love, work, friendship or responsibility mean. Nor do they have a moral conscience, since they do not judge good and evil, grasp the ultimate meaning of situations, or bear responsibility for consequences”[99]. All that AI is trained to do is to imitate language and simulate empathy and understanding. It feels nothing, it is simply a cold and empty “form of statistical adaptation based on data and feedback, which can be very effective, but does not imply inner growth”. The distinction between human and artificial intelligence is so fundamental that it ought not even be questioned, and yet here we are so lost to technological thinking that this fact is no longer “obvious” and needs to be argued. In the letter, the pope has a message of vigilance for those developing this technology because he understands technology is not a morally neutral instrument. All human creations embody in them the values of their creator. “Every technical tool embodies choices and priorities through what it measures, ignores and optimizes, and how it classifies people and situations.”[104] And for this reason, with AI even more than any other technology, it is important to understand and “examine how that system is designed and what vision of the human person and society is embedded in the data and models that guide it” [104]

Pope Leo XIV uses the provocative concept of Disarming AI to wake us up to the reality of a race to the bottom in an AI-weaponized world with catastrophic consequences. This expression brings our mind back to the time when the atomic bomb was detonated to end WWII and destroyed countless lives in the most horrific way possible. Shortly after, humanity realized that it had created a technology that could bring about its own end. The evident potential for mutual destruction and annihilation was so overwhelming that it awakened global consciousness and sustained the movement for nuclear disarmament during the Cold War. We have not yet seen today’s AI Hiroshima moment. Still in the future lies the threat of uncontrollable autonomous AI weapons or society-wide AI automation leading to massive unemployment and a breakdown of social relations. These dangers are much discussed today, but there is one more that the pope emphasizes and that I really appreciate, one that is much less talked about and less conceptualized. This is the slower, subtler, and more profound way AI will impoverish our human existence as it presents itself as the ultimate possibility of liberation from human suffering. Trans-humanism and post-humanism [116] are today’s ideologies pushing furthest in this direction. In reality, if left to seep through society without any sense of the impending danger, AI technology will turn out to be the biggest challenge to the human creative spirit and pursuit of happiness.

All my readings and experience indicate to me that for millennia now, from the early Greek philosophical writings, to the first-century apostolic gospels and epistles expressing the radical Christian message, to the 20th-century existential philosophies responding to the calamities of two World Wars, human being at its core has been understood to mean overcoming a fallen, average, and dispersed absorption in worldly matters. But overcoming how? Certainly not by removing the body altogether and replacing it with a machine (or by escaping to Mars), but by resolutely taking up the responsibility for our ownmost possibilities, freed to face the anxiety of our finitude, embedded in our communities, rooted in a place of belonging, and, from the Christian point of view, grounded in the unity of being that is God. The most optimistic supporters of AI would have us all believe our salvation lies in a human-created technology. Pope Leo XIV, however, understands this not to be true: there is no final solution to the problem of human existence, the danger lies in thinking that there is, and it instead falls as a task to every generation to renew human existential possibilities and make them available to the largest number of people: “I am convinced that the concrete way of living out social relationships in the light of the Gospel is not established once and for all, but remains a task entrusted, from generation to generation, to the Christian community.”

There is more in this letter that is worth pondering, and I hope to touch on it in future posts after more reflection and reading. I would just like to close with the special message that Pope Leo XIV addressed to developers and engineers working on AI technology. I would extend it to everyone working in technology in general, not just engineers, whether in Silicon Valley or around the world, as something worth asking oneself about (emphasis mine):

“I wish to address a special appeal to those who develop artificial intelligence. In one sense, technological innovation can represent human participation in the divine act of creation. Developers, therefore, bear a particular ethical and spiritual responsibility, for every design choice reflects a vision of humanity. Just as the creator of an artistic or literary work must consider the values it conveys, so developers are called to embed values in their projects with due seriousness: with transparency, responsibility toward affected communities and careful attention to ensuring that what is being cultivated is a genuine good” [111].

Encyclical Letter, Magnifica Humanitas

Short speech by the pope summarizing his thoughts on Disarming AI

Inside RAPIDS libcudf: a deep dive into a simple GroupBy aggregation

2026-05-13T07:00:00+00:00

Traditional OLAP database execution engines were designed for the CPU: a handful of powerful cores, deep cache hierarchies, and sequential or lightly vectorized processing. In the past decade, however, GPU performance and functionality have greatly advanced, driven largely by the generative AI revolution, to the point of becoming a viable platform for running relational workloads. GPU-accelerated data systems that can run queries orders of magnitude faster than their CPU equivalents will enable the next big revolution in analytics, fuelled by AI agents. Their architecture and programming models are, however, different enough from the CPU that specialized algorithms must be developed to achieve high performance on analytical workloads. This post aims to illuminate the kind of algorithms NVIDIA’s RAPIDS project has built to close that gap. It is the first in a series exploring libcudf, NVIDIA’s core DataFrame library for single-node GPU data processing.

TL;DR

The post dives into libcudf's hash-aggregate fast path for a GROUP BY … SUM query with low per-block key cardinality: each block must see at most 128 distinct grouping values to stay on the shared-memory path.
For the dataset analysed, libcudf uses a two-level shared-memory strategy to tame atomic contention. Each CUDA block deduplicates its rows in a private on-chip hash set before touching global memory, so the device-wide hash set sees at most one insert/lookup per distinct key per block and the output column receives at most one atomic add per distinct key per block, not one per row.
The algorithm runs in four sequential phases. (1) Initialize the device hash set with a sentinel value; (2) map every row to a block-local rank and elect cross-block key representatives via CAS into the device hash set; (3) retrieve unique keys from the hash set and rewrite index arrays to dense output offsets; (4) accumulate partial sums in a shared-memory accumulator array, then flush one atomic add per group per block to the output column.
For 100M rows, the dominant cost is data structure overhead, not compute. A significant fraction of total kernel time is spent initialising the oversized hash table (allocated at 2× input size) and scanning it in the Interlude, costs that are independent of key cardinality and grow with input size. Future posts will explore performance at higher scale factors.

This report was produced with the help of AI agents.

1. Introduction: Relational Algebra on GPUs

Mapping relational algebra onto GPUs introduces a massive semantic gap compared to CPU. Operators like joins, aggregations, and sorts must be entirely reimagined for the GPU SIMT (Single Instruction, Multiple Thread) architecture. Conventional algorithms natively optimized for CPUs often hit brutal bottlenecks on GPUs due to thread divergence, uncoalesced memory access, and severe penalties for global synchronisation.

To bridge this runtime gap, NVIDIA developed libcudf: a C++ library implementing foundational DataFrame operations and relational primitives natively on the GPU. It has emerged as the de facto execution framework for a massive portion of the accelerated data ecosystem, underpinning projects like Spark RAPIDS, Dask-cuDF, Velox CuDF and numerous independent database research efforts.

The central questions driving this exploration are:

How does libcudf translate fundamental relational operators into massively parallel GPU kernels?
What are its structural strengths, and where does the GPU memory/compute model impose hard limits?
What does the developer tooling look like, and how does one reason about its hardware utilization?

To answer these questions, I begin by identifying at a high level which algorithms underlie key primitives. I take the simple groupby aggregation as an example and break it down into components, showing how the library implements them. I identify the main data structures and provide illustrations that show the movement of data as the algorithm runs. While some runtime information is provided for the 100M-row dataset in this post, the next one will focus on an actual run and review what can be learned from the debugging tools.

The Aggregation Problem

GROUP BY is one of the foundational operators in relational database systems. Its job is to partition an input relation into disjoint groups and then reduce each group to a single output row by applying one or more aggregate functions. For example, a simple question a retail merchant might ask is what the breakdown of the total price of orders by order status is, in order to identify the dollar amount lost to orders that were not completed and to investigate improvements. We will use this example in the rest of the blog post.

In a query execution engine the GROUP BY physical operator must solve two logical subproblems:

Key partitioning: determine, for every input row, which output group it belongs to. This is effectively a dictionary-encoding problem: map an arbitrarily-typed key (integer, string, composite) to a dense integer group-id in [0, K) where K is the number of distinct keys.
Aggregation: reduce all rows assigned to the same group-id to a single scalar per aggregate column (e.g., sum all values from column C for group-id 3), using an aggregate function such as SUM, COUNT, MIN, MAX, or AVG.
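
The two subproblems can be sketched on the host in plain C++. This is a minimal illustration rather than libcudf code: `encode_keys` and `sum_by_group` are hypothetical names, and a `std::unordered_map` stands in for whatever dictionary structure an engine would actually use.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Subproblem 1 (key partitioning): map each key to a dense group-id in [0, K).
// Group-ids are handed out in first-seen order. The distinct keys, indexed by
// group-id, are written to `distinct_keys`.
std::vector<int> encode_keys(const std::vector<std::string>& keys,
                             std::vector<std::string>& distinct_keys) {
  std::unordered_map<std::string, int> dictionary;
  std::vector<int> group_ids;
  group_ids.reserve(keys.size());
  distinct_keys.clear();
  for (const auto& key : keys) {
    auto [it, inserted] =
        dictionary.try_emplace(key, static_cast<int>(distinct_keys.size()));
    if (inserted) distinct_keys.push_back(key);  // first time this key is seen
    group_ids.push_back(it->second);
  }
  return group_ids;
}

// Subproblem 2 (aggregation): reduce all rows sharing a group-id with SUM.
std::vector<double> sum_by_group(const std::vector<int>& group_ids,
                                 const std::vector<double>& values,
                                 std::size_t num_groups) {
  assert(group_ids.size() == values.size());
  std::vector<double> sums(num_groups, 0.0);
  for (std::size_t row = 0; row < values.size(); ++row) {
    assert(group_ids[row] >= 0 &&
           static_cast<std::size_t>(group_ids[row]) < num_groups);
    sums[group_ids[row]] += values[row];
  }
  return sums;
}
```

For the keys O, F, O, P, F this yields group-ids 0, 1, 0, 2, 1 and K = 3, after which the aggregation step only ever touches three accumulators.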

These two subproblems are algorithm-agnostic: the same logical goals can be achieved via two fundamentally different physical strategies.

The sort-aggregate approach sorts all rows by key first, after which identical keys are contiguous and can be reduced in a single scan; comparison sort costs O(n log n), while radix sort can be linear for fixed-width keys.
The hash-aggregate approach builds a hash table mapping each distinct key to its running accumulator, updating it in expected O(n) time, no sort required, but concurrent writes to shared buckets introduce contention.
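
As a single-threaded point of comparison, the two strategies might look as follows in standard C++. The function names and the `Row` alias are illustrative assumptions; neither function reflects how libcudf implements these paths, and the contention issue mentioned above does not arise here because there is only one thread.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Row = std::pair<std::string, double>;  // (key, value)

// Sort-aggregate: sort by key, then reduce each contiguous run of equal keys
// in a single scan. The output comes back ordered by key.
std::vector<Row> sort_aggregate_sum(std::vector<Row> rows) {
  std::stable_sort(rows.begin(), rows.end(),
                   [](const Row& a, const Row& b) { return a.first < b.first; });
  std::vector<Row> out;
  for (const Row& row : rows) {
    if (out.empty() || out.back().first != row.first) {
      out.emplace_back(row.first, 0.0);  // a new run of equal keys starts here
    }
    out.back().second += row.second;
  }
  assert(out.size() <= rows.size());  // at most one group per input row
  return out;
}

// Hash-aggregate: one pass, each key hashes to its running accumulator.
// No sort is needed, and the output order is unspecified.
std::unordered_map<std::string, double> hash_aggregate_sum(
    const std::vector<Row>& rows) {
  std::unordered_map<std::string, double> accumulators;
  for (const Row& row : rows) accumulators[row.first] += row.second;
  return accumulators;
}
```

Both return the same sums; they differ in cost profile and in whether the output arrives ordered by key.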

The algorithm must also be adapted to the number of keys and columns being operated on as well as the kind of data types used for partition and aggregation. Simple primitive types like float and int come with hardware and basic language support, while more advanced ones like string and datetime require specialized handling.

Recently, a big use case has been supporting user-provided functions (UDFs) for aggregation, which come with their own challenges, mainly the requirement to compile them first into low-level code before running them efficiently.

CPU vs GPU Challenges

Various kinds of CPU-focused solutions have been developed over decades and evolved with CPU cache hierarchies, branch prediction, and thread-level parallelism. GPUs introduce a different execution model: aggregation algorithms must explicitly manage the memory hierarchy, limit global atomic contention, minimize warp divergence, and keep key-comparison logic executable entirely on-device.

Topic	CPU	GPU
Parallelism model	Few powerful cores, usually with private per-thread or per-core aggregation state.	Thousands of threads run together, so shared output state can become a serialization bottleneck.
On-chip memory management	Hardware caches absorb much of the reuse automatically.	Shared memory is small, explicit, and central to fast aggregation.
Memory access pattern	Random probes mostly stall the issuing core.	Scattered warp accesses can waste memory bandwidth. Coalesced access is a must.
Atomic contention	Engines avoid shared state with private accumulators and merge phases.	Low-cardinality groups can serialize thousands of global atomic updates to the same output slot.
Instruction divergence	Branchy probe loops mainly affect one core's pipeline prediction logic.	Divergent probe lengths serialize lanes within a warp and reduce overall utilization.
Output size & memory provisioning	Hash tables and output buffers can grow during execution.	Buffers are usually sized before kernels launch.
Key comparison & hashing	Hash and equality functions are ordinary host code.	Comparators and hashers must be device-callable.
Spill & bounded memory	Spill to disk or remote storage is a mature execution path.	Fast paths generally assume working state fits in device memory; GPU spill support is still maturing.
Large-scale & distributed execution	Distributed engines have mature shuffle, spill, and fault tolerance.	GPU clusters add high bandwidth but GPU-to-GPU shuffle and fault-tolerance tooling is still maturing.

I’ll focus on the high-level algorithm libcudf uses to compute a simple Groupby + Sum aggregation on the GPU, taking you through its flow from the initialization of data structures to identifying unique groupings, and aggregating the data into final output buffers. Throughout, I’ll highlight the libcudf CUDA kernels used, how they take advantage of the GPU’s limited but very fast shared memory to update intermediate results in a massively parallel way, and use block synchronization to ensure threads remain in lockstep. I’ll also describe some of the other libraries libcudf relies on, such as cuCollections for the static sets and hashmaps, and the Thrust library for lower-level data-parallel algorithms like scatter and for_each. Future posts will provide a more in-depth look at an actual run of the algorithm and its performance, also introducing a new visualization tool I have developed to understand the flow of interaction between CPU and GPU.

2. Setup: Software and Hardware used

Library Versions

The investigation was performed on RAPIDS v26.02.00, released on February 4, 2026. The cuCollections dependency used is pinned to commit d3701ae.

The code examples in this post link to my own annotated forks that include additional comments to aid understanding:

My cuDF fork: github.com/jazracherif/cudf, v26.02.00_analysis
My cuCollections fork: github.com/jazracherif/cuCollections, v26.02.00_analysis

GB10 Device

This analysis was performed on the DGX Spark running the GB10 NVIDIA GPU (Blackwell). Here are some key hardware specs relevant to this analysis to keep in mind:

Spec	Value
Architecture	Blackwell (SM 12.1)
Streaming Multiprocessors	48 SMs
CUDA cores	6,144 (128 per SM)
Shared memory per SM	100 KB (max per block: 99 KB)
L2 cache	24 MB
Memory	128 GB LPDDR5x, unified (CPU + GPU share the same pool, zero-copy via ATS)
Memory bandwidth	~273–301 GB/s
Host CPU	1× Grace (20-core Arm: 10 Cortex-X925 + 10 Cortex-A725)

The unified memory architecture means there is no PCIe transfer step for the input table: the Parquet file is read directly into the shared pool and is immediately accessible by both CPU and GPU. The memory bandwidth figure (~273–301 GB/s) will be the primary bottleneck for this workload, as the hash-set initialization and Interlude scans are purely bandwidth-bound.

Code Invoked

The input dataset is the orders table from TPC-H with 100 million rows (1.8 GB Parquet file). Future posts will explore much larger datasets. The libcudf C++ code below is invoked on an ingested table stored in the Apache Arrow format:

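#include <cudf/aggregation.hpp>
#include <cudf/groupby.hpp>
#include <cudf/table/table_view.hpp>

#include <vector>

// Assume the table is already loaded into GPU memory
cudf::table_view tv = cudf_table->view();

// Create the GroupBy operator by specifying the `key` column to group on
cudf::groupby::groupby gb(cudf::table_view{{tv.column(src.key_col)}});

// Create one aggregation request per value column; here a single SUM.
// aggregation_request owns its aggregations through unique_ptr and cannot be
// copied, so it is built in place inside the vector.
std::vector<cudf::groupby::aggregation_request> requests(1);
requests[0].values = tv.column(src.value_col);
requests[0].aggregations.push_back(cudf::make_sum_aggregation<cudf::groupby_aggregation>());

// Aggregate on the default stream
auto [result_keys, agg_results] = gb.aggregate(requests);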

The equivalent SQL command is:

SELECT   o_orderstatus,
         SUM(o_totalprice) AS total_price
FROM     orders
GROUP BY o_orderstatus;

The goals are:

Understand how libcudf selects and executes the groupby sum path using a string key and float64 column.
Break down the algorithm into understandable pieces and show where in the code these are implemented.
Trace each main GPU kernel launch to its source location in cuDF and associated libraries.
Explain the two-level shared-memory aggregation strategy that libcudf uses to reduce global atomic contention.

In a followup post, I will cover the following:

Capture a real run of the algorithm with Nsight Systems on GB10 to confirm kernel names, ordering, and timing on a 100M-row workload.
Review each kernel's performance with Nsight Compute.
Look at the flow of messages using a custom viewer I have developed.

With 100M input rows that need to be reduced into K distinct o_orderstatus values, a naïve GPU approach, one global atomic-add per row directly into the output column, will suffer from severe memory contention, particularly when cardinality is low. cuDF avoids this by staging the reduction through shared memory.

3. Architecture at a Glance: The Four-Phase Data Flow

The diagram below shows the overall execution of the aggregation from the GPU’s perspective (from left to right), covering kernels and main data structures used.

Use it as a map that links together all the details. The sections below zoom into each region: the kernel implementations (§5), a step-by-step trace (§6), and more details about the global_set structure (§8).

Four algorithmic steps are highlighted:

Initialization
Block Level Membership and Index Mapping
Interlude: Dense output index remapping
Final step: Shared-memory accumulation + flush

The diagram highlights the different data structures stored in global memory and those in the shared memory, with reference to the 100M row dataset.

Global memory (visible to all blocks):

Input: o_orderstatus, o_totalprice (N=100M rows)
Intermediary: global_set (800 MB, 200M slots each 4 bytes), local_mapping_indices (one int32 per row), global_mapping_indices (128 int32 entries per block)
Output: total_price (K values)
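
The global-memory footprints quoted above follow from simple arithmetic, sketched here as compile-time checks. The helper names are hypothetical, and the block count for global_mapping_indices is left as a parameter because it depends on the launch configuration, which is not stated in this post.

```cpp
#include <cstdint>

constexpr std::uint64_t kBytesPerInt32 = sizeof(std::int32_t);

// global_set is allocated at 2x the input row count, one 4-byte slot each.
constexpr std::uint64_t global_set_bytes(std::uint64_t num_rows) {
  return 2 * num_rows * kBytesPerInt32;
}

// local_mapping_indices holds one int32 per input row.
constexpr std::uint64_t local_mapping_bytes(std::uint64_t num_rows) {
  return num_rows * kBytesPerInt32;
}

// global_mapping_indices holds 128 int32 entries per block.
constexpr std::uint64_t global_mapping_bytes(std::uint64_t num_blocks) {
  return 128 * num_blocks * kBytesPerInt32;
}

static_assert(global_set_bytes(100'000'000) == 800'000'000,
              "800 MB for 100M rows");
static_assert(local_mapping_bytes(100'000'000) == 400'000'000,
              "400 MB for 100M rows");
```

Under these figures the intermediaries (about 1.2 GB before global_mapping_indices) are dominated by structures whose size scales with N rather than with K.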

Shared memory (private to each block, discarded after the kernel):

shared_set / __shared__ slots[128]: phase 2 only; the backing store of the cuco::static_set_ref hash set, used by find_local_mapping to probe for key existence and assign block-local ranks via CAS
shared_set_indices[128]: phase 2 only; parallel flat array mapping each block-local rank to the first input row that claimed it (rank → representative row-index)
shmem_agg_storage: phase 4 only; dynamic partial-sum accumulator storage indexed by block-local rank. For one double SUM over 128 ranks, the logical accumulator is 1 KB before alignment and any additional per-aggregation layout overhead.

4. The GroupBy.Sum() Algorithm

Initialization:
- Before any row is processed, a global hash set (global_set) is sized at 2x the input (200M slots) and filled with a SENTINEL value (launching a cub::detail::for_each kernel on the GPU side).
```
global_set[0..200M) <- SENTINEL
```
Block Level Membership and Index Mapping:
- This phase reads the key column and determines which o_orderstatus group every input row belongs to. At this point the grouping is only local to a block. Each CUDA block uses a private hash table in shared memory to map its rows to at most 128 distinct keys, assigning each a block-local rank (see “What is a block-local rank?” below).
- For each new key, using CAS (the compare-and-swap atomic instruction), it atomically elects a single representative row where that key was first seen and inserts it into global_set to be used across all blocks.
- Two index arrays are maintained for use in later phases: 1) the local_mapping_indices stores the block-local rank value allocated to each row later used to generate the per block aggregation, and 2) the global_mapping_indices stores the winning row for each rank slot of that block, later turned into a global ranking.
```
local_mapping_indices[row]        -> the block-local rank assigned to row  
global_mapping_indices[blk*128+r] -> The representative row index for each rank slot in each block
global_set insert/find(rep_row)   -> The winning representative row at the key hash slot
```
Interlude: Dense Output Index Remapping:
- Between the two main kernels, a set of device operations scans global_set (via retrieve_all / cub::DeviceSelect::If) to collect the K representative row-indices, then builds a dense output ordering (0..K-1) via thrust::scatter, and rewrites global_mapping_indices in-place via thrust::for_each_n so every block agrees on the same output slot for each group. These algorithms launch over the full input or hash-set range even though only K ≪ N slots are populated, trading excess thread count for uniform, divergence-free execution that saturates memory bandwidth.
```
global_mapping_indices[blk*128+r] -> dense output index in total_price[0..K-1]
```
Shared-Memory Accumulation + Global Reduction:
- Now that membership and output ordering are known, each block accumulates its assigned o_totalprice values entirely within shared memory (no cross-block, no global atomics yet).
- Each block then flushes only up to 128 partial o_totalprice sums to the correct output slot using the remapped global_mapping_indices, one atomic-add per distinct o_orderstatus value per block rather than one per row.
- For this dataset the number of global atomics is reduced by a factor of roughly 100M / (num_blocks × avg_labels_per_block) compared to the naïve approach.
- The logic is repeated as many times as needed, depending on the number of aggregation outputs to produce and the shared memory available within a block.
```
r = local_mapping_indices[row]
shmem_price_accum[r] += o_totalprice[row]
global_label_idx = global_mapping_indices[blk*128+r]
total_price[global_label_idx] += shmem_price_accum[r]
```
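
To make the reduction in global atomics concrete, the accumulate-then-flush step can be simulated sequentially on the host. This is a sketch under simplifying assumptions: `two_level_sum` and `TwoLevelResult` are hypothetical names, rows are split into contiguous chunks of `rows_per_block` (the real kernel's row-to-block assignment may differ), dense group ids are used directly instead of going through local ranks and global_mapping_indices, and a counter stands in for each global atomic add.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

struct TwoLevelResult {
  std::vector<double> totals;      // final per-group sums, indexed by group id
  std::size_t global_atomic_adds;  // number of simulated global atomic adds
};

// group_ids[row] is the dense output group of each row, as if the earlier
// phases had already run. The naive approach would need values.size() atomics.
TwoLevelResult two_level_sum(const std::vector<int>& group_ids,
                             const std::vector<double>& values,
                             std::size_t num_groups,
                             std::size_t rows_per_block) {
  assert(group_ids.size() == values.size() && rows_per_block > 0);
  TwoLevelResult result{std::vector<double>(num_groups, 0.0), 0};
  for (std::size_t begin = 0; begin < values.size(); begin += rows_per_block) {
    const std::size_t end = std::min(begin + rows_per_block, values.size());
    // Stand-in for the block's private shared-memory accumulators.
    std::unordered_map<int, double> block_partials;
    for (std::size_t row = begin; row < end; ++row) {
      block_partials[group_ids[row]] += values[row];
    }
    // Flush: one simulated global atomic add per distinct group in the block.
    for (const auto& [group, partial] : block_partials) {
      result.totals[group] += partial;
      ++result.global_atomic_adds;
    }
  }
  return result;
}
```

With eight rows, two groups, and `rows_per_block = 4`, the sketch issues 4 simulated global atomic adds instead of 8; the gap widens as blocks cover more rows per distinct key.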

Note: Phase 2 and Phase 4 communicate through the index arrays produced by Phase 2 and rewritten by the Interlude; no inter-block GPU synchronisation is needed between Phase 2 and Phase 4.

What is a block-local rank?

The key idea behind the fast path is that each CUDA block first locally deduplicates the keys among the rows it processes, then connects those per-block results to the final global output groups. This is done via block-local ranks.

Each CUDA block assigns a small integer, starting from 0, to each distinct o_orderstatus value the first time it is encountered among that block’s assigned rows. That integer is the block-local rank: a dense index into the block’s private shared-memory accumulator array. In the fast path, valid ranks are 0..127. This numbering is private to this block; another block may assign rank 0 to “O” or any other o_orderstatus.

The Interlude phase converts the representative row-indices stored in global_mapping_indices after Phase 2 into final dense global output indices (0..K-1), where K is the total number of unique keys across all rows. This ensures that all blocks agree on the same output slot for each group before phase 4 runs.
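
A sequential stand-in for the rank assignment within one block could look like this. It is only a sketch: `assign_block_local_ranks` and `BlockMapping` are hypothetical names, a `std::unordered_map` plays the role of shared_set, and ranks follow first-seen order, whereas on the GPU the order depends on which thread wins each insertion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

constexpr std::size_t kMaxRanksPerBlock = 128;  // fast-path limit from the post

struct BlockMapping {
  std::vector<int> local_rank;       // per row of the block: its block-local rank
  std::vector<std::size_t> rep_row;  // per rank: first input row that claimed it
};

// Assign block-local ranks to the rows [begin, end) handled by one block.
BlockMapping assign_block_local_ranks(const std::vector<std::string>& keys,
                                      std::size_t begin, std::size_t end) {
  assert(begin <= end && end <= keys.size());
  std::unordered_map<std::string, int> rank_of_key;  // stand-in for shared_set
  BlockMapping mapping;
  for (std::size_t row = begin; row < end; ++row) {
    auto [it, is_new] = rank_of_key.try_emplace(
        keys[row], static_cast<int>(mapping.rep_row.size()));
    if (is_new) mapping.rep_row.push_back(row);  // row represents the new rank
    mapping.local_rank.push_back(it->second);
  }
  // Beyond this limit the real kernel would take the fallback path instead.
  assert(mapping.rep_row.size() <= kMaxRanksPerBlock);
  return mapping;
}
```

For keys O, F, O, P, F, O split into two blocks of three rows, the first block gives "O" rank 0 while the second gives it rank 2, which is why a later remapping to shared output indices is needed.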

5. Deep Dive into each phase: Kernels and Data Structures

Now that the data flow and hash-set mechanics are established, this section revisits the same phases at the level of the actual kernels and helper functions. All the kernels below are invoked from a single host function, compute_single_pass_aggs().

Phase 1: Hash set initialization

Before any row is processed, a cub::detail::for_each kernel sweeps all 200M slots of global_set and writes the SENTINEL value (typically INT32_MAX) to each one. This establishes the “empty” state that insert_and_find’s CAS loop uses to distinguish occupied from free slots. At 4 bytes × 200M slots = 800 MB of writes, this kernel is purely memory-bandwidth-bound (~4.1 ms on this dataset).
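
The two figures above imply an effective write bandwidth that can be checked with a one-line calculation. The helper name is hypothetical, and decimal units (1 GB = 10^9 bytes) are assumed.

```cpp
#include <cassert>

// Effective bandwidth in GB/s for `bytes` moved in `milliseconds`.
double effective_bandwidth_gb_per_s(double bytes, double milliseconds) {
  assert(milliseconds > 0.0);
  const double seconds = milliseconds * 1e-3;
  return bytes / seconds / 1e9;
}
```

Calling `effective_bandwidth_gb_per_s(800e6, 4.1)` gives roughly 195 GB/s, a large share of the ~273–301 GB/s listed for the device, which is consistent with the kernel being limited by memory bandwidth rather than compute.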

Phase 2: Key insertion and index mapping

Every input row is processed by mapping_indices_kernel. For each row, the thread performs three steps:

Block-local deduplication: find_local_mapping() inserts the row’s key into shared_set, a block-private mini hash set (a cuco::static_set_ref) backed by __shared__ slots[] (capacity = GROUPBY_CARDINALITY_THRESHOLD = 128 unique keys). shared_set is used only for existence checks (new key vs. duplicate); a separate flat __shared__ array shared_set_indices[rank] = row_idx maps each block-local rank to the first input row that claimed it. local_mapping_indices[row] is written with the block-local group rank (0..127): for a new key it is assigned by atomically incrementing cardinality; for a duplicate it is copied from local_mapping_indices[matched_row] after a block.sync(). local_mapping_indices thus provides a per-block grouping of the rows that is reused in Phase 4, the accumulation step.
Global key registration: find_global_mapping() iterates over shared_set_indices[0..cardinality-1] and inserts each representative row-index into the global cuco::static_set. The CAS inside global_set.insert_and_find() atomically elects a single representative row for that key across all blocks. The winning row-index is stored in global_mapping_indices[block × 128 + rank]. Only one global insertion is made per distinct o_orderstatus value per block, not per row.
Overflow detection: if cardinality > 128, the needs_global_memory_fallback flag is set and all threads in the block break out of the input loop. After the kernel, the host checks this flag and, if it is set, falls back to a slower global-memory aggregation path, run_aggs_by_global_mem_kernel.
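
The CAS-based election in the global set can be illustrated with `std::atomic` on the host. Everything here is an assumption made for illustration: the `RowIndexSet` class, linear probing, `std::hash` on the key, and `-1` as the empty marker are not taken from cuCollections, whose probing scheme and sentinel may differ. What the sketch does share with the description above is that slots store row indices and that key equality is checked by looking up the rows' keys.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

constexpr std::int32_t kSentinel = -1;  // illustrative empty-slot marker

class RowIndexSet {
 public:
  // `keys` must outlive the set: slots hold row indices into it.
  RowIndexSet(std::size_t capacity, const std::vector<std::string>& keys)
      : slots_(capacity), keys_(keys) {
    for (auto& slot : slots_) slot.store(kSentinel);  // Phase 1 analogue
  }

  // Returns the row index elected to represent keys[row]'s group.
  std::int32_t insert_and_find(std::int32_t row) {
    assert(row >= 0 && static_cast<std::size_t>(row) < keys_.size());
    std::size_t slot = std::hash<std::string>{}(keys_[row]) % slots_.size();
    for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
      std::int32_t expected = kSentinel;
      // Try to claim an empty slot; exactly one contender can succeed.
      if (slots_[slot].compare_exchange_strong(expected, row)) return row;
      // CAS failed: `expected` now holds the row index occupying the slot.
      if (keys_[expected] == keys_[row]) return expected;  // same key: reuse
      slot = (slot + 1) % slots_.size();  // different key: keep probing
    }
    assert(false && "set is full");
    return kSentinel;
  }

 private:
  std::vector<std::atomic<std::int32_t>> slots_;
  const std::vector<std::string>& keys_;
};
```

If row 2 ("O") is inserted first, a later insert of row 0 ("O") returns 2: the first row to claim the slot becomes the representative for every block that encounters that key.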

Phase 3, Interlude: Dense output index remapping

When there is no overflow, extract_populated_keys() is invoked to extract unique key row-indices from global_set into a contiguous buffer via cuco::static_set::retrieve_all(), which fires two CUB kernels (DeviceCompactInitKernel + DeviceSelectSweepKernel).

The key transition in this phase is the meaning of global_mapping_indices:

Before Interlude:

global_mapping_indices = representative input row-index, in range [0..N-1]

After Interlude:

global_mapping_indices = dense output index into total_price[], in range [0..K-1]

This is done in 2 steps:

A compute_key_transform_map() step builds the dense renumbering (key_transform_map) that maps any representative input row-index to a compact output slot [0, K):
```
key_transform_map[representative_input_row_idx] 
     = output_group_index   (0..K-1)
```
A second thrust::for_each_n kernel then rewrites global_mapping_indices in place using this map so that every entry holds a finalized output group index.
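
The two steps amount to a scatter followed by an in-place lookup, sketched here sequentially on the host. The function names are hypothetical, and `kUnused` is an assumed marker for rank slots that a block never filled; how libcudf actually skips unused slots is not covered in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kUnused = -1;  // hypothetical marker for unused rank slots

// Step 1: scatter. unique_rows holds the K representative row indices
// retrieved from global_set; position i becomes output group index i.
std::vector<int> build_key_transform_map(const std::vector<int>& unique_rows,
                                         std::size_t num_input_rows) {
  std::vector<int> key_transform_map(num_input_rows, kUnused);
  for (std::size_t out = 0; out < unique_rows.size(); ++out) {
    const int rep = unique_rows[out];
    assert(rep >= 0 && static_cast<std::size_t>(rep) < num_input_rows);
    key_transform_map[rep] = static_cast<int>(out);
  }
  return key_transform_map;
}

// Step 2: rewrite each representative row index into its dense output index.
void remap_in_place(std::vector<int>& global_mapping_indices,
                    const std::vector<int>& key_transform_map) {
  for (int& entry : global_mapping_indices) {
    if (entry != kUnused) entry = key_transform_map[entry];
  }
}
```

With representatives {0, 1, 3}, an entry holding row 3 is rewritten to output index 2, so every block that elected row 3 now points at the same output slot.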

Phase 4: Shared-memory accumulation + flush

This phase is implemented by a single kernel, single_pass_shmem_aggs_kernel.

Each block declares extern __shared__ cuda::std::byte shmem_agg_storage[]: a dynamically-sized shared memory buffer laid out by calculate_columns_to_aggregate() as num_agg_columns × cardinality × sizeof(element_type) bytes (plus alignment padding), where cardinality ≤ GROUPBY_CARDINALITY_THRESHOLD = 128.
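
The size formula can be turned into a quick budget check. This is a rough sketch: both helper names are hypothetical, alignment padding is ignored, all columns are assumed to share one element type, and the per-pass policy shown is not taken from calculate_columns_to_aggregate().

```cpp
#include <cassert>
#include <cstddef>

// Logical accumulator bytes: num_agg_columns x cardinality x sizeof(element),
// before any alignment padding.
constexpr std::size_t shmem_accumulator_bytes(std::size_t num_agg_columns,
                                              std::size_t cardinality,
                                              std::size_t element_size) {
  return num_agg_columns * cardinality * element_size;
}

// How many same-typed aggregation columns fit in one pass under a byte budget.
std::size_t columns_per_pass(std::size_t budget_bytes, std::size_t cardinality,
                             std::size_t element_size) {
  assert(cardinality > 0 && element_size > 0);
  return budget_bytes / (cardinality * element_size);
}
```

One double SUM over 128 ranks needs 1,024 bytes, so under a 99 KB per-block limit this simplified model would allow up to 99 such columns per pass.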

The kernel breaks the computation into a loop covering a number of aggregation output columns based on available shared memory, with the inner loop running the following two sub-phases:

┌─ Sub-phase 1: per-row accumulation into shared memory ──────────────────────┐
│  For each `row` assigned to a block, use the previously generated           │
│  `local_mapping_indices` to aggregate rows with the same key in each block: │
│    shmem_agg_storage[local_mapping_indices[row]] += source_value[row]       │
│    (via cudf::detail::atomic_add into shared memory)                        │
└─────────────────────────────────────────────────────────────────────────────┘
                              |
                          block.sync()
                              |
                              V
┌─ Sub-phase 2: flush partial results to global output columns ───────────────┐
│  For each `unique key` resident in this block:                              │
│    target_global_col[global_mapping_indices[blk×128+rank]]                  │
│        += shmem_agg_storage[rank]                                           │
│    (via cudf::detail::atomic_add into global memory)                        │
└─────────────────────────────────────────────────────────────────────────────┘

target_global_col will contain the final aggregation value for each column.

The global atomic_add in sub-phase 2 is reached via an inlined two-level compile-time template dispatch (type_dispatcher × aggregation_dispatcher) that resolves the runtime column type and aggregation kind to a single pre-compiled specialization with no GPU branching. For SUM on double input (o_totalprice), this lands at update_target_element_gmem, which calls cudf::detail::atomic_add directly.

Wrapup: Output Key Gather

After aggregation, the unique key row-indices retrieved from the hash set are used to gather the corresponding rows from the original input keys table into a dense output keys table:

output_keys[i] = input_keys[unique_key_indices[i]]   for i in [0, K)

For string key columns this gather requires a multi-step CUB prefix scan over character offsets followed by a parallel character copy kernel (gather_chars_fn_char_parallel).
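The string gather can be sketched on the CPU as follows. This assumes the usual offsets-plus-characters layout for string columns; `gather_strings` is a hypothetical helper, and `itertools.accumulate` stands in for the CUB prefix scan, so this shows the data flow rather than the cuDF implementation.

```python
from itertools import accumulate


def gather_strings(offsets, chars, gather_map):
    """Gather rows of a string column stored as offsets + concatenated bytes."""
    # Step 1: length of every selected row.
    lengths = [offsets[row + 1] - offsets[row] for row in gather_map]
    # Step 2: prefix scan over lengths gives the output offsets.
    out_offsets = [0] + list(accumulate(lengths))
    # Step 3: copy each row's bytes to its output position.
    out_chars = bytearray(out_offsets[-1])
    for out_row, src_row in enumerate(gather_map):
        src_start, src_end = offsets[src_row], offsets[src_row + 1]
        out_chars[out_offsets[out_row]:out_offsets[out_row + 1]] = chars[src_start:src_end]
    return out_offsets, bytes(out_chars)


# Five single-character keys: "F", "O", "F", "P", "O"
offsets = [0, 1, 2, 3, 4, 5]
chars = b"FOFPO"
out_offsets, out_chars = gather_strings(offsets, chars, [0, 1, 3])
print(out_offsets, out_chars)  # [0, 1, 2, 3] b'FOP'
```

The scan is needed because the output position of each string depends on the lengths of all strings gathered before it.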

6. Step-By-Step Example of the algorithm: from input rows to final output indices

The example below traces the whole algorithm with two small blocks. The values are artificial, but the roles of local_mapping_indices, global_mapping_indices, unique_key_indices, and key_transform_map match the real execution.

Setup:

2 blocks (B0, B1),
GROUPBY_CARDINALITY_THRESHOLD = 128
K=3 unique aggregation key values, "F", "O", and "P"
MurmurHash3 slot assignments in the 200M-slot global_set:
- hash("F")%200M = 47_000_000
- hash("O")%200M = 103_000_000
- hash("P")%200M = 182_000_000.

Step 1: Input partitioning

Each block is assigned a contiguous slice of the 100M input rows:

Block0 (rows 1000..1004):

Row	Key
1000	"F"
1001	"O"
1002	"F"
1003	"P"
1004	"O"

Block1 (rows 5000..5004):

Row	Key
5000	"O"
5001	"P"
5002	"O"
5003	"F"
5004	"P"

Step 2: Phase 2 block-local rank assignment + global set insertion (`compute_mapping_indices`)

Each block builds a private shmem hash set, assigning a rank to each new key on first encounter.
For every key that is new to that block, it calls insert_and_find(row_idx) on the shared global_set (200M slots, cuda::thread_scope_device) to claim a globally unique slot via CAS.
insert_and_find returns {iterator_to_slot, bool_inserted}.
Dereferencing the iterator (*it) yields the row index stored in that slot: this is always the winning thread’s row_idx, whether or not the calling thread won the CAS race.
That row index is what gets written to global_mapping_indices.

Assume Block0 wins the global CAS races, and each block assigns local ranks in first-seen row order. local_mapping_indices maps each input row to its block-local rank:

Block0 first sees “F”, then “O”, then “P” → F=rank0, O=rank1, P=rank2
Block1 first sees “O”, then “P”, then “F” → O=rank0, P=rank1, F=rank2

local_mapping_indices: block-local rank per row:

Row	Value	Description
1000	0	"F" → rank 0 (first seen) - Block0
1001	1	"O" → rank 1
1002	0	"F" duplicate → rank 0
1003	2	"P" → rank 2
1004	1	"O" duplicate → rank 1
...
5000	0	"O" → rank 0 (first seen) - Block1
5001	1	"P" → rank 1
5002	0	"O" duplicate → rank 0
5003	2	"F" → rank 2
5004	1	"P" duplicate → rank 1
..
N-1

global_set after Phase 2 (200M slots, only 3 occupied) stores the winning representative row for each key; here all three come from Block0.

Slot	Value	Description
hash("F") % 200M	1000	First winning row with key "F"
hash("O") % 200M	1001	First winning row with key "O"
hash("P") % 200M	1003	First winning row with key "P"
all other ~199M slots	SENTINEL	Empty

global_mapping_indices after Phase 2 contains representative input row indices, not yet mapped to the dense output grouping. Since B0 won the CAS races, B0’s winning rows are the ones stored:

Index	Value	Description
[0×128 + 0]	1000	B0 rank 0 ("F") → winning row 1000
[0×128 + 1]	1001	B0 rank 1 ("O") → winning row 1001
[0×128 + 2]	1003	B0 rank 2 ("P") → winning row 1003
[0×128 + 3..127]	SENTINEL	Unused B0 slots
[1×128 + 0]	1001	B1 rank 0 ("O") → uses B0 winning row 1001
[1×128 + 1]	1003	B1 rank 1 ("P") → uses B0 winning row 1003
[1×128 + 2]	1000	B1 rank 2 ("F") → uses B0 winning row 1000
[1×128 + 3..127]	SENTINEL	Unused B1 slots
..	SENTINEL	Unused
NBLOCKS * 128 - 1	SENTINEL	Unused

Note: B1 also attempted to insert “O”, “P”, and “F” but the CAS returned DUPLICATE. The iterator still points to the existing slot, so *it gives the same row index B0 stored. Both blocks therefore agree on the same representative input row index per key.
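The tables above can be reproduced with a short sequential Python sketch. It is a model, not the kernel: a dict stands in for the global hash set, `setdefault` stands in for the CAS election, and processing Block0 first encodes the assumption that Block0 wins every race. The real code also registers keys globally after the local pass rather than inline, which gives the same result here.

```python
SENTINEL = -1    # stand-in for CUDF_SIZE_TYPE_SENTINEL
THRESHOLD = 128  # GROUPBY_CARDINALITY_THRESHOLD

keys = {1000: "F", 1001: "O", 1002: "F", 1003: "P", 1004: "O",
        5000: "O", 5001: "P", 5002: "O", 5003: "F", 5004: "P"}
blocks = [[1000, 1001, 1002, 1003, 1004], [5000, 5001, 5002, 5003, 5004]]

global_set = {}              # key value -> representative row index
local_mapping_indices = {}   # row index -> block-local rank
global_mapping_indices = [SENTINEL] * (len(blocks) * THRESHOLD)

for block_id, rows in enumerate(blocks):
    ranks = {}               # block-private set: key value -> local rank
    for row in rows:
        key = keys[row]
        if key not in ranks:
            ranks[key] = len(ranks)                        # first-seen order
            representative = global_set.setdefault(key, row)  # existing winner, or this row
            global_mapping_indices[block_id * THRESHOLD + ranks[key]] = representative
        local_mapping_indices[row] = ranks[key]

print([local_mapping_indices[r] for r in blocks[0]])   # [0, 1, 0, 2, 1]
print([local_mapping_indices[r] for r in blocks[1]])   # [0, 1, 0, 2, 1]
print(global_mapping_indices[0:3])                     # [1000, 1001, 1003]
print(global_mapping_indices[128:131])                 # [1001, 1003, 1000]
```

The two blocks produce identical local rank sequences even though their ranks refer to different keys, which is why the per-block global mapping is needed.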

Step 3: `extract_populated_keys()`: compact global_set → unique_key_indices

retrieve_all() scans global_set linearly from slot 0 to slot 199M via cub::DeviceSelect::If, collecting the row-index values stored in each non-SENTINEL slot:

scan order: slot hash("F")%200M comes first, then hash("O")%200M, then hash("P")%200M
            (i.e. in ascending slot-position order, regardless of insertion order)

unique_key_indices = [1000, 1001, 1003]   ← representative input row index per slot, in slot-scan order
                       i=0    i=1    i=2

These are the same row indices already in global_mapping_indices, just deduplicated by scanning the hash table. Their position in unique_key_indices (0, 1, 2) defines the dense output row each key will occupy.
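A toy version of this compaction, assuming a 20-slot table instead of 200M and hypothetical slot positions that keep the same relative order as in the example, might look like this. The list comprehension stands in for the CUB select kernels.

```python
SENTINEL = -1
CAPACITY = 20  # toy capacity; the real table has 200M slots


def retrieve_all(slots):
    """Collect the occupied slots in ascending slot order."""
    return [value for value in slots if value != SENTINEL]


# Hypothetical slot positions with slot("F") < slot("O") < slot("P").
global_set = [SENTINEL] * CAPACITY
global_set[4] = 1000    # "F"
global_set[10] = 1001   # "O"
global_set[18] = 1003   # "P"
unique_key_indices = retrieve_all(global_set)
print(unique_key_indices)  # [1000, 1001, 1003]

# If "P" happened to hash to an earlier slot, it would come first in the output.
reordered = [SENTINEL] * CAPACITY
reordered[2] = 1003
reordered[4] = 1000
reordered[10] = 1001
print(retrieve_all(reordered))  # [1003, 1000, 1001]
```

The second case shows that output row order is a property of the hash positions, not of the input order or of which block won.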

Step 4: `compute_key_transform_map()`: invert unique_key_indices via thrust::scatter

Scatters counting values 0, 1, 2 to positions unique_key_indices[0,1,2]. The result is an array of size N (number of input rows), where each populated index is the representative input row index mapped to its final dense output row.

Index	Value	Description
[1000]	0	Row 1000 ("F") → dense output row 0
[1001]	1	Row 1001 ("O") → dense output row 1
[1002]	-	-
[1003]	2	Row 1003 ("P") → dense output row 2
all other ~99M entries	(uninitialized)	Irrelevant; never read
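The scatter in this step amounts to inverting unique_key_indices. A minimal sketch, using a Python list for the size-N array and `None` for entries that the real code leaves uninitialized (both are illustrative choices):

```python
def compute_key_transform_map(unique_key_indices, num_input_rows):
    """Scatter 0..K-1 to the positions named by unique_key_indices."""
    key_transform_map = [None] * num_input_rows  # None marks never-written entries
    for dense_index, representative_row in enumerate(unique_key_indices):
        key_transform_map[representative_row] = dense_index
    return key_transform_map


key_transform_map = compute_key_transform_map([1000, 1001, 1003], num_input_rows=6000)
print(key_transform_map[1000], key_transform_map[1001], key_transform_map[1003])  # 0 1 2
print(key_transform_map[1002])  # None
```

Only K entries are ever written, and only those K entries are read back in the next step, so the untouched entries cost nothing beyond the allocation.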

Step 5: `thrust::for_each_n`: rewrites global_mapping_indices in-place with dense output rows

Each non-SENTINEL entry (a representative input row index in 0..N-1) is replaced with key_transform_map[old_idx] (the corresponding dense output row in 0..K-1). The representative rows 1000, 1001, and 1003 are not usable as output indices directly; there are only K=3 output rows, so they must be remapped to 0, 1, and 2:

global_mapping_indices after remapping (dense output indices, replacing representative row indices). Notice that each key now maps to the same output row in every block, whatever its block-local rank; this is what makes it a global mapping.

Index	Value	Description
[0×128 + 0]	0	B0 rank 0 ("F") → output row 0
[0×128 + 1]	1	B0 rank 1 ("O") → output row 1
[0×128 + 2]	2	B0 rank 2 ("P") → output row 2
[0×128 + 3..127]	SENTINEL	Unused B0 slots
[1×128 + 0]	1	B1 rank 0 ("O") → output row 1
[1×128 + 1]	2	B1 rank 1 ("P") → output row 2
[1×128 + 2]	0	B1 rank 2 ("F") → output row 0
[1×128 + 3..127]	SENTINEL	Unused B1 slots
..	SENTINEL	Unused
NBLOCKS * 128 - 1	SENTINEL	Unused
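The in-place rewrite can be sketched as a single pass over the array. A dict is used here as a sparse stand-in for the size-N key_transform_map, and the loop stands in for the thrust::for_each_n kernel.

```python
SENTINEL = -1
THRESHOLD = 128

key_transform_map = {1000: 0, 1001: 1, 1003: 2}  # sparse stand-in for the size-N array

# State after Phase 2 for the two blocks of the example.
global_mapping_indices = [SENTINEL] * (2 * THRESHOLD)
global_mapping_indices[0:3] = [1000, 1001, 1003]      # B0 ranks 0..2
global_mapping_indices[128:131] = [1001, 1003, 1000]  # B1 ranks 0..2

for i, old_value in enumerate(global_mapping_indices):
    if old_value != SENTINEL:  # unused slots are left alone
        global_mapping_indices[i] = key_transform_map[old_value]

print(global_mapping_indices[0:3])      # [0, 1, 2]
print(global_mapping_indices[128:131])  # [1, 2, 0]
```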

Step 6: Kernel 2, accumulate + flush (`compute_shared_memory_aggs`)

Now we have a mapping from every row to its block-local accumulator, and from every block-local accumulator to its global output row. Each block reads its rows, accumulates o_totalprice into shmem using local_mapping_indices[row] as the shmem slot, then flushes at most 128 partial sums to global memory using global_mapping_indices[block*128 + local_rank] as the total_price output index.
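To close the loop on the example, here is a sequential sketch of the accumulate-then-flush pattern. The o_totalprice values are made up for illustration, and plain `+=` stands in for the shared-memory and global atomic adds.

```python
THRESHOLD = 128
blocks = [[1000, 1001, 1002, 1003, 1004], [5000, 5001, 5002, 5003, 5004]]

# Made-up prices, chosen so the sums are easy to check by hand.
o_totalprice = {1000: 10.0, 1001: 20.0, 1002: 30.0, 1003: 40.0, 1004: 50.0,
                5000: 1.0, 5001: 2.0, 5002: 3.0, 5003: 4.0, 5004: 5.0}
local_mapping_indices = {1000: 0, 1001: 1, 1002: 0, 1003: 2, 1004: 1,
                         5000: 0, 5001: 1, 5002: 0, 5003: 2, 5004: 1}
global_mapping_indices = {0 * THRESHOLD + 0: 0, 0 * THRESHOLD + 1: 1, 0 * THRESHOLD + 2: 2,
                          1 * THRESHOLD + 0: 1, 1 * THRESHOLD + 1: 2, 1 * THRESHOLD + 2: 0}

total_price = [0.0, 0.0, 0.0]  # output rows for "F", "O", "P"
global_adds = 0

for block_id, rows in enumerate(blocks):
    shmem = [0.0] * THRESHOLD  # per-block accumulators
    used_ranks = set()
    for row in rows:           # sub-phase 1: one shared-memory add per row
        rank = local_mapping_indices[row]
        shmem[rank] += o_totalprice[row]
        used_ranks.add(rank)
    for rank in sorted(used_ranks):  # sub-phase 2: one global add per resident key
        total_price[global_mapping_indices[block_id * THRESHOLD + rank]] += shmem[rank]
        global_adds += 1

print(total_price)   # [44.0, 74.0, 47.0]
print(global_adds)   # 6
```

Ten input rows result in only six global adds, three per block, which is the saving the algorithm is built around.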

7. Algorithm Complexity Summary

Assuming the following:

N = total number of input rows (100M in this dataset).
K = number of distinct groupby keys.
capacity = hash-table size (2N slots = 200M).

Stage	Time complexity	Dominant cost
Phase 1: hash set init	O(N)	Memory bandwidth: write sentinel to 2N slots (~4.1 ms)
Phase 2: key insertion + local mapping	O(N) avg	Hash probing + atomic inserts
Phase 3 Interlude: unique key extraction + dense index remap	O(capacity) = O(2N)	Not O(K): `retrieve_all()` must scan every one of the 200M hash-table slots to find the K occupied ones. The cost is fixed by table size, not by the number of distinct keys (~3.4 ms even when K=3)
Phase 4: SUM accumulation	O(N)	Shared-memory atomics (fast) + global atomics (flush)
Key gather	O(K + total key bytes) for strings	Offset scan + character copy

Total: O(N) average with low constant factors when cardinality ≤ 128 groups per block. The asymptotic result is simple; the practical win comes from changing global atomic frequency from per-row to per-block-per-group.
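A back-of-the-envelope count makes the last point concrete. The figure of 10,000 rows per block below is purely hypothetical (the text does not state the launch configuration), and the count covers a single aggregated column.

```python
import math


def global_atomics_per_row(num_rows):
    """Naive path: one global atomic per input row."""
    return num_rows


def global_atomics_per_block_group(num_rows, rows_per_block, groups_per_block):
    """Shared-memory path: one global atomic per (block, resident group) pair."""
    num_blocks = math.ceil(num_rows / rows_per_block)
    return num_blocks * groups_per_block


N = 100_000_000
naive = global_atomics_per_row(N)
typical = global_atomics_per_block_group(N, rows_per_block=10_000, groups_per_block=3)
worst = global_atomics_per_block_group(N, rows_per_block=10_000, groups_per_block=128)

print(naive)    # 100000000
print(typical)  # 30000
print(worst)    # 1280000
```

Even at the 128-group limit, the shared-memory path issues far fewer global atomics than the per-row path under these assumed block sizes.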

8. Appendix: A Deeper Look into hash set global_set

The hash groupby is built around a device-side open-addressing hash set (cuco::static_set), referred to as global_set in the code, that stores one representative input row-index per unique key. It does not store the aggregation key values directly; instead, each stored row-index points back into the original key column, and the row hasher/comparator use that row to hash and compare the key value.

Since many rows can have the same aggregation key, insert_and_find() uses CAS (compare-and-swap) to claim empty global slots and elect one representative row for each key across all blocks. Each block first maintains its own block-private shared_set in shared memory to deduplicate rows locally, then only the block-local representative rows are inserted/looked up in global_set.

Multiple blocks may attempt to register the same key, but only the first successful CAS writes that key’s global representative row-index into the set. In order to minimize collision cost without knowing the distinct key count in advance, the set’s capacity is sized for the worst case in which every input row has a distinct key: twice the number of rows in the dataset.

global_set slot layout:

N = 100M rows
load factor = 50%
capacity = 2 × num_input_rows = 200M slots

Example content with 2 unique key values:

#	value	Notes
0	EMPTY
1	EMPTY
2	7	Row 7 has a unique `o_orderstatus` value that hashes into this slot
3	EMPTY
…	…
1000	12	Row 12 has a different unique `o_orderstatus` value that hashes into a different slot in this set
…	EMPTY
199M	EMPTY

Set design

The hash set is constructed in compute_groupby() with the following specifications:

Key type: int32_t (cuDF size_type). Row hashing and equality comparison are performed by cuDF’s row comparator against the o_orderstatus (utf8) column. MurmurHash3 over character bytes, byte-wise equality.
Capacity: 2 × N slots where N = num_input_rows (used as a worst-case upper bound for distinct key count; CUCO_DESIRED_LOAD_FACTOR = 0.5). For N = 100M rows: 200M slots × 4 bytes = 800 MB. Construction fires cub::detail::for_each::static_kernel to fill all slots with the sentinel in parallel. For 100M rows, initialization costs 4.105 ms, ~23.5% of total groupby kernel time.
Probing scheme: cuco::linear_probing<1, row_hasher_with_cache_t>. Linear probing with CGSize=1 (each probe step is handled by a single thread, advancing one slot at a time), with an optional row-hash cache (pre-computed hashes stored in a device_uvector).
Thread scope: cuda::thread_scope_device. All GPU threads can access the same set.
Sentinel: CUDF_SIZE_TYPE_SENTINEL = INT32_MAX. Marks empty slots.
Memory: rmm::mr::polymorphic_allocator. Backed by the caller-supplied RMM pool.
Storage layout: cuco::storage. Two-level slot hierarchy: array of buckets, each holding BucketSize contiguous slots. BucketSize > 1 lets a thread probe multiple slots per step (beneficial for memory-bandwidth-bound workloads). For cuDF GroupBy, hardcoded to GROUPBY_BUCKET_SIZE = 1 (flat per-slot probing); appropriate here since key cardinality is low and contention is minimal.
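The capacity and memory figures in this list follow from simple arithmetic, sketched below. The helper names are hypothetical; the load factor of 0.5 and the 4-byte slot size come from the specifications above.

```python
def global_set_capacity(num_input_rows, load_factor=0.5):
    """Slots needed so that num_input_rows distinct keys fill the table to load_factor."""
    return int(num_input_rows / load_factor)


def global_set_bytes(num_input_rows, slot_bytes=4, load_factor=0.5):
    """Memory for the slot array: one int32 row index per slot."""
    return global_set_capacity(num_input_rows, load_factor) * slot_bytes


slots = global_set_capacity(100_000_000)
table_bytes = global_set_bytes(100_000_000)
print(slots)              # 200000000
print(table_bytes / 1e6)  # 800.0 (MB)
```

Because the table is sized from the row count rather than the key count, this cost is the same whether the data has 3 distinct keys or 100M.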

Finding/Inserting a key in the set

The set stores row indices (int32_t), not actual key values. When the set needs to hash or compare a candidate slot, it calls back into the original input column data on the GPU (via d_row_hash). This indirection is set up before any kernels run, in dispatch_groupby():

preprocessed_table::create(keys, stream): copies the column_device_view metadata structs (data pointers, null masks, type IDs) into a GPU buffer so kernels can dereference them. The actual column bytes were already in GPU memory via RMM. Cost: ~143 bytes (one string column’s metadata, as seen in the RMM trace).
self_comparator: host factory that wraps the preprocessed_table and produces device_row_comparator, a GPU callable implementing operator()(i, j) → byte-wise string equality via type_dispatcher.
row_hasher: same pattern; produces device_row_hasher, a GPU callable implementing operator()(i) → MurmurHash3 over all columns of row i. Both share the same preprocessed_table via shared_ptr to avoid a redundant GPU upload.

These two callables are then embedded directly into the cuco::static_set constructor as the probing scheme and equality comparator, so every insert and lookup the set performs reaches back into the original key column memory.

insert_and_find(i) logic for row index i:

1. slot = d_row_hash(i) % 200M_slots          ← initial probe position from o_orderstatus string bytes

2. occupant = *slot
  pre-CAS check: d_row_equal(i, occupant)    ← does the row stored in this slot match row i's key?
      EQUAL     → return {slot, false}         ← key seen before; occupant is the representative (no CAS needed)
      AVAILABLE → go to step 3                 ← slot is empty (SENTINEL); attempt insert
      UNEQUAL   → slot += 1, repeat step 2    ← occupied by a different key; linear probe

3. CAS(slot, SENTINEL, i)                     ← atomically try to claim this empty slot
      SUCCESS   → return {slot, true}          ← we won; row i is now the representative
      DUPLICATE → return {slot, false}         ← another thread won the same key; slot holds the representative
      CONTINUE  → repeat step 2 at same slot  ← a different key raced us here; re-probe from this slot

Phase 2 kernel mapping_indices_kernel uses this operation in two scopes. First, each row probes the block-private shared_set to get a block-local rank. Then only the rows that represent keys new to that block probe the global global_set, where the CAS in step 3 performs the cross-block election: whichever thread wins the compare-and-swap for a given o_orderstatus value becomes the globally agreed representative row for that key. The CONTINUE result (a raced-but-different-key loss) sends the thread back to re-evaluate the slot it just lost, not to advance, since the winner may have written a key equal to i.
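The probing logic can be modeled in a few lines of single-threaded Python. This is a sketch under several assumptions: `toy_hash` (a byte sum) replaces MurmurHash3, the table has 9 slots, and a plain store replaces the CAS, so the DUPLICATE and CONTINUE outcomes of a real race never occur here.

```python
SENTINEL = -1


def toy_hash(key):
    """Hypothetical stand-in for the row hasher: sum of the key's bytes."""
    return sum(key.encode())


def insert_and_find(slots, row, key_of):
    """Return (slot_index, inserted). The table stores row indices, never key values."""
    capacity = len(slots)
    slot = toy_hash(key_of(row)) % capacity
    for _ in range(capacity):
        occupant = slots[slot]
        if occupant == SENTINEL:
            slots[slot] = row               # single-threaded stand-in for CAS(slot, SENTINEL, row)
            return slot, True
        if key_of(occupant) == key_of(row):  # compare keys by reaching back into the column
            return slot, False
        slot = (slot + 1) % capacity         # different key: linear probe to the next slot
    raise RuntimeError("hash set is full")


column = {7: "F", 12: "O", 13: "F", 20: "P"}  # row index -> key value
slots = [SENTINEL] * 9

results = [insert_and_find(slots, row, column.get) for row in (7, 12, 13, 20)]
print(results)  # [(7, True), (8, True), (7, False), (0, True)]
print(slots)    # [20, -1, -1, -1, -1, -1, -1, 7, 12]
```

In this toy table "F" and "O" both hash to slot 7, so row 12 probes forward to slot 8, and "P" then wraps around to slot 0. Row 13 finds the existing "F" representative and reports that nothing was inserted.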

NVIDIA GTC 2026 Accelerated Analytics - Part 2: Industry Use Cases and Training Labs

2026-04-17T07:00:00+00:00

This is Part 2 of my series on Accelerated Analytics at GTC 2026, focusing on 3 industry talks and 2 DLI training workshops. Read Part 1: Technical Deep Dives.

This post tackles the following sessions:

Quais Taraki (CTO, EDB) shows how standard Postgres breaks under agentic query loads and walks through PGAA — a GPU-accelerated HTAP solution that swaps the Postgres compute back-end for Iceberg + Spark RAPIDS, achieving 100× TPC-DS speedup and enabling a complete LangFlow-based agentic stack on top.
Liang Chen and Prudhvi Vatala from Snap detail how Spark RAPIDS cut A/B pipeline costs by 90% — not through any Spark tuning magic, but by rerouting 11,000 idle inference L4s into a three-tier fallback Spark fleet at near-zero incremental cost.
Harishankar G and Jalakandeshwaran A from Zoho give a deep dive into Velociraptor, their in-house GPU OLAP engine built as a Postgres extension, which runs all 22 TPC-H queries at 1 TB on a single H200 in under two minutes — then explain why PCIe is still the bottleneck even after every I/O optimization.
Hirakendu Das, Navin Kumar, and Rishi Chandra lead a hands-on Spark RAPIDS workshop covering the cuDF plugin, Project Aether’s automated qualify → tune → validate loop, and Ether Assistant’s LLM-based UDF rewriter.
Allison Ding walks through a full GPU data science pipeline — from zero-copy feature engineering with cuDF and GPU Polars, through cuML model training (k-means 40×, XGBoost 7×), to Triton Inference Server deployment with dynamic batching.

Industry Use Cases

🔗 [EDB] Supercharging Postgres for Agentic Analytics with Rapids Accelerator and Apache Iceberg

Quais Taraki · CTO, EDB

NVIDIA Session overview

NVIDIA GTC 2026 Accelerated Analytics - Part 1: Technical Deep Dives

2026-04-09T07:00:00+00:00

Accelerated Analytics for structured and unstructured data had a strong presence at this year’s GTC conference. First in the keynote, CEO Jensen Huang spent a good 20 minutes discussing how Enterprise AI offerings are powered by NVIDIA, with his “favorite” slides featuring NVIDIA’s RAPIDS libraries cuDF and cuVS sitting at the bottom of the whole software ecosystem for acceleration. See my post covering the full keynote for more.

Then there were many sessions covering these developments. In this post, I cover the four main technical ones.

Joshua Patterson and Todd Mostak open with a state-of-the-union on GPU-accelerated data processing: where CPU analytics performance has stalled, how the NVIDIA ecosystem closes the gap, and what a next-generation analytics cluster looks like.
Greg Kimball and Zoltán Arnold Nagy then zoom into Presto specifically, walking through the concrete engineering required to turn GPU acceleration from theory into production reality at lakehouse scale.
Bobbi Yogatama and Xiangyao Yu bring that story down to a single node, showing how their Sirius extension turns DuckDB into a record-breaking analytics engine without changing a single query.
Finally, Felipe and Rodrigo Aramburu go one layer deeper with cuCascade and a custom telemetry tool, the composable building blocks behind Sirius’s ability to handle datasets far larger than GPU memory.

This is Part 1 of my series on Accelerated Analytics at GTC 2026. Read Part 2: Industry Use Cases and Training Labs.

Technical Deep Dives

🔗 The Era of GPU Data Processing: From SQL to Search and Back Again — S81769

Joshua Patterson · VP, Solutions Architecture, NVIDIA
Todd Mostak · Sr. Director of Engineering, NVIDIA

NVIDIA Session overview

GTC 2026 Keynote — Part 3: Vera Rubin Hardware, OpenClaw & Robotics

2026-04-05T07:00:00+00:00

This is Part 3 of a 3-part breakdown of the GTC 2026 keynote. Start with Part 1: Overview & Context or go back to Part 2: Intro, Analytics, CUDA-X & Inference. The single-page version is also available.

Previously in Parts 1 & 2: After setting the scene at GTC, Jensen spent the first half of the keynote celebrating CUDA’s 20-year flywheel, making the case for NVIDIA’s role in accelerating enterprise analytics (with partnerships from IBM, Dell, and Google Cloud), reviewing the CUDA-X library ecosystem, and laying out the economics of the AI inference boom, framing the $1T infrastructure wave ahead and how GB300 NVL72 became the inference king on tokens-per-watt.

Summary of Part 3 sections

The second half of the keynote covered the following sections:

Duration	Section
38 min	Full Vera Rubin hardware stack + DSX platform — Showing Vera Rubin + Groq hardware and explaining how they improve the throughput vs. interactivity performance curves
19 min	OpenClaw, NemoClaw, Open Model Coalition — Praising the explosive growth of OpenClaw as a revolutionary moment, and announcing NVIDIA’s enterprise reference NemoClaw and the open model coalition
14 min	Robotics, Physical AI, & recap — Describing the evolution of physical AI and the robotic landscape and recapping with a specially generated music video

Full Vera Rubin hardware stack — GPU, NVLink, Rubin Ultra, and Spectrum-X Groq LPX + DSX platform for AI factory optimization (38min)

A Decade of AI Infrastructure Innovation: From DGX-1 to Vera Rubin · 3:30min

Jensen narrates NVIDIA's decade of data center infrastructure innovation:

2016

DGX-1 — packages 8 Pascal GPUs, the first supercomputer built for deep learning, with one delivered to OpenAI that year

2017

Volta — introduces NVLink 2 switch, GPU-to-GPU interconnect inside nodes

2019

Mellanox acquisition — allows the data center to become a single unit of computing

2020

Ampere / DGX A100 SuperPOD — brings scale-up via NVLink 3, scale-out via ConnectX-6 InfiniBand

2022

Hopper — supports FP8 Transformer Engine for Gen AI, NVLink 4, ConnectX-7

2024

Blackwell / NVL72 — achieves 130 TB/s bandwidth and a deeper rack-level co-design for top performance

2026

Vera Rubin — built for agentic AI · 35× throughput/MW · 40M× cumulative compute over the decade

NVIDIA Vera Rubin · 2:27min

Jensen introduces the Vera Rubin hardware on stage.

NVIDIA Vera Rubin, NVLink and Groq · 1:36min

He makes some interesting observations: with the recent tray designs, installation time falls from 2 days to 2 hours. Cooling is also done with hot water at 45 °C.

Spectrum-X Switch, Co-Packaged Optics, Vera and BlueField-4 · 2:09min

Jensen discusses the 8 Groq 3rd-gen tray, which is in production, and shows the Spectrum co-packaged optics switch. Vera brings 2x performance per watt. ConnectX-9 and the storage platform are powered by the Vera CPU.

Rubin Ultra · 2:03min

Jensen also shows VR Ultra and the new Kyber rack, which can connect 144 GPUs that now slide vertically into the rack. He also shows the new NVLink tray design that sits behind them, also vertically.

Inference Performance and Efficiency Drive Company Results · 9:35min

Jensen's main message to CEOs is that they will need to evaluate their company's token usage and study the tradeoff between throughput (tokens per second per MW) and interactivity (tokens per second per user). Input and output context lengths are growing, and usage depends on the use case. Jensen shows a graph partitioned by kind of model at different prices and how NVIDIA's chips perform on this tradeoff. The value of Ultra lies in enabling bigger, more interactive models with better energy efficiency. GB NVL72 has increased the medium tier by 35x, and Vera Rubin will increase the high tier by 3x and the premium tier by 10x. Rubin + Groq LPX increases the most valuable tier by 35x. Ultra enables even better interactivity.

Uniting Processors of Extreme Performances · 3:36min

Jensen delves into the performance of Groq, which has high SRAM capacity (500MB) at very high throughput (150TB/s). This complements Rubin's 288GB of HBM4 memory at 22TB/s by providing statically compiled compute primitives used specifically for the decode feed-forward phase of AI inference, and helps achieve very low latency for token generation.

NVIDIA Groq 3 LPX · 0:38min

Jensen shows the Groq LPX, manufactured by Samsung, and says he expects it to ship by Q3 this year.

Announcing NVIDIA Launch Partners · 1:56min

Jensen shows all the AI labs, cloud providers, and OEMs/ODMs that will launch Vera Rubin. He expects production in the 1000s per week. He also shows launch partners for the Vera CPU and BlueField storage systems.

NVIDIA Vera Rubin: 7 Chips – 5 Rack Systems · 1:02min

Jensen shows how much progress was made by comparing the x86 Hopper generation to the Vera Rubin gigawatt factory. VR can generate 350x more tokens per second than Hopper thanks to 35x more scale-up bandwidth per rack (at 288TB/s), and with half as many GPUs.

NVIDIA Extreme Co-Design Delivering X-Factors Every Year · 3:37min

Jensen shows the roadmap to 2028 with Feynman. Oberon will enable scale-up in both copper and optical to support NVL576 racks (Kyber) and then NVL1152 for Feynman with Kyber.

NVIDIA DSX AI Factory Platform · 2:10min

Jensen describes the importance of the NVIDIA Omniverse solution to help design GW factory digital twins and reach max performance at lowest possible energy usage. He talks about tools for simulation such as DSX Sim, DSX exchange, DSX flex power management and DSX Max Q for dynamic power adjustment in the data center.

How AI Factories Maximize Tokens, Power, and Profit With NVIDIA DSX · 3:25min

The video summarizes all the components of the DSX AI factory platform

Space-1 Vera Rubin Module · 0:43min

Jensen briefly mentions NVIDIA's foray into space with the Space-1 Vera Rubin module and notes the challenge of cooling in space.

OpenClaw, NemoClaw, Open Model Coalition (19min)

NemoClaw for OpenClaw · 1:24min

Jensen is very excited about OpenClaw, the most popular open-source project in history and the fastest project to reach the most stars on GitHub.

OpenClaw: The ChatGPT Moment for Long-Running, Autonomous Agents · 9:14min

He shows how OpenClaw grew as a project to 340k stars on GitHub since the end of January 2026. It is the operating system of agents, and every enterprise will soon need an OpenClaw strategy.

NVIDIA Nemotron and Open Models · 0:28min

Jensen announces new models in NVIDIA's open foundation model families: BioNeMo for biomedical AI, Earth-2 for AI physics, Nemotron for agentic AI, Cosmos for physical AI, GR00T for robotics, and Alpamayo for autonomous vehicles.

How NVIDIA Open Models Power Every Industry's AI · 4:17min

The video shows models from each of the NVIDIA families. They are world class and do well on benchmarks. It shows nemotron-3-super-120b ranked #4 among the best open models for OpenClaw, and mentions Nemotron 3 Ultra.

Announcing Global AI Leaders Join NVIDIA Nemotron Coalition · 2:57min

Jensen announces the NVIDIA Nemotron Coalition¹ aimed at accelerating the co-development of open AI frontier models with partners Black Forest Labs, Cursor, LangChain, Mistral AI, Perplexity, Reflection AI, Sarvam and Thinking Machines Lab

Announcing NVIDIA NemoClaw Reference OpenClaw · 0:39min

Jensen says the OpenClaw event cannot be overstated and is as big as Linux and HTML. In response, NVIDIA is releasing NemoClaw, a reference enterprise-ready solution to secure OpenClaw deployments inside enterprises.

Robotics, Physical AI, & recap (14min)

Physical AI and Robotics · 3:11min

Jensen talks robots and mentions that there are 110 robots at GTC. He announces 4 new auto partners: BYD, Hyundai, Nissan, and Geely are joining Mercedes, Toyota, and GM to build robotaxi technologies. Jensen also announces a partnership with Uber to launch a large fleet of autonomous vehicles for 2027 on the NVIDIA DRIVE AV stack.²

The Age of Physical AI and Robotics · 4:27min

This video shows how autonomous cars have been improving thanks to NVIDIA and its partner ecosystem.

Olaf Takes the Stage With Jensen Huang · 1:55min

Jensen welcomes the only guest at the keynote. Last year it was a Star Wars-inspired robot, "Blue"; this year it is Olaf from Frozen.

Official Keynote Closing Video · 4:02min

The keynote ends with a generated video recapping the keynote, with a Jensen emoticon playing harmonica in the forest, surrounded by a band of robots playing instruments. It is a bit silly for my taste, but it again showcases the power of the tools.

Full keynote is available here and the slides here.


References

NVIDIA Launches Nemotron Coalition of Leading Global AI Labs to Advance Open Frontier Models — nvidianews.nvidia.com
NVIDIA DRIVE Hyperion Achieves Level 4 Autonomy with Uber Partnership — nvidianews.nvidia.com

NVIDIA GTC 2026 Conference: The Keynote

2026-04-05T07:00:00+00:00

Prefer a section-by-section breakdown? This keynote is also available as a 3-part series starting with Part 1.

I was back this year for the 2026 edition of NVIDIA’s GTC conference held at the San Jose Convention Center and surroundings from March 16-19.

Like last year, there was plenty of energy at the conference with attendee numbers said to have reached more than 30k. The conference was packed with interesting technical sessions on new developments in the NVIDIA ecosystem including technical sessions on CUDA-X libraries and industry and state partners presenting how they have integrated the NVIDIA stack into their products.

The conference expanded to the nearby hotels for additional space, the security check-ins were moved out of the convention center and onto the street, and an additional lunch section was added in the parking lot in front of the Hilton Hotel on S. Almaden Road.

Finally, the keynote was held, as in previous years, at the SAP Center, a 15-minute walk away, with a larger pavilion set up just outside of it for free coffee and pastries and for hosting the “pre-game” show featuring executives and technical leaders of companies working with NVIDIA. Other than that, the conference looked about the same as last year!

In this post, I will only cover the keynote and will delve into the sessions I attended and the exhibit hall in followup posts.

The Keynote

The keynote was the main event, held on the first day of the conference, and it was moved ahead to 11AM, making it easier to get there early and avoid long lines. Here are some pictures from the packed SAP Center stadium where it was held.

As he does every year, Jensen showed hardware on stage, including the new Vera Rubin tray, the new Groq LPX tray, and the new Co-Packaged Optical switch tray for scaling up. He also showed Rubin Ultra and its Kyber rack design, where trays are inserted vertically instead of horizontally. The exhibit hall had all these nicely on display.

One interesting aspect I wasn’t expecting was Jensen spending 18 minutes almost at the outset of the keynote talking about how NVIDIA’s libraries sit at the foundation of accelerated analytics for enterprise structured and unstructured data. He announced several partnerships with the cloud providers and highlighted how many of NVIDIA’s solutions accelerate CSPs’ offerings. I will cover the analytics aspects of the conference in a separate post.

Jensen reveled in being crowned “inference king” by SemiAnalysis for the GB NVL72 system! Also check out their review¹ of the GTC conference.

CUDA is 20 years old

CUDA is now 20 years old, and Jensen celebrated that by spending a few extra minutes talking about its core importance to NVIDIA as a company. He emphasized the crucial flywheel role that CUDA-X plays for NVIDIA as an ecosystem of hundreds of libraries for accelerating all kinds of workloads. As the install base for CUDA has grown, reaching hundreds of millions of GPUs deployed around the world, so has the reach to developers, leading to new breakthroughs in many domains, each creating new markets and new customers who then want to buy more GPUs, further growing the user base.

The Vera Rubin POD is expanding: Seven Chips, Five Rack-scale Systems

One of the major reveals at this year’s conference and worth re-emphasizing is the addition of the Groq LPU to speed up AI inference and the addition of co-packaged optics for the scale network. The NVIDIA AI factory is built around five rack types, and a full Vera Rubin POD “features 40 racks, 1.2 quadrillion transistors, nearly 20,000 NVIDIA dies, 1,152 NVIDIA Rubin GPUs, 60 exaflops, and 10 PB/s total scale-up bandwidth”²

The VR NVL72 GPU node
The newly announced companion Groq LPU rack offloading part of the AI inference pass (decode)
BlueField-4 to store KV cache offloaded from the GPU memory
Vera CPU Rack for more general Agentic workloads and RL, and
the Spectrum-6 networking rack to connect the whole POD.

Summary of the Keynote by section

Here’s a short breakdown of the main sections Jensen covered in the Keynote.

Duration	Section	Summary
16 min	Intro, CUDA flywheel, Graphics improvements	Celebrating CUDA’s 20th anniversary and showing DLSS 5 graphics improvements
22 min	Accelerated Analytics	Emphasizing NVIDIA’s role in accelerating enterprise analytics and many of the CSPs’ AI offerings in the agentic era
7 min	CUDA-X review and AI native companies	Reviewing the library ecosystem that forms CUDA-X
22 min	AI Inference Inflection + Datacenter efficiency overview	Discussing the AI inference inflection point and how CEOs will be evaluating their agentic companies
38 min	Full Vera Rubin hardware stack + DSX platform	Showing Vera Rubin + Groq hardware and explaining how they improve the throughput vs. interactivity performance curves
19 min	OpenClaw, NemoClaw, Open Model Coalition	Praising the explosive growth of OpenClaw as a revolutionary moment, and announcing NVIDIA’s enterprise reference NemoClaw and the open model coalition
14 min	Robotics, Physical AI, & recap	Describing the evolution of physical AI and the robotic landscape and recapping with a specially generated music video

Find the breakdown below, linking directly into each section on the YouTube video, along with summary notes and section durations.

Intro, CUDA flywheel, Graphics improvements (16min)

Tokens, the Building Blocks of AI · 3:15min

The keynote starts with an inspiring video describing how AI tokens are the main "commodity" produced by AI factories, and their power to unlock new knowledge and possibilities.

Welcome to GTC 2026 · 2:47min

Jensen enters the stage and gives introductory remarks, thanking the pre-game show hosts and explaining that the conference will cover the AI "5 layer cake", a reference to his blog post that divides the stack into Energy, Chips, Infrastructure, Models, and Applications.

20 Years of CUDA · 4:21min

Jensen reviews the flywheel that CUDA software has been enabling for the past 20 years.

GeForce · 3:27min

CUDA made GPUs programmable first on the consumer product GeForce in 2006, which then enabled the deep learning community to test the viability of training neural networks and launched the new AI revolution.

DLSS 5 · 2:29min

Jensen shows a video featuring the new DLSS 5 capability, a neural rendering technology that fuses 3D graphics with AI to give more beautiful and detailed textures to videos. Details in the video triggered a backlash from game developers.

Accelerated Analytics (22min)

Structured Data is the Ground Truth of AI · 3:26min

Jensen says analytics are ripe for acceleration with the arrival of AI agents and emphasizes cuDF and cuVS as the foundation libraries powering the whole ecosystem.

IBM Reinvents Data Processing With NVIDIA · 18:10min

He announced partnerships with IBM for Watson-X, a major contributor to open source Presto C++ and a user of Spark over RAPIDS, NVIDIA's own accelerated dataframe libraries. Also announced were partnerships with Dell for an AI platform over RTX6000 servers, and with Google Cloud for its AI Hypercomputer. Jensen highlighted how NVIDIA's stack accelerates many of the CSPs' AI offerings, and he spent some time reviewing them for different cloud providers.

CUDA-X review and AI native companies (7min)

NVIDIA Foundational Technology Montage · 4:44min

Jensen does a quick review of the list of CUDA-X libraries and shows a video simulation of these libraries at work.

AI Natives · 2:46min

The number of AI native companies has exploded in the past year with $150B VC investments. They all need token compute that NVIDIA can provide.

AI Inference Inflection + Overview of datacenter efficiency (Tokens/Watt) vs interactivity (Tokens/s per user) across different tiers (22min)

Inference Inflection Arrives · 4:42min

Jensen highlights 3 key moments for AI inference over the past few years: in 2023, ChatGPT is released; in 2024, reasoning models such as o1 and o3 take off; and in 2025, the Claude Code agentic system revolutionizes software engineering.

"The inflection point for inference has arrived." · 1:40min

Agent thinking capabilities led to a 10,000x explosion in the amount of inference since ChatGPT was released. Coupled with a 100x increase in end-user demand, Jensen says we have 1,000,000x more inference demand since 2023. We are now at an inflection point for inference.

Inference Inflection Drives Strong Growth · 8:30min

Last year Jensen saw $500B in demand for Blackwell. This year, through 2027, he sees $1T in infrastructure investments on NVIDIA, mainly for inference. 60% of the business is for hyperscalers (some of it for internal use), and the other 40% covers everything else, such as regional or sovereign clouds, enterprise, and supercomputers. Also noted: GB + NVL72 + inference over FP4 for training, Dynamo, TensorRT, DGX Cloud.

NVIDIA Extreme Co-Design Revolutionized Token Cost · 3:57min

Datacenters are constrained by a fixed amount of power (Watts) available. Jensen emphasizes Tokens per Watt as the metric to maximize, and interactivity (tokens/s per user) as a use case differentiator.
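To make the two metrics concrete, here is a minimal sketch of how a fixed power budget turns into token output and served users. The function name and every number in it are hypothetical placeholders chosen for illustration; none of them are figures from the keynote.

```python
def factory_output(power_budget_w: float,
                   tokens_per_s_per_w: float,
                   tokens_per_s_per_user: float) -> tuple[float, float]:
    """Return (total tokens/s, concurrent users served) for a power-capped site."""
    total_tokens_per_s = power_budget_w * tokens_per_s_per_w
    users = total_tokens_per_s / tokens_per_s_per_user
    return total_tokens_per_s, users


# Hypothetical operating points for the same 1 MW site (illustrative values only).
POWER_W = 1_000_000
for label, efficiency, interactivity in [
    ("batch-oriented tier", 5.0, 20.0),   # more tokens per watt, slower per user
    ("interactive tier", 1.0, 200.0),     # fewer tokens per watt, faster per user
]:
    total, users = factory_output(POWER_W, efficiency, interactivity)
    print(f"{label}: {total:,.0f} tokens/s total, "
          f"{users:,.0f} users at {interactivity:.0f} tokens/s each")
```

Because the power budget is fixed, any gain in tokens per watt translates directly into more total tokens, while the per-user rate decides how that total is split across users.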

InferenceMAX King · 1:23min

Jensen shows how the GB300 NVL72 has improved on both efficiency and cost for inference and has been recognized by Semianalysis as the inference king!

NVIDIA is the Global Standard for AI Inference at Scale · 0:33min

Inference service providers should be seen as token factories. The output token rate from companies like Eigen AI, together.ai, Nebius, etc. has increased very fast, now reaching 400+ tokens/s for the Kimi K2.5 reasoning agent. Also see Artificial Analysis for a breakdown between providers.

AI Factories are the Industrial Infrastructure of the AI Era · 1:10min

Inference drives revenues and Token effectiveness is the most important metric.

Full Vera Rubin hardware stack: GPU, NVLink, Rubin Ultra, Spectrum-X, and Groq LPX + DSX platform for AI factory optimization (38min)

A Decade of AI Infrastructure Innovation: From DGX-1 to Vera Rubin · 3:30min

Jensen narrates NVIDIA's decade of data center infrastructure innovation:

Year	Milestone	Notes
2016	DGX-1	Packages 8 Pascal GPUs; first supercomputer built for deep learning; one delivered to OpenAI that year
2017	Volta	Introduces NVLink 2 switch, GPU-to-GPU interconnect inside nodes
2019	Mellanox acquisition	Allows the data center to become a single unit of computing
2020	Ampere / DGX A100 SuperPOD	Brings scale-up via NVLink 3, scale-out via ConnectX-6 InfiniBand
2022	Hopper	Supports FP8 Transformer Engine for Gen AI, NVLink 4, ConnectX-7
2024	Blackwell / NVL72	Achieves 130 TB/s bandwidth and a deeper rack-level co-design for top performance
2026	Vera Rubin	Built for agentic AI; 35× throughput/MW; 40M× cumulative compute over the decade

NVIDIA Vera Rubin · 2:27min

Jensen introduces the Vera Rubin hardware on stage

NVIDIA Vera Rubin, NVLink and Groq · 1:36min

He makes some interesting observations: with the recent tray designs, installation time falls from 2 days to 2 hours. Also, cooling is done with hot water at 45 degrees.

Spectrum-X Switch, Co-Packaged Optics, Vera and BlueField-4 · 2:09min

Jensen discusses the 8 Groq 3rd gen tray, which is in production, and shows the Spectrum co-packaged optics switch. Vera brings 2x performance per watt. ConnectX-9 and the storage platform are powered by the Vera CPU.

Rubin Ultra · 2:03min

Jensen also shows VR Ultra and the new Kyber rack that can connect 144 GPUs, which now slide vertically into the rack. He also shows the new NVLink tray design that sits behind, also vertically.

Inference Performance and Efficiency Drive Company Results · 9:35min

Uniting Processors of Extreme Performances · 3:36min

NVIDIA Groq 3 LPX · 0:38min

Jensen shows the Groq LPX, manufactured by Samsung, and says he expects it to ship by Q3 this year.

Announcing NVIDIA Launch Partners · 1:56min

Jensen shows all the AI labs, cloud providers, and OEM/ODMs that will launch Vera Rubin. He expects production in the 1000s per week. He also shows launch partners for the Vera CPU and BlueField storage systems.

NVIDIA Vera Rubin: 7 Chips – 5 Rack Systems · 1:02min

NVIDIA Extreme Co-Design Delivering X-Factors Every Year · 3:37min

Jensen shows the roadmap to 2028 with Feynman. Oberon will enable scale-up in both copper and optical to support NVL576 racks (Kyber) and then NVL1152 for Feynman with Kyber.

NVIDIA DSX AI Factory Platform · 2:10min

How AI Factories Maximize Tokens, Power, and Profit With NVIDIA DSX · 3:25min

The video summarizes all the components of the DSX AI factory platform

Space-1 Vera Rubin Module · 0:43min

Jensen briefly mentions NVIDIA's foray into space with the Space-1 Vera Rubin module and the challenge of cooling in space.

OpenClaw, NemoClaw, Open Model Coalition (19min)

NemoClaw for OpenClaw · 1:24min

Jensen is very excited about OpenClaw, which he describes as the most popular open source project in history and the fastest project to gather the most stars on GitHub.

OpenClaw: The ChatGPT Moment for Long-Running, Autonomous Agents · 9:14min

He shows how OpenClaw has grown to 340k stars on GitHub since the end of January 2026. He calls it the operating system of agents and says every enterprise will soon need an OpenClaw strategy.

NVIDIA Nemotron and Open Models · 0:28min

How NVIDIA Open Models Power Every Industry's AI · 4:17min

The video shows models from each of the NVIDIA families. They are world class, doing well on benchmarks. It shows nemotron-3-super-120b as #4 among the best open models for OpenClaw, and mentions Nemotron 3 Ultra.

Announcing Global AI Leaders Join NVIDIA Nemotron Coalition · 2:57min

Jensen announces the NVIDIA Nemotron Coalition³ aimed at accelerating the co-development of open AI frontier models with partners Black Forest Labs, Cursor, LangChain, Mistral AI, Perplexity, Reflection AI, Sarvam and Thinking Machines Lab

Announcing NVIDIA NemoClaw Reference OpenClaw · 0:39min

Robotics, Physical AI, & recap (14min)

Physical AI and Robotics · 3:11min

The Age of Physical AI and Robotics · 4:27min

This video shows how autonomous cars have been improving thanks to NVIDIA and its partner ecosystem.

Olaf Takes the Stage With Jensen Huang · 1:55min

Jensen welcomes the only guest at the keynote. Last year it was a Star Wars inspired robot, "Blue"; this year it is Olaf from Frozen.

Official Keynote Closing Video · 4:02min

Full keynote is available here and the slides here.

References

Semianalysis — Nvidia: The Inference Kingdom Expands — newsletter.semianalysis.com
NVIDIA — Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer — developer.nvidia.com
NVIDIA Launches Nemotron Coalition of Leading Global AI Labs to Advance Open Frontier Models — nvidianews.nvidia.com
NVIDIA DRIVE Hyperion Achieves Level 4 Autonomy with Uber Partnership — nvidianews.nvidia.com

GTC 2026 Keynote — Part 2: Intro, Analytics, CUDA-X & Inference

2026-04-03T07:00:00+00:00

This is Part 2 of a 3-part breakdown of the GTC 2026 keynote. Start with Part 1: Overview & Context or jump to Part 3: Vera Rubin Hardware, OpenClaw & Robotics. The single-page version is also available.

Previously in Part 1: I covered the conference’s atmosphere, shared a bit about the keynote’s energy, NVIDIA’s celebration of CUDA’s 20th anniversary and the flywheel it has created, and how the introduction of the new Groq rack expanded NVIDIA’s AI Factory POD, now a five-rack system combining the Groq LPX, BlueField-4, Vera CPU, and Spectrum-6 networking racks alongside the Vera Rubin GPU node.

Summary of Part 2 sections

Here’s a short breakdown of the first hour of the Keynote. For each section, I give how much time Jensen spent on it along with my impressions and summary notes. I also link directly into each section on the YouTube video.

Duration	Section	Summary
16 min	Intro, CUDA flywheel, Graphics improvements	Celebrating CUDA’s 20th anniversary and showing DLSS 5 graphics improvements
22 min	Accelerated Analytics	Emphasizing NVIDIA’s role in accelerating enterprise analytics and many of the CSPs’ AI offerings in the agentic era
7 min	CUDA-X review and AI native companies	Reviewing the library ecosystem that forms CUDA-X
22 min	AI Inference Inflection + Datacenter efficiency overview	Discussing the AI inference inflection point and how CEOs will be evaluating their agentic companies

Intro, CUDA flywheel, Graphics improvements (16min)

Tokens, the Building Blocks of AI · 3:15min

The keynote starts with an inspiring video describing how AI tokens are the main "commodity" produced by AI factories and their power to unlock new knowledge and possibilities

Welcome to GTC 2026 · 2:47min

20 Years of CUDA · 4:21min

Jensen reviews the flywheel that CUDA software has been enabling for the past 20 years.

GeForce · 3:27min

DLSS 5 · 2:29min

Accelerated Analytics (22min)

Structured Data is the Ground Truth of AI · 3:26min

Jensen says analytics are ripe for acceleration with the arrival of AI agents and emphasizes cuDF and cuVS as the foundation libraries powering the whole ecosystem.

IBM Reinvents Data Processing With NVIDIA · 18:10min

CUDA-X review and AI native companies (7min)

NVIDIA Foundational Technology Montage · 4:44min

Jensen does a quick review of the list of CUDA-X libraries and shows a video simulation of these libraries at work.

AI Natives · 2:46min

The number of AI native companies has exploded in the past year with $150B VC investments. They all need token compute that NVIDIA can provide.

AI Inference Inflection + Overview of datacenter efficiency (Tokens/Watt) vs interactivity (Tokens/s per user) across different tiers (22min)

Inference Inflection Arrives · 4:42min

"The inflection point for inference has arrived." · 1:40min

Inference Inflection Drives Strong Growth · 8:30min

NVIDIA Extreme Co-Design Revolutionized Token Cost · 3:57min

Datacenters are constrained by a fixed amount of power (Watts) available. Jensen emphasizes Tokens per Watt as the metric to maximize, and interactivity (tokens/s per user) as a use case differentiator.

InferenceMAX King · 1:23min

Jensen shows how the GB300 NVL72 has improved on both efficiency and cost for inference and has been recognized by Semianalysis as the inference king!

NVIDIA is the Global Standard for AI Inference at Scale · 0:33min

AI Factories are the Industrial Infrastructure of the AI Era · 1:10min

Inference drives revenues and Token effectiveness is the most important metric.

The first hour of the keynote established the foundations: CUDA’s flywheel, NVIDIA’s growing role in enterprise analytics, and the massive scale of the inference inflection. Part 3 shifts to the hardware itself where Jensen walks through the full Vera Rubin stack with Groq, then turns to what he called one of the most important open source moments in history.


GTC 2026 Keynote — Part 1: Overview & Context

2026-04-01T07:00:00+00:00

This is Part 1 of a 3-part breakdown of the GTC 2026 keynote. Jump to Part 2: Intro, Analytics, CUDA-X & Inference or Part 3: Vera Rubin Hardware, OpenClaw & Robotics. The single-page version is also available.

I was back this year for the 2026 edition of NVIDIA’s GTC conference held at the San Jose Convention Center and surroundings from March 16-19.

In this post, I will only cover the keynote and will delve into the sessions I attended and the exhibit hall in followup posts.

The Keynote

Jensen reveled in being crowned “inference king” by Semianalysis for the GB NVL72 system! Also check out their review¹ of the GTC conference.

CUDA is 20 years old

The Vera Rubin POD is expanding: Seven Chips, Five Rack-scale Systems

The VR NVL72 GPU node
The newly announced companion Groq LPU rack offloading part of the AI inference pass (decode)
BlueField-4 to store KV cache offloaded from the GPU memory
Vera CPU Rack for more general Agentic workloads and RL, and
the Spectrum-6 networking rack to connect the whole POD.

With the stage now set (the packed keynote venue, Jensen’s excitement about the expanding CUDA ecosystem, and the broad strokes of the new Vera Rubin POD), Part 2 dives into the actual keynote sections, starting with the CUDA anniversary, accelerated analytics, and Jensen’s case for the AI inference inflection point.


References

Semianalysis, Nvidia: The Inference Kingdom Expands, https://newsletter.semianalysis.com/p/nvidia-the-inference-kingdom-expands
NVIDIA, Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer, https://developer.nvidia.com/blog/nvidia-vera-rubin-pod-seven-chips-five-rack-scale-systems-one-ai-supercomputer/

GPU vs CPU for In-Memory Analytics: Bandwidth Holds as Compute and Cost Advantages Narrow Across Three Generations

2026-03-25T07:00:00+00:00

One of the central arguments for GPU-accelerated analytics is that GPU hardware is advancing faster than server CPUs. But for analytics workloads, the outcome depends on more than raw compute: memory bandwidth, memory capacity, cost, and power efficiency all matter. This post examines three generations of NVIDIA CPU-GPU superchips against the best contemporary AMD CPUs across three lenses: raw compute parity, a $1M bare-metal capital budget, and equal hourly spend on AWS cloud instances.

Scope: This analysis applies to in-memory analytics — workloads whose active dataset fits within the system’s fast memory tier (HBM for GPUs, DRAM for CPUs). Once a workload spills to storage or a slower memory tier, the bandwidth and capacity comparisons change fundamentally: GPU HBM bandwidth advantages disappear when the bottleneck shifts to PCIe, NVMe, or network I/O, and CPU DRAM’s larger capacity becomes a decisive structural advantage. The conclusions here do not generalize to disk-spilling or out-of-core workloads.

The findings are consistent across all three views. GPU memory bandwidth is the most durable advantage — it crossed above parity between the GH200 and GB200 generations and holds steady at 3.5–5.4× at equal spend, whether bare-metal or cloud. The compute and cost advantages that originally drove GPU adoption are compressing: GPU prices are rising faster than per-chip compute gains, and the Perf/W lead is narrowing in parallel. The capacity gap between cheap DDR DRAM and expensive HBM collapses dramatically at equal budget — from 51–91× at compute parity to 8–11× at equal spend.

Table of Contents

The CPU Baseline
GPU Superchip Specifications
Study #1: NVIDIA GPU vs AMD CPU at Compute Parity
Study #2: Isocost Analysis - Bare Metal: What Does $1M of GPU Buy vs $1M of CPU?
Study #3: Isocost Analysis - Cloud Instance: GPU vs CPU at Equal Hourly Spend on AWS
Conclusion
Reference Tables

Methodology Caveat: This analysis is intentionally simplified and scoped to in-memory workloads — datasets that fit within the fast memory tier of each system. Real platform evaluation spans many additional dimensions – workload mix, software maturity, interconnect topology, memory tiering behavior, cluster-level networking, availability, and total cost of ownership over time. The comparisons here use a compute-parity model plus explicit assumptions (especially for cost and power) to make directional trends easier to inspect, not to claim a universally optimal chipset choice.

TL;DR

GPU bandwidth crossed the parity threshold between GH200 and GB200 — from trailing the Genoa cluster (~0.56×) to leading it (~1.3×), then widening to ~2.1× with VR200. The inflection happens in one generation.
GPU bandwidth per dollar is the most stable metric across all three lenses — at equal bare-metal spend it holds at 4.3–5.4× across three generations; at equal cloud spend it holds at 3.5–3.9× across three AWS generations (Ampere through Blackwell). Unlike compute, it does not compress.
GPU compute and cost advantages are compressing — the FP32 advantage at equal spend falls from ~5.4× (GH200 vs Milan, $1M) to ~2.5× (VR200 vs Turin), and from ~9.5× (H100 vs Genoa, AWS) to ~5.2× (B200 vs Turin). The H100 generation was the peak: it delivered a higher compute advantage per dollar than either the A100 era (~4.5×) before it or the B200 era after it. GPU prices are rising faster than per-chip compute gains, and the Perf/W lead is narrowing for the same reason.
The capacity gap is structural but not fixed — CPU DRAM holds 51–91× more memory at compute parity, but that collapses to 8–11× at equal spend (both bare-metal and cloud). The difference is pricing, not technology: DDR is cheap per GB; HBM is not.
Neither side dominates across all axes: bandwidth now decisively favors GPUs; compute favors GPUs, though by a shrinking margin; capacity favors CPUs at any scale; cost and Perf/W advantages are narrowing. Real workloads still decide the winner.

This is a companion post to The Case for GPU-Accelerated Data Analytics.

The CPU Baseline

Server CPUs from Intel and AMD have seen real but incremental progress over the same period. AMD EPYC has been the more aggressive of the two: Turin (2024) nearly tripled memory bandwidth relative to Milan (2021), a ~2.8× gain, by upgrading from 8-channel DDR4-3200 (~205 GB/s) to 12-channel DDR5-6000 (~576 GB/s), while also tripling the core count to 192 (up from 64 in Milan).

In the tables below, FP32 numbers are theoretical peak using the widest SIMD available per generation, as a rough proxy for analytics workload compute. Actual sustained throughput varies with workload, instruction mix, and all-core clock. Peak FP32 is calculated without FMA-doubling to reflect typical analytics SQL, which rarely relies on fused multiply-add operations:

Peak FP32 = (Total SIMD units × (SIMD width ÷ 32) × Clock GHz) ÷ 1000

E.g., for Turin: 384 units × (256 / 32) × 2.4 GHz ÷ 1000 ≈ 7.4 TFLOPS. Most analytics operations are comparisons, aggregations, and reductions—not multiply-accumulate patterns. (FMA-doubling would apply to dense linear algebra or ML kernels, not analytics, so it would be misleading here.)
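The formula is easy to check in a few lines of Python. This is a minimal sketch: the function name is mine, the unit counts come from the AMD table in this section, and applying the 2.4 GHz clock from the Turin example to all three generations is an assumption.

```python
def peak_fp32_tflops(simd_units: int, simd_width_bits: int, clock_ghz: float) -> float:
    """Theoretical peak FP32 in TFLOPS, without FMA doubling."""
    lanes = simd_width_bits // 32            # FP32 lanes per SIMD unit
    gflops = simd_units * lanes * clock_ghz  # one FP32 op per lane per cycle
    return gflops / 1000


# Total 256-bit FP units per socket (2 per core), as listed in the AMD table.
# Assumption: the same 2.4 GHz clock is used for every generation.
amd_units = {"Milan": 128, "Genoa": 192, "Turin": 384}
for name, units in amd_units.items():
    print(f"{name}: ~{peak_fp32_tflops(units, 256, 2.4):.1f} TFLOPS")
```

Doubling the result would give the FMA-inclusive peak that vendor datasheets usually quote.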

AMD EPYC (per socket)

	Milan (3rd Gen, 2021)¹	Genoa (4th Gen, 2022)²	Turin (5th Gen, 2024)³
Max cores	64	96	192
Memory bandwidth	~205 GB/s (8-ch DDR4-3200)	~461 GB/s (12-ch DDR5-4800)	~576 GB/s (12-ch DDR5-6000)
Best SIMD	AVX2 (256-bit)	AVX-512 (512-bit)	AVX-512 (512-bit)
AVX FP units	2×256-bit/core (128 total)	2×256-bit/core fused→512-bit (192 total)	2×256-bit/core fused→512-bit (384 total)
Peak FP32 (best SIMD) at ~2.4 GHz	~2.5 TFLOPS	~3.7 TFLOPS	~7.4 TFLOPS

Intel’s Xeon gains tell a two-part story. Within the Xeon Scalable lineage, progress was incremental: Emerald Rapids (2024) lifted bandwidth only ~1.2× over Sapphire Rapids, from ~307 GB/s to ~358 GB/s (both 8-ch DDR5), while core count barely moved from 60 to 64 (+7%). The more significant step was Xeon 6 with Granite Rapids (also 2024), a new platform that doubled max cores to 128, pushed bandwidth to ~409 GB/s (8-ch DDR5-6400), and roughly doubled FP32 compute to ~10 TFLOPS. Still well below AMD’s Turin on bandwidth, but a meaningful inflection within Intel’s own trajectory.

Intel Xeon Platinum (per socket)

	Sapphire Rapids (4th Gen, 2023)⁴	Emerald Rapids (5th Gen, 2024)⁵	Xeon 6 / Granite Rapids (2024)⁶
Max cores	60	64	128
Memory bandwidth	~307 GB/s (8-ch DDR5-4800)	~358 GB/s (8-ch DDR5-5600)	~409 GB/s (8-ch DDR5-6400)
Best SIMD	AVX-512 (512-bit)	AVX-512 (512-bit)	AVX-512 (512-bit)
AVX FP units	2×512-bit/core (120 total)	2×512-bit/core (128 total)	2×512-bit/core (256 total)
Peak FP32 (best SIMD) at 2.5 GHz	~4.8 TFLOPS	~5.1 TFLOPS	~10 TFLOPS
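The memory bandwidth rows in both CPU tables follow from channel count and transfer rate. Here is a small sketch of that arithmetic; the function name is mine, and the 8 bytes per transfer (a 64-bit data bus per channel) is an assumption the post does not state.

```python
def ddr_bandwidth_gbs(channels: int, mt_per_s: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth per socket in GB/s."""
    return channels * mt_per_s * bytes_per_transfer / 1000


# (channels, MT/s) per socket, taken from the AMD and Intel tables.
sockets = {
    "Milan (8-ch DDR4-3200)": (8, 3200),
    "Genoa (12-ch DDR5-4800)": (12, 4800),
    "Turin (12-ch DDR5-6000)": (12, 6000),
    "Sapphire Rapids (8-ch DDR5-4800)": (8, 4800),
    "Emerald Rapids (8-ch DDR5-5600)": (8, 5600),
    "Granite Rapids (8-ch DDR5-6400)": (8, 6400),
}
for name, (channels, rate) in sockets.items():
    print(f"{name}: {ddr_bandwidth_gbs(channels, rate):.1f} GB/s")
```

These are theoretical peaks; sustained bandwidth in practice is lower and depends on the access pattern.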

GPU Superchip Specifications

Looking at NVIDIA’s flagship data-center CPU-GPU superchips across three recent generations, each roughly one to two years apart:

	GH200 (Grace Hopper)⁷	GB200 (Grace Blackwell)⁸	VR200 (Vera Rubin)⁹
Superchip Configuration	1x Grace CPU + 1x H200 GPU	1x Grace CPU + 2x B200 GPUs	1x Vera CPU + 2x R100 GPUs
GPU Device Memory (HBM)	144 GB HBM3e	384 GB HBM3e (192 GB × 2)	576 GB HBM4 (288 GB × 2)
CPU Host Memory (LPDDR5X)	480 GB	Up to 480 GB	Up to 1.5 TB
Total Unified Memory (Host + Device)	624 GB	Up to 864 GB	Up to 2.1 TB
GPU Memory Bandwidth	4.9 TB/s	16 TB/s (8 TB/s × 2)	44 TB/s (22 TB/s × 2)
FP32 Compute	67 TFLOPS	150 TFLOPS	260 TFLOPS
CPU-to-GPU Interconnect	NVLink-C2C (900 GB/s)	NVLink-C2C (900 GB/s)	NVLink-C2C (1.8 TB/s)

Summary:

Memory bandwidth has grown roughly 9× across three superchip generations: from 4.9 TB/s on the GH200 to 44 TB/s on the VR200.
HBM capacity has grown 4× over the same span, from 144 GB to 576 GB.
The NVLink-C2C architecture further extends this by exposing unified memory that spans both HBM and LPDDR5X — the VR200 makes up to 2.1 TB (576 GB HBM4 + 1.5 TB LPDDR5X) accessible to the GPU.
That said, HBM remains roughly an order of magnitude more expensive per gigabyte than DDR5, and for workloads that spill beyond the fast HBM tier, performance falls back on the lower LPDDR5X bandwidth.

Study #1: NVIDIA GPU vs AMD CPU at Compute Parity

The raw compute gap is stark, and growing with each GPU generation, but compute alone does not determine analytics outcomes. The table below shows, for each superchip, the best contemporary AMD CPU, how many sockets are needed to match the GPU’s FP32 throughput, and how that compute-equivalent cluster compares on bandwidth, capacity, cost, and power efficiency.

GPU Superchip	GH200 (Grace Hopper)	GB200 (Grace Blackwell)	VR200 (Vera Rubin)
Top AMD CPU	EPYC 9654 (Genoa, Zen 4)²	EPYC 9965 (Turin, Zen 5)³	EPYC 9965 (Turin, Zen 5)³
Cores / socket	96	192	192
CPU bandwidth / socket	~461 GB/s (12-ch DDR5-4800)	~576 GB/s (12-ch DDR5-6000)	~576 GB/s (12-ch DDR5-6000)
AMD FP32 / socket	~3.7 TFLOPS	~7.4 TFLOPS	~7.4 TFLOPS
GPU FP32	67 TFLOPS	150 TFLOPS	260 TFLOPS
Sockets for FP32 parity with GPU	~19 sockets (~10 nodes)	~21 sockets (~11 nodes)	~36 sockets (~18 nodes)

CPU cluster bandwidth	~8.8 TB/s	~12.1 TB/s	~20.7 TB/s
GPU HBM bandwidth	4.9 TB/s	16 TB/s	44 TB/s
GPU vs CPU bandwidth (higher is better for GPU)	🔴 CPU ~1.8× ahead	🟢 GPU ~1.3× ahead	🟢🟢 GPU ~2.1× ahead

CPU cluster DRAM	~57 TB	~63 TB	~108 TB
GPU total memory	624 GB	864 GB	2.1 TB
CPU vs GPU capacity (lower is better for GPU)	🔴🔴 CPU ~91× more	🔴 CPU ~73× more	🟡 CPU ~51× more (gap narrowing)

Est. GPU single-chip cost	~$34k - $44k	~$80k - $95k	~$153k - $222k
Est. CPU cost	~$0.48M - $0.76M	~$0.53M - $0.84M	~$0.90M - $1.44M
CPU vs GPU cost ratio (higher is better for GPU)	🟢🟢 ~15.9× CPU	🟢 ~8.0× CPU	🟢 ~6.8× CPU

GPU Power for Parity	~1.0 kW (GH200)	~2.7 kW (GB200)	~5.0 kW (VR200)
CPU System Power	~6.8 kW	~10.5 kW	~18.0 kW
GPU vs CPU Perf/W efficiency	🟢🟢 ~6.8×	🟢 ~3.9×	🟢 ~3.6×

Inter-node (shuffle) bandwidth is not shown here because NVL32/NVL72 are rack-scale products; individual superchips are not sold as standalone IB nodes. Within the rack, all superchip-to-superchip traffic flows over NVLink/NVSwitch at very high bandwidth; InfiniBand only exits at the rack boundary. Inter-node comparisons are covered in Study #2 and Study #3, where rack-level deployment makes the unit of comparison clearer.
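The socket, bandwidth, and capacity rows of the parity table can be reproduced with a short sketch. The function name is mine, and two inputs are inferred from the table rather than stated in it: 3 TB of DRAM per socket (57 TB across 19 sockets) and dual-socket nodes (19 sockets in about 10 nodes).

```python
import math


def parity_cluster(gpu_fp32, gpu_bw_tbs, gpu_mem_tb,
                   cpu_fp32, cpu_bw_gbs, dram_tb_per_socket=3.0):
    """Size a CPU cluster to match one superchip's FP32, then compare memory metrics."""
    sockets = math.ceil(gpu_fp32 / cpu_fp32)
    cluster_bw_tbs = sockets * cpu_bw_gbs / 1000
    cluster_dram_tb = sockets * dram_tb_per_socket
    return {
        "sockets": sockets,
        "nodes": math.ceil(sockets / 2),  # assumes dual-socket nodes
        "gpu_bw_over_cpu": gpu_bw_tbs / cluster_bw_tbs,
        "cpu_capacity_over_gpu": cluster_dram_tb / gpu_mem_tb,
    }


# Inputs from the tables: GPU FP32 (TFLOPS), GPU bandwidth (TB/s), GPU total
# memory (TB), CPU FP32 per socket (TFLOPS), CPU bandwidth per socket (GB/s).
cases = {
    "GH200 vs Genoa": (67, 4.9, 0.624, 3.7, 461),
    "GB200 vs Turin": (150, 16, 0.864, 7.4, 576),
    "VR200 vs Turin": (260, 44, 2.1, 7.4, 576),
}
for name, args in cases.items():
    r = parity_cluster(*args)
    print(f"{name}: {r['sockets']} sockets ({r['nodes']} nodes), "
          f"GPU/CPU bandwidth {r['gpu_bw_over_cpu']:.2f}x, "
          f"CPU/GPU capacity {r['cpu_capacity_over_gpu']:.0f}x")
```

A bandwidth ratio below 1 means the CPU cluster is ahead, which is the GH200 case.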

Cost assumptions use a platform-normalized method (cost per platform ÷ superchips per platform) with current market prices for full-rack CAPEX:

GH200 (Hopper) Platform: $1.1M – $1.4M (NVL32 Rack). At 32 superchips, this yields ~$34k – $44k per equivalent.
GB200 (Blackwell) Platform: $2.9M – $3.4M (NVL72 Rack). At 36 superchips, this yields ~$80k – $95k per equivalent.
VR200 (Rubin) Platform: $5.5M – $8.0M (NVL72 Rack). At 36 superchips, this yields ~$153k – $222k per equivalent.
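The platform-normalized figures above are a simple division of the rack price range by the superchip count. A quick sketch, with the function name being mine and the midpoint being the value reused later in the isocost study:

```python
def per_superchip(rack_low: float, rack_high: float, superchips: int):
    """Return (low, high, midpoint) cost per superchip for a rack price range."""
    low = rack_low / superchips
    high = rack_high / superchips
    midpoint = (rack_low + rack_high) / 2 / superchips
    return low, high, midpoint


# Rack price ranges (USD) and superchip counts from the list above.
racks = {
    "GH200 (NVL32)": (1.1e6, 1.4e6, 32),
    "GB200 (NVL72)": (2.9e6, 3.4e6, 36),
    "VR200 (NVL72)": (5.5e6, 8.0e6, 36),
}
for name, (low_price, high_price, count) in racks.items():
    low, high, mid = per_superchip(low_price, high_price, count)
    print(f"{name}: ${low / 1000:.1f}k to ${high / 1000:.1f}k, midpoint ${mid / 1000:.1f}k")
```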

The Gap: Bandwidth, Capacity, Cost, and Perf/W at FP32 Parity (log scale)

CPU cluster sized to match GPU FP32 compute at each generation. Trends Analysis: Bandwidth rising = GPU gaining. Capacity falling = CPU advantage shrinking. Cost falling = GPU's cost advantage over CPU shrinking (bad for GPU). Perf/W falling = GPU's power efficiency lead over CPU shrinking (bad for GPU).

Two narratives emerge from the data, one where GPUs are clearly gaining ground, and one where their traditional advantages are quietly eroding.

1) GPUs are closing the gap on Memory Metrics

On Memory Bandwidth, the inflection point where GPU begins to outrun a compute-equivalent CPU cluster falls between the GH200 and GB200 generations. In the GH200 era, the Genoa cluster is actually ahead (~1.8× CPU advantage). By GB200, the GPU moves ahead (~1.3×). By VR200, the GPU lead widens further (~2.1×).
On Memory Capacity, the direction is also positive for GPU: DDR-based clusters remain ~51–91× ahead on raw memory footprint, but the ratio is declining each generation. DDR is roughly an order of magnitude cheaper per gigabyte than HBM, so this gap won’t close quickly, but it is shrinking. For workloads that spill beyond the HBM tier, the GPU must fall back to LPDDR5X unified memory or GPUDirect Storage; spill support is therefore a required capability for any GPU database aiming to compete at scale.

2) GPU’s traditional advantages are narrowing (negative for GPU) on cost and energy efficiency

On Cost at FP32 Parity, the CPU-to-GPU cost ratio trends down from ~15.9× (GH200) to ~6.8× (VR200). A falling cost line means each successive GPU generation requires a larger single-chip capital outlay to deliver the same parity compute, eroding the hardware cost advantage that originally made GPU deployments attractive. These cost figures are the most assumption-sensitive inputs in the post.
On Power Efficiency, the GPU’s Perf/W lead is also shrinking — from nearly 6.8× over a Genoa cluster to ~3.6× over a Turin cluster. NVIDIA is pushing the thermal limits of silicon — the VR200 superchip draws ~5 kW total, versus ~1 kW for the GH200 — to extract raw performance, while server CPUs have maintained a more conservative power envelope. The absolute efficiency advantage remains meaningful, but the trend is moving in the wrong direction for GPU advocates.

Study #2: Isocost Analysis - Bare Metal: What Does $1M of GPU Buy vs $1M of CPU?

The compute-parity table above asks: how many CPUs does it take to match one GPU in raw FLOPS? An isocost analysis flips the question: for a fixed $1M capital budget, how many GPU superchips vs CPU sockets can you buy, and what do you get?

$1M is a meaningful procurement anchor: it buys nearly a full Hopper NVL32 rack worth of GH200s, a partial Blackwell rack of GB200s, or about five Vera Rubin superchips. On the CPU side, $1M buys a sizable compute cluster: 125 Milan sockets (~62 nodes) or 71 Turin sockets (~36 nodes). This budget is large enough that multi-chip GPU NVLink effects start to matter, and realistic enough to represent a real infrastructure decision.

CPU socket prices are estimated market rates for the highest-core-count SKU at each generation — EPYC 7763 (Milan) at approximately $8k/socket and EPYC 9965 (Turin) at approximately $14k/socket. These are chip-level prices and do not include platform, memory, or networking, consistent with comparing silicon to silicon. GPU costs use the same rack-normalized per-superchip midpoints from the parity section above.

GPU	GH200	GB200	VR200
CPU	Milan	Turin	Turin
GPU price / superchip	~$39k	~$87.5k	~$187.5k
CPU price / socket	~$8k (Milan)	~$14k (Turin)	~$14k (Turin)
$1M GPU fleet	~25 GH200 superchips	~11 GB200 superchips	~5 VR200 superchips
$1M CPU fleet	~125 Milan sockets (~62 nodes)	~71 Turin sockets (~36 nodes)	~71 Turin sockets (~36 nodes)
GPU FP32 ($1M fleet)	~1,675 TFLOPS	~1,650 TFLOPS	~1,300 TFLOPS
CPU FP32 ($1M fleet)	~313 TFLOPS	~526 TFLOPS	~526 TFLOPS
GPU FP32 advantage	🟢🟢 ~5.4×	🟢 ~3.1×	🟢 ~2.5×

GPU HBM bandwidth (intra-node, $1M fleet)	~122.5 TB/s	~176 TB/s	~220 TB/s
CPU DDR bandwidth (intra-node, $1M fleet)	~25.6 TB/s	~40.9 TB/s	~40.9 TB/s
GPU HBM BW advantage (intra-node)	🟢🟢 ~4.8×	🟢🟢 ~4.3×	🟢🟢 ~5.4×

GPU inter-node BW (shuffle, per node)	400 Gbps / 50 GB/s (IB NDR)	400 Gbps / 50 GB/s (IB NDR)	1,600 Gbps / 200 GB/s (IB XDR, est.)
CPU inter-node BW (shuffle, per node)	200 Gbps / 25 GB/s (IB HDR)	400 Gbps / 50 GB/s (IB NDR)	400 Gbps / 50 GB/s (IB NDR)
GPU/CPU inter-node advantage (per node)	🟢 ~2×	🟡 ~1× (NDR parity)	🟢🟢 ~4× (XDR, est.)

GPU total memory ($1M fleet)	~15.6 TB	~9.5 TB	~10.5 TB
CPU DRAM ($1M fleet)	~125 TB	~106.5 TB	~106.5 TB
CPU capacity advantage	🔴 CPU ~8×	🔴 CPU ~11.2×	🔴 CPU ~10.1×

GPU fleet power	~25 kW	~29.7 kW	~25 kW
CPU fleet power	~45.5 kW	~46.2 kW	~46.2 kW
GPU Perf/W advantage	🟢🟢 ~9.7×	🟢🟢 ~4.9×	🟢🟢 ~4.6×

CPU bandwidth: sockets × per-socket bandwidth. CPU DRAM: 64 GB DIMMs, 2 per channel — ~1 TB/socket for Milan (8-ch DDR4), ~1.5 TB/socket for Turin (12-ch DDR5). CPU power adds ~30% platform overhead to socket TDP (Milan 280W, Turin 500W). GPU fleet sizes are partial racks: 25 GH200s ≈ 78% of an NVL32; 11 GB200s ≈ 30% of an NVL72; 5 VR200s ≈ 14% of an NVL72.
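The fleet sizes in the table follow from simple floor division of the budget by the per-unit price estimates. The sketch below is one way to reproduce them in Python. The function names and dictionary layout are illustrative assumptions, not part of the original analysis, and the rack size of 36 superchips per NVL72 is inferred from the 30% and 14% figures in the note above.

```python
# Sketch: rebuild the $1M fleet sizes from the per-unit price estimates above.
BUDGET_USD = 1_000_000
PLATFORM_OVERHEAD = 1.30  # socket TDP plus ~30% platform overhead

# Per-socket inputs as quoted in the text (estimates, not vendor list prices).
CPU_SPECS = {
    "Milan": {"price_usd": 8_000, "dram_tb": 1.0, "tdp_w": 280},
    "Turin": {"price_usd": 14_000, "dram_tb": 1.5, "tdp_w": 500},
}

# Rack-normalized per-superchip price midpoints and superchips per full rack.
# rack_size for NVL72 (36) is an assumption inferred from the stated percentages.
GPU_SPECS = {
    "GH200": {"price_usd": 39_000, "rack_size": 32},
    "GB200": {"price_usd": 87_500, "rack_size": 36},
    "VR200": {"price_usd": 187_500, "rack_size": 36},
}


def cpu_fleet(name, budget_usd=BUDGET_USD):
    """Size a CPU fleet: whole sockets only; DRAM and power scale per socket."""
    spec = CPU_SPECS[name]
    sockets = budget_usd // spec["price_usd"]
    return {
        "sockets": sockets,
        "dram_tb": sockets * spec["dram_tb"],
        "power_kw": sockets * spec["tdp_w"] * PLATFORM_OVERHEAD / 1000,
    }


def gpu_fleet(name, budget_usd=BUDGET_USD):
    """Size a GPU fleet and express it as a fraction of one full rack."""
    spec = GPU_SPECS[name]
    chips = budget_usd // spec["price_usd"]
    return {"superchips": chips, "rack_fraction": chips / spec["rack_size"]}


for name in CPU_SPECS:
    print(name, cpu_fleet(name))
for name in GPU_SPECS:
    print(name, gpu_fleet(name))
```

Changing `BUDGET_USD` or any price estimate shows how sensitive the fleet sizes are to the cost inputs, which are the least certain part of the comparison.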

Isocost Comparison: GPU vs CPU Metrics at $1M Equal Spend (log scale)

$1M deployed into GPU superchips vs CPU sockets at each generation. FP32, Bandwidth, and Perf/W show the GPU fleet's advantage multiplier over the CPU fleet. Capacity shows the CPU's DRAM advantage over the GPU fleet's total memory.

Five patterns emerge from the $1M isocost view:

GPU compute advantage is meaningful but compresses as GPU prices rise. A $1M GH200 fleet delivers 5.4× more FP32 than a $1M Milan cluster. By VR200, that lead falls to 2.5× — not because GPU compute scaled down, but because $1M buys far fewer Vera Rubin superchips (5) than GH200s (25). This is a direct effect of GPU price inflation per generation outpacing the compute-per-chip gains.

GPU bandwidth advantage is the most stable metric: 4.3–5.4× across all three generations. Unlike the compute ratio, bandwidth per dollar holds remarkably steady. Even when buying fewer chips, each VR200 contributes 44 TB/s, which keeps the fleet aggregate well ahead of the CPU cluster. This is the GPU’s most durable advantage at equal budget: memory bandwidth per dollar has not eroded the way compute per dollar has.

The CPU capacity advantage is real but much smaller than the parity view suggests, and it worsens for GPU in the GB200 generation. At parity, CPU clusters hold 51–91× more DRAM. At $1M, that collapses to 8–11×. However, the ratio worsens for GPU going from GH200 to GB200: $1M buys more than twice as many GH200s (25 chips, 15.6 TB total memory) as GB200s (11 chips, 9.5 TB), because a GB200 costs ~2.2× more per chip but carries only ~1.4× the memory (864 GB vs 624 GB). The VR200 partially recovers (10.5 TB) thanks to its larger per-chip memory (~2.1 TB). As rack-scale NVLink pooling becomes the default deployment model, the effective addressable GPU memory pool expands beyond what these single-fleet numbers reflect.

GPU Perf/W advantage is large but narrowing. The ~9.7× lead at the GH200 generation falls to ~4.6× by VR200, consistent with the parity trend, and reflects that GPU silicon extracts substantially more analytics-relevant FP32 output per watt than CPU silicon at this budget. Notably, the GPU and CPU fleets draw comparable absolute power at $1M (~25–30 kW GPU vs ~45–46 kW CPU, a factor of ~1.6–1.8×), so the Perf/W ratio is primarily a statement about performance density, not a dramatic difference in total energy draw.

Inter-node (shuffle) bandwidth: per-node advantage recovers at VR200, but the CPU fleet still wins total aggregate egress at $1M. Bare-metal GPU and CPU clusters connect over InfiniBand — HDR 200 Gbps (25 GB/s) for Milan era, NDR 400 Gbps (50 GB/s) for Genoa, Turin, and GB200. The GH200 generation carries a 2× per-node advantage over a Milan cluster (NDR vs HDR). By GB200, both GPU and CPU nodes sit on NDR 400 Gbps — 1× per-node parity. The VR200, estimated to ship with ConnectX-9 (XDR), breaks this parity at 1,600 Gbps (200 GB/s) — a ~4× per-node advantage over Turin’s 400 Gbps (50 GB/s) NDR. At equal $1M spend, the CPU fleet’s larger node count still dominates total aggregate shuffle egress: 36 Turin nodes × 400 Gbps = 14.4 Tbps vs 11 GB200 nodes × 400 Gbps = 4.4 Tbps — a ~3.3× CPU aggregate advantage in the GB200 era. For VR200, the per-node XDR lead narrows the aggregate gap significantly: 5 VR200 nodes × 1,600 Gbps = 8 Tbps vs 36 Turin nodes × 400 Gbps = 14.4 Tbps — ~1.8× CPU aggregate advantage. For workloads dominated by cross-node data movement (large hash joins, high-cardinality group-by across partitions), the CPU fleet at bare-metal $1M scale retains the aggregate shuffle throughput edge, though VR200 narrows the gap significantly.
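The Perf/W and aggregate shuffle figures in the patterns above are both simple ratios of fleet totals. As a minimal sketch (the function names are hypothetical, and each superchip is treated as one network node, as the text does), they can be recomputed from the table values like this:

```python
# Sketch: two ratios from the $1M view, using fleet totals from the table above.

def perf_per_watt_advantage(gpu_tflops, gpu_kw, cpu_tflops, cpu_kw):
    """GPU fleet TFLOPS per kW divided by CPU fleet TFLOPS per kW."""
    return (gpu_tflops / gpu_kw) / (cpu_tflops / cpu_kw)


def aggregate_egress_tbps(nodes, gbps_per_node):
    """Total shuffle egress of a fleet: node count times per-node link rate."""
    return nodes * gbps_per_node / 1000


# (GPU TFLOPS, GPU kW, CPU TFLOPS, CPU kW) per generation pairing.
POWER_ROWS = {
    "GH200 vs Milan": (1675, 25.0, 313, 45.5),
    "GB200 vs Turin": (1650, 29.7, 526, 46.2),
    "VR200 vs Turin": (1300, 25.0, 526, 46.2),
}
for pairing, row in POWER_ROWS.items():
    print(f"{pairing}: Perf/W advantage ~{perf_per_watt_advantage(*row):.1f}x")

# (GPU nodes, GPU Gbps/node, CPU nodes, CPU Gbps/node).
SHUFFLE_ROWS = {
    "GB200 vs Turin": (11, 400, 36, 400),
    "VR200 vs Turin": (5, 1600, 36, 400),
}
for pairing, (g_nodes, g_gbps, c_nodes, c_gbps) in SHUFFLE_ROWS.items():
    gpu = aggregate_egress_tbps(g_nodes, g_gbps)
    cpu = aggregate_egress_tbps(c_nodes, c_gbps)
    print(f"{pairing}: GPU {gpu:.1f} Tbps, CPU {cpu:.1f} Tbps, "
          f"CPU aggregate lead ~{cpu / gpu:.1f}x")
```

The per-node and aggregate views can point in opposite directions because the aggregate depends on how many nodes the budget buys, not only on the link rate of each node.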

Study #3: Isocost Analysis - Cloud Instance: GPU vs CPU at Equal Hourly Spend on AWS

The capital budget analysis above captures bare-metal procurement economics. Cloud deployments shift this to an operational model — pay by the hour, no upfront commitment, scale up or down. This section uses AWS on-demand Linux pricing from Vantage (April 2026) to ask the same isocost question with hourly rates.

Cloud pricing caveat: On-demand AWS rates are the most widely published and comparable benchmark, but they are not the cheapest option. GPU-specialized clouds — CoreWeave, Lambda Labs, Crusoe, and others — typically offer H100 capacity at $2.49–2.89/hr per GPU (~$20–23/hr for an 8-GPU node), roughly 60% below the AWS p5.48xlarge rate of $55.04/hr. GCP and Azure on-demand rates for equivalent instances are broadly similar to AWS. All three advantage ratios in this section (FP32, bandwidth, capacity) are sensitive to pricing: a cheaper GPU cloud means more GPU instances per $1k/hr, which shifts all ratios in the GPU’s favor. The AWS numbers here should be read as a specific pricing scenario, not a hardware-fundamental result.

A key property of cloud GPU instances: the instance price already includes the host CPU. The p4d.24xlarge bundles 8× A100 GPUs with an Intel Xeon Platinum host; the p5.48xlarge bundles 8× H100 GPUs with an AMD EPYC host; the p6-b200.48xlarge bundles 8× B200 GPUs with an Intel Xeon Emerald Rapids host. This is equivalent to the superchip pricing model — you pay for the full compute node, GPU and CPU together. AWS also offers p6e-gb200.36xlarge — 36 native Grace-Blackwell superchips — but on-demand pricing is not yet published for that instance.

At $1,000/hour on-demand:

CLOUD GPU	A100	H100	B200
CLOUD CPU	Milan	Genoa	Turin
GPU AWS Instances	p4d.24xlarge¹⁰	p5.48xlarge¹¹	p6-b200.48xlarge¹²
GPU $/hr per instance	$21.96	$55.04	$113.93
CPU AWS Instances	hpc6a¹³	hpc7a¹⁴	hpc8a¹⁵
CPU $/hr per instance	$2.88	$7.20	$7.92
Instances at $1k/hr	45 GPU / 347 CPU	18 GPU / 138 CPU	8 GPU / 126 CPU
Total GPUs / CPU cores	360× A100 / ~33,300 Milan cores	144× H100 / ~26,500 Genoa cores	64× B200 / ~24,200 Turin cores
GPU vs CPU instance price ratio	🔴 7.6× more per GPU node	🔴 7.6× more per GPU node	🔴🔴 14.4× more per GPU node

GPU FP32 ($1k/hr fleet)	~7,020 TFLOPS	~9,650 TFLOPS	~4,800 TFLOPS
CPU FP32 ($1k/hr fleet)	~1,570 TFLOPS	~1,020 TFLOPS	~930 TFLOPS
GPU FP32 advantage	🟢🟢 ~4.5×	🟢🟢 ~9.5×	🟢🟢 ~5.2×

GPU HBM bandwidth (intra-node)	~560 TB/s (360 × 1.555 TB/s)	~483 TB/s (144 × 3.35 TB/s)	~512 TB/s (64 × 8 TB/s)
CPU mem bandwidth (intra-node)	~142 TB/s (347 × 410 GB/s)	~127 TB/s (138 × 922 GB/s)	~145 TB/s (126 × 1,152 GB/s)
GPU HBM BW advantage (intra-node)	🟢🟢 ~3.9×	🟢🟢 ~3.8×	🟢🟢 ~3.5×

GPU inter-node BW (shuffle, per node)	400 Gbps / 50 GB/s EFA	3,200 Gbps / 400 GB/s EFA	3,200 Gbps / 400 GB/s EFA
CPU inter-node BW (shuffle, per node)	100 Gbps / 12.5 GB/s EFA	300 Gbps / 37.5 GB/s EFA	300 Gbps / 37.5 GB/s EFA
GPU inter-node advantage (per node)	🟢 ~4×	🟢🟢 ~11×	🟢🟢 ~11×

GPU HBM capacity	~14.4 TB (45 × 320 GB)	~11.5 TB (18 × 640 GB)	~11.5 TB (8 × 1,440 GB)
CPU DRAM capacity	~133 TB (347 × 384 GiB)	~106 TB (138 × 768 GiB)	~97 TB (126 × 768 GiB)
CPU capacity advantage	🔴 CPU ~9×	🔴 CPU ~9×	🔴 CPU ~8×

FP32 estimates follow the same method as the parity section: SIMD units × (SIMD width ÷ 32) × all-core GHz. GPU FP32 uses ~19.5 TFLOPS per A100 SXM4 (NVIDIA published non-sparse FP32 peak), ~67 TFLOPS per H100, and ~75 TFLOPS per B200 (the GB200 superchip = 1 Grace CPU + 2× B200 GPUs = 150 TFLOPS total, so 75 TFLOPS per B200). CPU FP32 uses ~2.3 TFLOPS per 48-core Milan socket (EPYC 7R13: 96 SIMD units × 8 × 2.95 GHz / 1000), ~3.7 TFLOPS per 96-core Genoa socket, and ~3.7–4.0 TFLOPS per 96-core Turin socket at sustained all-core clock. CPU memory bandwidth uses per-socket figures (205 GB/s Milan, 461 GB/s Genoa, 576 GB/s Turin) multiplied by 2 sockets per instance (hpc6a/hpc7a/hpc8a are all 2-socket nodes), giving 410 GB/s, 922 GB/s, and 1,152 GB/s per instance respectively. GPU bandwidth specs are from NVIDIA datasheets; CPU bandwidth specs are from AMD datasheets. Prices are from Vantage.
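The FP32 method described above can be written out in a few lines. This is a sketch under the stated inputs only: the function names are hypothetical, the 256-bit SIMD width for Milan is inferred from the "× 8" factor in the formula, and the Turin per-socket value uses the low end of the quoted 3.7–4.0 TFLOPS range.

```python
# Sketch: per-socket FP32 estimate and the resulting $1k/hr fleet advantage.
BUDGET_PER_HR = 1000.0


def cpu_fp32_tflops(simd_units, simd_width_bits, all_core_ghz):
    """SIMD units x FP32 lanes per unit x all-core clock, in TFLOPS."""
    lanes = simd_width_bits / 32
    return simd_units * lanes * all_core_ghz / 1000  # GFLOPS -> TFLOPS


def fleet_size(price_per_hr, budget=BUDGET_PER_HR):
    """Whole instances affordable at the hourly budget."""
    return int(budget // price_per_hr)


def fp32_advantage(gpu_price, gpu_tflops, cpu_price, socket_tflops):
    """GPU fleet FP32 divided by CPU fleet FP32 at equal hourly spend."""
    gpu_total = fleet_size(gpu_price) * 8 * gpu_tflops     # 8 GPUs per instance
    cpu_total = fleet_size(cpu_price) * 2 * socket_tflops  # 2 sockets per instance
    return gpu_total / cpu_total


# 48-core Milan (EPYC 7R13): 96 SIMD units, 256-bit (assumed), 2.95 GHz.
milan_socket = cpu_fp32_tflops(96, 256, 2.95)

# era: (GPU $/hr, TFLOPS per GPU, CPU $/hr, TFLOPS per CPU socket)
ERAS = {
    "A100 vs Milan": (21.96, 19.5, 2.88, milan_socket),
    "H100 vs Genoa": (55.04, 67.0, 7.20, 3.7),
    "B200 vs Turin": (113.93, 75.0, 7.92, 3.7),
}

print(f"Milan socket: ~{milan_socket:.2f} TFLOPS")
for era, row in ERAS.items():
    print(f"{era}: GPU FP32 advantage ~{fp32_advantage(*row):.2f}x")
```

The results agree with the table to within rounding, since the table rounds the fleet totals before dividing.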

Cloud Isocost: GPU vs CPU Metrics at $1k/hr Equal Spend on AWS (log scale)

AWS on-demand Linux pricing, April 2026. FP32 and BW show the GPU fleet's advantage multiplier over the CPU fleet at equal hourly spend. Capacity shows the CPU DRAM advantage over the GPU HBM fleet.

Four observations from the cloud view:

The compute advantage is large but generation-sensitive. At equal hourly spend, 18 H100 instances outcompute 138 Genoa nodes by ~9.5×. By the Blackwell generation, that lead narrows to ~5.2×: the B200 is more powerful per GPU, but $1,000/hr buys only 8 p6-b200 instances versus 126 hpc8a nodes — because the B200 instance price (2.1× the H100 instance price) has risen faster than the per-GPU compute improvement (~1.1×).

GPU bandwidth advantage is stable across all three cloud generations (~3.5–3.9×). From A100/Milan (3.9×) to H100/Genoa (3.8×) to B200/Turin (3.5×), total memory bandwidth per dollar holds remarkably steady even as the compute ratio swings from 4.5× to 9.5× and back to 5.2×. Bandwidth is the GPU’s most durable and predictable cloud advantage.

The cloud capacity gap (8–9× on AWS) is comparable to the bare-metal view (8–11×). This convergence reflects a similar GPU-to-CPU price ratio across both procurement models — on AWS, GPU instances cost roughly 7–14× more per node than CPU HPC instances, broadly in the same range as the bare-metal silicon cost ratios. The specific numbers will shift on cheaper clouds: at CoreWeave H100 pricing (~$20/hr per node), the same $1k/hr buys ~50 GPU nodes instead of 18, compressing the CPU capacity advantage to roughly 3×.

On AWS, GPU instances carry a decisive per-node inter-node bandwidth advantage that extends the GPU lead from intra-node memory to inter-node shuffle. The H100 (p5.48xlarge) and B200 (p6-b200.48xlarge) instances both carry 3,200 Gbps (400 GB/s) EFA — roughly 11× more per-node inter-node bandwidth than the 300 Gbps (37.5 GB/s) EFA on hpc7a (Genoa) and hpc8a (Turin). For analytics workloads with significant data movement between nodes (hash joins, group-by on high-cardinality keys), the GPU fleet’s per-node network bandwidth means it is also faster at the shuffle phase, unlike the bare-metal case where both GPU and CPU nodes share NDR InfiniBand. The A100 era is more modest: p4d.24xlarge has 400 Gbps (50 GB/s) EFA vs hpc6a’s 100 Gbps (12.5 GB/s) — a 4× per-node advantage. At $1k/hr, the H100 fleet’s total shuffle capacity (18 × 3,200 Gbps = 57.6 Tbps) also exceeds the Genoa fleet (138 × 300 Gbps = 41.4 Tbps); for the A100 and B200 eras the CPU fleet regains a total aggregate lead through node count, but the GPU maintains a 4–11× per-node inter-node advantage throughout all three cloud generations.
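Two of the observations above depend directly on how many instances the hourly budget buys: the capacity ratio under cheaper GPU pricing, and which fleet leads on total shuffle capacity. The sketch below recomputes both. The function names are hypothetical, the $20/hr node price is the approximate specialized-cloud figure quoted earlier rather than a published rate, and 768 GiB is treated as 768 GB for simplicity.

```python
# Sketch: pricing sensitivity and aggregate shuffle at $1k/hr.
BUDGET_PER_HR = 1000.0


def instances(price_per_hr, budget=BUDGET_PER_HR):
    """Whole instances affordable at the hourly budget."""
    return int(budget // price_per_hr)


def capacity_advantage(gpu_price, gpu_mem_gb, cpu_price, cpu_mem_gb):
    """CPU fleet memory divided by GPU fleet memory at equal hourly spend."""
    cpu_total = instances(cpu_price) * cpu_mem_gb
    gpu_total = instances(gpu_price) * gpu_mem_gb
    return cpu_total / gpu_total


def shuffle_tbps(price_per_hr, gbps_per_node):
    """Fleet-wide inter-node bandwidth: instance count times per-node rate."""
    return instances(price_per_hr) * gbps_per_node / 1000


# H100 vs Genoa capacity: AWS on-demand vs a hypothetical $20/hr GPU node.
aws_ratio = capacity_advantage(55.04, 640, 7.20, 768)
cheap_ratio = capacity_advantage(20.00, 640, 7.20, 768)
print(f"CPU capacity advantage: AWS ~{aws_ratio:.1f}x, $20/hr node ~{cheap_ratio:.1f}x")

# era: ((GPU $/hr, GPU Gbps per node), (CPU $/hr, CPU Gbps per node))
ERAS = {
    "A100 vs Milan": ((21.96, 400), (2.88, 100)),
    "H100 vs Genoa": ((55.04, 3200), (7.20, 300)),
    "B200 vs Turin": ((113.93, 3200), (7.92, 300)),
}
for era, (gpu, cpu) in ERAS.items():
    g, c = shuffle_tbps(*gpu), shuffle_tbps(*cpu)
    leader = "GPU" if g > c else "CPU"
    print(f"{era}: GPU {g:.1f} Tbps vs CPU {c:.1f} Tbps -> {leader} fleet leads")
```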

Conclusion

Three lenses — compute parity, $1M bare-metal capital, and equal hourly AWS spend — tell a consistent story with two competing narratives.

GPUs are gaining decisively on bandwidth. Bandwidth per chip crossed above that of a compute-equivalent CPU cluster between the GH200 and GB200 generations, and that lead has widened steadily. Bandwidth per dollar is the most stable metric across all isocost views: 4.3–5.4× on bare metal across three GPU generations, and 3.5–3.9× on AWS across three cloud generations (A100 through B200). Of every metric tracked in this post, memory bandwidth per dollar has eroded the least, and for analytics workloads it is the metric that matters most.

The compute and cost advantages that once justified GPU deployments are compressing. The FP32 advantage at equal spend falls from ~5.4× (GH200 vs Milan, $1M bare metal) to ~2.5× (VR200 vs Turin) — not because GPU compute stagnated, but because GPU prices are rising faster than per-chip compute gains. The cloud view shows a non-monotonic pattern: the A100/Milan era started at ~4.5×, the H100/Genoa generation peaked at ~9.5× (H100 offered exceptional value at launch — ~3.4× more compute than A100 without proportional pricing), and the B200/Turin shows compression to ~5.2× as instance prices have overtaken compute gains. The Perf/W lead is narrowing for the same reasons.

The capacity gap is structural, but its size is set largely by price. At compute parity, CPU DRAM clusters hold 51–91× more memory than GPU HBM. At equal budget (whether $1M capital or $1k/hr cloud), that collapses to 8–11×. The gap is real, but much of it is a pricing artifact: DDR capacity is cheap per gigabyte; HBM is not. For workloads that genuinely need TBs of fast-path memory, CPUs hold a durable structural advantage. For most analytics workloads that fit in HBM, the gap is largely academic.

We will revisit this comparison when AMD EPYC Venice (Zen 6) specs are finalized, to measure how much the CPU side shifts the balance versus NVIDIA Vera Rubin.

Reference Tables

Cloud Instance Pricing — AWS on-demand Linux rates from Vantage, April 2026. GPU instances include the host CPU. Prices may vary by region.

Instance	GPU / CPU	$/hr (on-demand Linux)	Source
`p4d.24xlarge`	8× NVIDIA A100 SXM4 40GB + Intel Xeon Platinum 8275CL	$21.96 (us-east-1)	🔗
`p5.48xlarge`	8× NVIDIA H100 SXM5 + AMD EPYC 7R13	$55.04 (us-east-1)	🔗
`p6-b200.48xlarge`	8× NVIDIA B200 + Intel Xeon Emerald Rapids	$113.93 (us-east-1)	🔗

`hpc6a.48xlarge`	2× AMD EPYC Milan (7R13, 48-core)	$2.88 (us-east-2)	🔗
`hpc7a.96xlarge`	2× AMD EPYC Genoa (9R14)	$7.20 (us-east-2)	🔗
`hpc8a.96xlarge`	2× AMD EPYC Turin (9R45)	$7.92 (us-east-2)	🔗

Why these instances?

GPU side — maximum GPU density per generation:
- p4d.24xlarge (8× A100), p5.48xlarge (8× H100), p6-b200.48xlarge (8× B200) are all the flagship 8-GPU nodes AWS offers per generation.
- Using maximum GPU density per node minimizes fixed CPU/networking overhead per GPU and is the standard choice for GPU-heavy workloads.
CPU side — HPC-optimized hpc* instances specifically:
- The hpc6a/7a/8a family is chosen over general compute instances (like c6a, c7a) because they match the AMD EPYC generations used in the bare-metal section (Milan → Genoa → Turin), and they include high-bandwidth EFA networking — making the networking comparison fair vs the GPU instances.
Generation alignment: Each pair is matched within the same GPU era: A100 (2020) paired with Milan (2021), H100 (2022) with Genoa (2022), B200 (2024) with Turin (2024). This avoids comparing late-generation CPUs to early-generation GPUs or vice versa.

Memory Capacity and Bandwidth - Inside a single chip

Instance	#GPUs / #CPU Cores	Total HBM/DRAM Capacity	Total Memory BW
`p4d.24xlarge` A100 ×8	8 GPUs / 96 vCPUs	320 GB HBM2	~12.4 TB/s HBM
`p5.48xlarge` H100 ×8	8 GPUs / 192 vCPUs	640 GB HBM3	~26.8 TB/s HBM
`p6-b200.48xlarge` B200 ×8	8 GPUs / 192 vCPUs	1,440 GB HBM3e	~64 TB/s HBM

`hpc6a.48xlarge` Milan ×2s	— / 96 cores	384 GiB DDR4	~410 GB/s DDR4
`hpc7a.96xlarge` Genoa ×2s	— / 192 cores	768 GiB DDR5	~922 GB/s DDR5
`hpc8a.96xlarge` Turin ×2s	— / 192 cores	768 GiB DDR5	~1,152 GB/s DDR5

Interconnect Technologies — Intra-node bandwidth connects CPUs and GPUs within a single node (NVLink, PCIe, NUMA fabric). Inter-node bandwidth is the network fabric used for data exchange between nodes — the shuffle phase in distributed analytics.

  ①  Inter-chip BW          ②  Intra-node BW              ③  Inter-node BW

  GPU superchip (e.g. GB200 NVL72):
    ┌─── SoC ──────────┐
    │[CPU]──①──[GPU]   │──────────②──────────[GPU]  ···
    └──────────────────┘    NVLink / NVSwitch      │
         NVLink C2C                                └────③────[Node B]  ···
                                                        IB / EFA

  Cloud GPU node (e.g. p5.48xlarge):
    [Host CPU]──①──[GPU 0]──────②──────[GPU 1]  ···  [GPU 7]
                PCIe           NVLink              │
                                                   └────③────[Node B]  ···
                                                           EFA

  CPU cluster (e.g. hpc7a):
    (① n/a)   [Socket 0]──────②──────[Socket 1]
                          Inf. Fabric          │
                                               └────③────[Node B]  ···
                                                     IB / EFA

System	Type	Inter-chip BW	Intra-node BW	Inter-node BW (per node)
GH200 NVL32	GPU	NVLink C2C (~900 GB/s)	NVLink 4 (~900 GB/s/GPU)	IB NDR (~50 GB/s)
GB200 NVL72	GPU	NVLink C2C (~900 GB/s)	NVLink 5 NVSwitch (~1.8 TB/s/GPU)	IB NDR (~50 GB/s)
VR200 NVL72	GPU	NVLink C2C (~1.8 TB/s)	NVLink 6 (~3.6 TB/s/GPU, est.)	IB XDR / ConnectX-9 (~200 GB/s)

AMD Milan cluster	CPU	—	Infinity Fabric / NUMA (~200 GB/s cross-socket)	IB HDR (~25 GB/s)
AMD Genoa / Turin cluster	CPU	—	Infinity Fabric / NUMA (~250 GB/s cross-socket)	IB NDR (~50 GB/s)

`p4d.24xlarge` A100 ×8	GPU	PCIe 4.0 x16 (32 GB/s)	NVLink 3 bridge (~600 GB/s/GPU)	EFA2 (~50 GB/s)
`p5.48xlarge` H100 ×8	GPU	PCIe 5.0 x16 (64 GB/s)	NVLink 4 (900 GB/s/GPU)	EFA3 (~400 GB/s)
`p6-b200.48xlarge` B200 ×8	GPU	PCIe 5.0 x16 (64 GB/s)	NVLink 5 (1.8 TB/s/GPU)	EFA3 (~400 GB/s)

`hpc6a.48xlarge` Milan ×2s	CPU	—	Infinity Fabric / NUMA (~200 GB/s cross-socket)	EFA (~12.5 GB/s)
`hpc7a.96xlarge` Genoa ×2s	CPU	—	Infinity Fabric / NUMA (~250 GB/s cross-socket)	EFA (~37.5 GB/s)
`hpc8a.96xlarge` Turin ×2s	CPU	—	Infinity Fabric / NUMA (~250 GB/s cross-socket)	EFA (~37.5 GB/s)

AWS EFA speeds from Vantage, April 2026. InfiniBand speeds reflect standard configurations: HDR = 200 Gbps / 25 GB/s (2021 era), NDR = 400 Gbps / 50 GB/s (2022+). The jump from p4d (400 Gbps / 50 GB/s EFA) to p5 (3,200 Gbps / 400 GB/s EFA) reflects AWS’s EFA3 fabric generation deployed for the Hopper generation. NVIDIA NVLink and GPU intra-node specs from NVIDIA datasheets.

IB (InfiniBand) is an open industry-standard network fabric — completely independent of CPU vendor. NVIDIA owns Mellanox (acquired 2020), the dominant InfiniBand hardware maker, but the HCAs (host channel adapters) plug into any server via PCIe regardless of whether it runs AMD or Intel CPUs. Standard generations used in this post: HDR = 200 Gbps / 25 GB/s (circa 2021), NDR = 400 Gbps / 50 GB/s (2022+), XDR = 1,600 Gbps (200 GB/s)/port ConnectX-9 (2024+, est.).

EFA (Elastic Fabric Adapter) is AWS’s proprietary high-performance network interface for HPC and ML workloads. Unlike standard cloud networking, EFA uses OS-bypass RDMA (via AWS’s SRD — Scalable Reliable Datagram — protocol), skipping the kernel network stack to deliver lower latency and higher throughput for tightly-coupled distributed workloads. EFA is not InfiniBand, but achieves similar application-level semantics. Generations: EFA2 (~400 Gbps / ~50 GB/s, p4d) → EFA3 (~3,200 Gbps / ~400 GB/s, p5/p6-b200).

References

1. AMD EPYC 7763 (Milan) Processor: https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-7763.html
2. AMD EPYC 9654 (Genoa) Processor: https://www.amd.com/en/products/processors/server/epyc/9004-series/amd-epyc-9654.html
3. AMD EPYC 9965 (Turin) Processor: https://www.amd.com/en/products/processors/server/epyc/9005-series/amd-epyc-9965.html
4. Intel Xeon Platinum 8490H (Sapphire Rapids, 4th Gen Xeon Scalable): https://www.intel.com/content/www/us/en/products/sku/231749/intel-xeon-platinum-8490h-processor-112-5m-cache-1-90-ghz/specifications.html
5. Intel Xeon Platinum 8592+ (Emerald Rapids, 5th Gen Xeon Scalable): https://www.intel.com/content/www/us/en/products/sku/237250/intel-xeon-platinum-8592-processor-320m-cache-1-90-ghz/specifications.html
6. Intel Xeon 6980P (Granite Rapids, Xeon 6 with P-cores): https://www.intel.com/content/www/us/en/products/sku/240785/intel-xeon-6980p-processor-504m-cache-2-00-ghz/specifications.html
7. NVIDIA GH200 Grace Hopper Superchip: https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/
8. NVIDIA GB200 NVL72 (Grace Blackwell Superchip): https://www.nvidia.com/en-us/data-center/gb200-nvl72/
9. NVIDIA Vera Rubin NVL72 (Vera Rubin Superchip): https://www.nvidia.com/en-us/data-center/vera-rubin-nvl72/
10. AWS p4d.24xlarge (8× A100 SXM4 40GB + Intel Xeon Platinum 8275CL) on-demand Linux, $21.96/hr in us-east-1: https://instances.vantage.sh/aws/ec2/p4d.24xlarge
11. AWS p5.48xlarge (8× H100 SXM5 + AMD EPYC host) on-demand Linux, $55.04/hr in us-east-1: https://instances.vantage.sh/aws/ec2/p5.48xlarge
12. AWS p6-b200.48xlarge (8× B200 + Intel Xeon Emerald Rapids) on-demand Linux, $113.93/hr in us-east-1: https://instances.vantage.sh/aws/ec2/p6-b200.48xlarge
13. AWS hpc6a.48xlarge (2-socket AMD EPYC Milan 7R13) on-demand Linux, $2.88/hr in us-east-2: https://instances.vantage.sh/aws/ec2/hpc6a.48xlarge
14. AWS hpc7a.96xlarge (2-socket AMD EPYC Genoa) on-demand Linux, $7.20/hr in us-east-2: https://instances.vantage.sh/aws/ec2/hpc7a.96xlarge
15. AWS hpc8a.96xlarge (2-socket AMD EPYC Turin) on-demand Linux, $7.92/hr in us-east-2: https://instances.vantage.sh/aws/ec2/hpc8a.96xlarge

The Case for GPU-Accelerated Data Analytics

2026-03-12T18:00:00+00:00

For analytics workloads that fit in fast memory, the hardware case for GPU is strengthening — but the story is more nuanced than raw compute numbers suggest.

TL;DR

AI agents are changing the analytics workload — agentic speculation is exploding demand for structured analytic compute.
CPU analytics has had a great run, but for in-memory workloads the gap is shifting to GPU — CPU-centered databases powered enterprise analytics for decades, but for workloads that fit in fast memory, GPU bandwidth has crossed into a clear and durable lead. Compute and cost advantages, once decisive, are now compressing as GPU prices rise faster than per-chip gains.
GPU-accelerated databases are rising in research and industry — Conferences have seen a wave of GPU database papers since 2020, and GPU acceleration is reaching production tools. Yet building correct, full-featured GPU query engines remains a formidable engineering challenge.
NVIDIA has built a moat with RAPIDS AI and libcudf — virtually every GPU-accelerated analytic system today is built on libcudf, making it the critical layer to understand in this space.

The Analytics Workload Is Changing Fast — Enter AI Agents

In 2025, LLM-powered AI agents started proving their value, and their adoption has been rapidly spreading across enterprises, particularly for data analytics and insights extraction. The 2025 State of AI in Enterprise report shows that companies are now moving from piloting the technology to actually deploying it in production, noting that “many companies focused on experimenting last year [2025] have crossed the threshold into operational AI systems.”

Databricks is at the forefront of adopting LLMs and agent technology, and it is worthwhile to follow how they have been preparing for the coming explosion in their adoption. Through their latest Lakebase architecture, Databricks shows they are positioning for both OLTP and OLAP workloads required by agents for the full automation of the data exploration and productionization pipelines.

This architecture eliminates much of the cost, complexity, and lock-in that have defined databases for decades, and it is especially powerful for modern AI and agent-driven workloads, where developers want to launch many instances, experiment freely, and pay only for what they use.

Their latest Genie product is their version of the AI agents that will carry out this work, driven solely by high-level natural language commands tied to business needs.

Genie Code can autonomously carry out complex tasks such as building pipelines, debugging failures, shipping dashboards, and maintaining production systems.

Together, these advances will help bring AI agents to the market, simplifying much of the data science workflow. But Databricks believes a much bigger wave is ahead, one where agents are unleashed to search for insights by trying many different paths. They call this Agentic speculation, “a high-throughput process of exploration and solution formulation for the given task,” which Databricks engineers envision will require redesigning data systems to be agent-first¹.

Overall, as agentic workloads become more and more prevalent, the sheer scale and inefficiencies of agentic speculation will become the bottleneck, and our data systems will need to evolve in response

The impact on the analytics workload will be profound. Future systems will be designed almost exclusively with AI agents as first-class users, performing exploration, identifying insights, and productionizing their solutions. All of this will be done from raw structured and unstructured data.

Where will the engineering bottlenecks be? Agentic speculation will dramatically increase the velocity of both code generation and analytical queries, vastly increasing the effective memory bandwidth and working set memory requirements of the underlying systems. AI agents will also become more deeply integrated into the data infrastructure to provide intelligent exploration. Do we have the right software and hardware to support this movement? Today, we see massive investment in serving inference from GPUs, but not enough analytics workloads have been accelerated, and this is likely to become a major bottleneck.

The question I am posing is whether current CPU-centered data processing systems will be capable of handling the scale needed to support these new agentic workloads.

CPU-Centered Analytics: Decades of Dominance

In the past few decades, analytic query engines have been very successfully built around the CPU architecture, featuring a growing number of high-performance server cores (in the hundreds), deep cache hierarchies (in tens of MBs), and vectorized operations taking advantage of wider SIMD instruction sets (up to 512 bits wide).

To meet ever-larger volumes of data stored in object stores, these engines moved towards disaggregated architectures that enable elastic scaling of compute and storage. Coupled with open columnar data formats like Parquet and Arrow, this shift has fostered a wide ecosystem of query engines built on a composable data philosophy. It has been a remarkable run.

The milestones speak for themselves. Snowflake’s 2016 architecture pioneered separating compute from storage entirely, proving that cloud-native disaggregation could deliver elastic, multi-tenant analytics at scale. On the single-node analytical engine front, DuckDB brought embeddable, vectorized OLAP to the edge; ClickHouse pushed columnar execution to extreme throughput on commodity hardware; and Umbra/CedarDB pushed the boundary on single-node performance with JIT query compilation via LLVM and a hybrid row/columnar storage engine capable of handling both transactional and analytical workloads on a single system.

The composable data systems movement² has further decoupled execution from storage, built on two key standards: Apache Arrow as the universal in-memory columnar format enabling zero-copy data exchange between engines, and Substrait as a portable, cross-language query plan representation that lets a plan produced by one system be executed by another. On top of these, Velox (Meta) and Apache DataFusion provide reusable, modular physical execution engines that plug into larger systems rather than reinventing the wheel. This composability is now flowing upstream into the dominant distributed compute platforms — Gluten brings Velox-backed native execution into Apache Spark, Apache DataFusion Comet does the same using DataFusion as the native Rust backend, and Presto has adopted Velox as its native C++ evaluation engine — extending the CPU performance frontier by replacing JVM-based execution with optimized native kernels.

CPU vs GPU Hardware Trajectories: The In-Memory Gap Is Shifting in GPU’s Favor

Yet even as software pushes the CPU performance frontier further, the underlying hardware is hitting diminishing returns. AMD’s EPYC Turin, today’s server CPU bandwidth leader, peaks at ~576 GB/s per socket (+25% vs Genoa’s ~461 GB/s) and ~15 TFLOPS FP32 (+~40% vs Genoa’s ~11 TFLOPS), with max DRAM capacity flat at 6 TB across both generations. Intel’s Xeon 6 (Granite Rapids) reaches ~409 GB/s (+33% vs Sapphire Rapids’ ~307 GB/s) and ~10 TFLOPS FP32 (~2× vs Sapphire Rapids’ ~4.8 TFLOPS), with capacity likewise flat at 4 TB. These gains are meaningful but incremental, and capacity has effectively plateaued.

GPUs tell a different story. Driven by the insatiable demand for AI training and inference, NVIDIA’s flagship data-center superchips have advanced at a fundamentally different pace across just three generations, the GH200 (Grace Hopper, 2023), GB200 (Grace Blackwell, 2024), and VR200 (Vera Rubin, 2025): memory bandwidth grew 9x from 4.9 TB/s to 44 TB/s, FP32 compute grew 3.9x from 67 to 260 TFLOPS, and total unified memory capacity grew 3.4x overall, from 624 GB to 864 GB (+38%) to 2.1 TB (+143%). That puts VR200 at about 76x the bandwidth of a single EPYC Turin socket.
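For readers who want to check the growth multiples, the arithmetic is short enough to run directly. This is a small sketch using only the per-superchip figures quoted in the paragraph above; the variable and function names are illustrative.

```python
# Sketch: generational growth multiples from the quoted per-superchip figures.
GH200 = {"bw_tbs": 4.9, "fp32_tflops": 67, "mem_gb": 624}
GB200_MEM_GB = 864
VR200 = {"bw_tbs": 44.0, "fp32_tflops": 260, "mem_gb": 2100}  # 2.1 TB
TURIN_SOCKET_BW_GBS = 576


def ratio(new, old):
    """Growth multiple from old to new."""
    return new / old


def pct_gain(new, old):
    """Percentage increase from old to new."""
    return (new / old - 1) * 100


print(f"Bandwidth GH200 -> VR200: {ratio(VR200['bw_tbs'], GH200['bw_tbs']):.1f}x")
print(f"FP32 GH200 -> VR200: {ratio(VR200['fp32_tflops'], GH200['fp32_tflops']):.1f}x")
print(f"Memory GH200 -> VR200: {ratio(VR200['mem_gb'], GH200['mem_gb']):.1f}x")
print(f"Memory GH200 -> GB200: +{pct_gain(GB200_MEM_GB, GH200['mem_gb']):.1f}%")
print(f"Memory GB200 -> VR200: +{pct_gain(VR200['mem_gb'], GB200_MEM_GB):.1f}%")
vr200_vs_turin = ratio(VR200["bw_tbs"] * 1000, TURIN_SOCKET_BW_GBS)
print(f"VR200 vs one Turin socket: {vr200_vs_turin:.1f}x bandwidth")
```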

But for in-memory analytics, workloads whose active dataset fits within the fast memory tier (HBM for GPUs, DRAM for CPUs), raw hardware scaling alone does not determine the winner. A three-generation study across three lenses (compute parity, $1M bare-metal budget, equal AWS hourly spend) reveals two distinct trends pulling in opposite directions:

GPU memory bandwidth is the most durable advantage and is not eroding. It crossed above parity against a compute-equivalent CPU cluster between the GH200 and GB200 generations, and holds steady at 3.5–5.4× at equal spend across all three generations. The inflection is generational and structural.
GPU compute, cost, and Perf/W advantages are real but compressing. At equal spend, the FP32 advantage peaked at the H100 generation (~9.5× over Genoa on AWS) and has since fallen to ~5.2× for B200 — not because GPU compute plateaued, but because GPU prices are rising faster than per-chip compute gains. The Perf/W lead is narrowing for the same reason.
The capacity gap is large but driven by price, not physics. CPU DRAM holds 51–91× more memory at compute parity, but that collapses to 8–11× at equal spend. The difference is that DDR costs a fraction of HBM per gigabyte — once you normalize by budget, you are buying far more DDR capacity than the raw chip-count comparison suggests. If HBM prices fall relative to DDR over time, a trend already underway, this ratio will compress further in the GPU’s favor.

Assumptions: The directional conclusions above rest on specific cost and pricing inputs — rack-normalized GPU superchip prices ($39k–$188k per chip), AMD EPYC socket prices (~$8k–$14k), and AWS on-demand rates from April 2026. Cost figures are the most assumption-sensitive part of the analysis: GPU list prices vary by channel and contract, and cloud rates change frequently. The bandwidth and compute trends are hardware-spec-driven and more stable; the capacity and cost conclusions are pricing-driven and should be read as directional, not precise.

For a more detailed generation-by-generation comparison, see: GPU vs CPU for In-Memory Analytics: Bandwidth Holds as Compute and Cost Advantages Narrow Across Three Generations.

Coming next: The analysis above is scoped strictly to in-memory workloads. A follow-up post will delve into the big data case, where datasets exceed GPU HBM capacity, the HBM bandwidth advantage disappears at the PCIe or NVMe bottleneck, and CPU DRAM’s structural capacity advantage becomes decisive for analytics at scale.

GPU-Accelerated Databases Are Rising in Research and Industry

Unsurprisingly, the database research community has been paying close attention since 2020, with top conferences like SIGMOD and VLDB regularly accepting papers evaluating and building GPU-accelerated databases — both hybrid CPU-GPU and fully GPU-native. Recent highlights include:

Rethinking Analytical Processing in the GPU Era³ (CIDR 2026) — Sirius, a GPU plugin for DuckDB that rethinks analytical processing natively on the GPU.
Scaling GPU-Accelerated Databases beyond GPU Memory Size⁴ (VLDB 2025) — tackles the fundamental GPU memory capacity bottleneck with a hybrid CPU-GPU filtering strategy, achieving a 3.5× speedup over SQL Server at 1 TB scale on a single A100.
GPU Database Systems Characterization and Optimization⁵ (VLDB 2024) — systematically characterizes GPU database performance bottlenecks and proposes optimizations for modern workloads.
A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics⁶ (SIGMOD 2020) — proposes Crystal, a GPU query library, and shows that full query GPU speedup can exceed the memory bandwidth ratio (up to 25×) due to CPU vectorization limitations.

On the industry side, 2025 saw GPU acceleration reach mainstream data tools:

GPU execution landed in CPU dataframe engines like Velox⁷ and Polars⁸.
The RAPIDS Accelerator for Apache Spark⁹ enabled faster migration to GPU-accelerated distributed data engineering and analytics workloads.
Voltron published the design paper for Theseus¹⁰, their petabyte-scale GPU-accelerated query engine.

Despite genuine progress, building correct and performant GPU implementations of the full relational algebra remains enormously difficult. Managing GPU memory limits, PCIe transfer bottlenecks, operator fusion, and full SQL coverage is a hard engineering problem with no easy shortcut.

NVIDIA’s Moat: RAPIDS and libcudf

NVIDIA has seen this challenge coming for a while and has been systematically building a solution through its RAPIDS AI¹¹ ecosystem, first launched in 2018¹², well before the generative AI and LLM revolution had taken hold. At its core is a little-known C++ library, libcudf (and its sister libraries), a highly optimized, native GPU foundation that underpins virtually all GPU-accelerated analytic systems being built today.

It is the de facto single-node physical operator infrastructure in this space, and understanding it is the key to understanding how GPU databases actually work. And yet, despite its central role, in-depth technical coverage of libcudf’s internals is surprisingly scarce. Most available material stays at the user-facing API level, leaving critical questions about kernel design, memory management, and performance characteristics largely undocumented outside of the source code itself.

In future posts, I’ll dive deeper into the technical internals of libcudf and answer questions such as:

❓ How does libcudf translate relational operators into parallel GPU kernels?
❓ What is the tooling like to evaluate the library’s performance?
❓ How is libcudf used as a building block for larger distributed systems?

We are at an inflection point. The hardware gap between CPUs and GPUs is no longer a niche concern for ML engineers — it is becoming structurally relevant for anyone building or operating data systems at scale. For in-memory analytics, the shift is already underway: GPU bandwidth has crossed into a durable lead and the remaining gaps in cost and capacity are narrowing, not widening. The harder question — how this plays out when datasets exceed HBM capacity and the bottleneck shifts to PCIe or storage — is the subject of a future post. The research momentum, the industry adoption, and NVIDIA’s deliberate infrastructure investment all point in the same direction: GPU-accelerated analytics is moving from experimental to essential. The open question is not whether it will happen, but how fast the ecosystem matures and how much of the existing CPU-centric stack it displaces versus complements.

Excited about the momentum of GPU-accelerated analytics? Have questions about the software or hardware stack? Let me know below! 👇

References

1. Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First. https://arxiv.org/pdf/2509.00997
2. The Composable Data Management System Manifesto. VLDB 2023. https://www.vldb.org/pvldb/vol16/p2679-pedreira.pdf
3. Rethinking Analytical Processing in the GPU Era. https://arxiv.org/pdf/2508.04701
4. Scaling GPU-Accelerated Databases beyond GPU Memory Size. VLDB 2025. https://vldb.org/pvldb/vol18/p4518-li.pdf
5. GPU Database Systems Characterization and Optimization. VLDB 2024. https://vldb.org/pvldb/vol17/p441-cao.pdf
6. A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics. SIGMOD 2020. https://arxiv.org/pdf/2003.01178
7. Accelerating Large-Scale Data Analytics with GPU-Native Velox and NVIDIA cuDF. https://developer.nvidia.com/blog/accelerating-large-scale-data-analytics-with-gpu-native-velox-and-nvidia-cudf/
8. RAPIDS Adds GPU Polars Streaming, a Unified GNN API, and Zero-Code ML Speedups. https://developer.nvidia.com/blog/rapids-adds-gpu-polars-streaming-a-unified-gnn-api-and-zero-code-ml-speedups/
9. RAPIDS Accelerator for Apache Spark. https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/apache-spark-3/
10. Theseus: A Distributed and Scalable GPU-Accelerated Query Processing Platform Optimized for Efficient Data Movement. https://arxiv.org/pdf/2508.05029
11. RAPIDS AI. https://rapids.ai/learn-more/
12. GPU-Accelerated Data Analytics & Machine Learning (RAPIDS AI Launch, 2018). https://developer.nvidia.com/blog/gpu-accelerated-analytics-rapids/
