Lecture 2: Hardware Essentials for Caching

Learning Objectives

Prerequisites

Section 1: CPU and Memory Needs

The Engine Room: Deep Dive into CPU and Memory

Welcome to our exploration of hardware essentials for caching servers. When we talk about caching, we are fundamentally talking about speed. The entire purpose of a cache is to serve data faster than the primary data store. To achieve this speed, we rely on the fastest components in a modern computer system: the Central Processing Unit (CPU) and Random Access Memory (RAM). These two components form the engine room of any high-performance caching server. Getting their configuration right is not just a matter of performance tuning; it is the foundation upon which a successful caching strategy is built.

In-memory caches, such as Redis or Memcached, treat RAM as their primary storage medium. This is a paradigm shift from traditional database systems that are disk-oriented. For an in-memory cache, the amount of available RAM directly defines the maximum size of your cache. The speed of that RAM, combined with the CPU's ability to process requests, dictates your system's throughput and latency. In this section, we will dissect these two critical components, moving beyond simple specifications to understand the nuanced interplay between their features and the demands of caching workloads.

The Brains of the Operation: The CPU

The CPU executes the caching software's instructions, manages connections, serializes and deserializes data, and performs any computations required. While it might seem that caching is a simple "get and set" operation, the reality at scale is far more complex, and the choice of CPU has profound implications.

Core Count vs. Clock Speed: A Classic Debate

The most common CPU debate revolves around whether to prioritize a higher number of cores or a faster clock speed for each core. The correct answer depends entirely on the caching software and the workload.

The Modern Synthesis: For most modern caching servers, a balanced approach is best. Seek a CPU with a reasonably high clock speed (e.g., a base clock over 2.5 GHz with a high turbo frequency) and a sufficient number of cores to handle your expected concurrency and background tasks (8-16 cores is a good starting point for a moderately busy server).
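
As a quick sanity check before committing to a configuration, the sketch below reads the logical core count and the maximum frequency of one core on a Linux host using only the Python standard library. The `/sys` cpufreq path is Linux-specific and is not exposed on every platform, so treat this as an illustrative probe rather than a portable tool.

```python
# Quick check of what a Linux host actually provides (core count and clock),
# using only the standard library. The /sys cpufreq path is Linux-specific.
import os

print(f"Logical CPUs: {os.cpu_count()}")

freq_path = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq"  # value in kHz
try:
    with open(freq_path) as f:
        max_khz = int(f.read().strip())
    print(f"cpu0 max frequency: {max_khz / 1_000_000:.2f} GHz")
except (FileNotFoundError, PermissionError):
    print("cpufreq information not exposed on this platform")
```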

CPU Architecture: x86 vs. ARM

For decades, the server market has been dominated by the x86-64 architecture from Intel (Xeon) and AMD (EPYC). However, the ARM architecture, long dominant in mobile devices, has made significant inroads into the data center with offerings from companies like Ampere and cloud providers like AWS (Graviton processors). ARM-based servers often offer a higher core count at a lower power consumption level, leading to a better performance-per-watt ratio. For scale-out caching workloads, where you might run hundreds of small Memcached instances, the cost and power savings of ARM can be substantial. The primary consideration is software compatibility, but today, most major open-source caching solutions and Linux distributions have excellent support for the ARM64 architecture.

CPU Caches (L1, L2, L3): Caching within the Cache

It's a fascinating recursion: the CPU itself relies heavily on its own internal caches to function quickly. These caches (L1, L2, and L3) are small, extremely fast pools of SRAM built directly onto the CPU die. They store data that the CPU is likely to need again soon, saving a trip to the much slower main system RAM.

For caching workloads, a large L3 cache can be a significant performance booster. When the caching server's "hot" data structures or frequently accessed key-value pairs fit within the L3 cache, request latency can drop dramatically. This is because the CPU core can satisfy the request without ever having to go off-chip to main memory. When evaluating CPUs, a larger L3 cache is almost always a desirable feature for a caching server.

NUMA (Non-Uniform Memory Access)

In multi-socket servers (systems with more than one physical CPU), NUMA is a critical concept. In a NUMA architecture, each CPU has its own "local" bank of memory. Accessing this local memory is very fast. However, if a process running on CPU 1 needs to access data stored in the memory bank local to CPU 2, it must traverse a slower interconnect between the CPUs. This "remote" memory access introduces additional latency. Caching servers are highly sensitive to memory latency, so unmanaged NUMA effects can lead to inconsistent performance. The solution is often to use tools (like `numactl` on Linux) to "pin" the caching server process to a specific CPU and ensure it only allocates memory from its local memory node. This ensures all memory access is fast and predictable.
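
As a rough illustration of pinning, the sketch below restricts the current process to the CPUs of NUMA node 0 by reading that node's CPU list from sysfs. The node number and sysfs path are Linux-specific assumptions; in practice the caching server is usually launched under `numactl --cpunodebind=0 --membind=0` instead, which also enforces the memory binding rather than relying on the kernel's first-touch allocation policy.

```python
# Minimal sketch: pin the current process to the CPUs of NUMA node 0 so that
# memory it touches is served from that node's local bank (first-touch policy).
# The CPU list is read from sysfs (Linux-specific). For strict memory binding,
# launch the server under `numactl --cpunodebind=0 --membind=0` instead.
import os

def cpus_of_node(node: int) -> set[int]:
    """Parse a sysfs CPU list such as '0-7,16-23' into a set of CPU ids."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus: set[int] = set()
        for part in f.read().strip().split(","):
            if "-" in part:
                lo, hi = part.split("-")
                cpus.update(range(int(lo), int(hi) + 1))
            else:
                cpus.add(int(part))
    return cpus

node0_cpus = cpus_of_node(0)
os.sched_setaffinity(0, node0_cpus)  # 0 means "the current process"
print(f"Pinned to NUMA node 0 CPUs: {sorted(node0_cpus)}")
```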

The Lifeblood of Caching: Memory (RAM)

If the CPU is the brain, then RAM is the heart and circulatory system of an in-memory caching server. For solutions like Redis and Memcached, RAM is not just a performance enhancement; it is the primary storage medium. The quantity and quality of your RAM directly define the capacity and reliability of your cache.

Capacity: How Big is Your Cache?

This is the most straightforward aspect. The total amount of data you can store in an in-memory cache is limited by the amount of physical RAM available to the caching process. When sizing memory, you must account for:

  • The cached data itself (number of keys multiplied by the average value size)
  • Per-key overhead for key names, expiry metadata, and internal data structures
  • Headroom for replication, client, and persistence buffers
  • Memory reserved for the operating system and the caching process itself

Running out of memory in a caching server can lead to keys being evicted (deleted) unexpectedly, or in a worst-case scenario, the process crashing. Therefore, careful capacity planning is essential.

Speed and Bandwidth: DDR4 vs. DDR5

System memory comes in different generations, with DDR4 and DDR5 being the most common in modern servers. DDR5 offers significantly higher bandwidth (the rate at which data can be read or written) than DDR4. For a cache server handling millions of requests per second, higher memory bandwidth means the CPU can be fed data more quickly, increasing overall throughput. While latency (the time to access the first piece of data) is also important, for bulk data movement involved in serving many concurrent requests, bandwidth often becomes the limiting factor. If your budget and platform support it, choosing DDR5 is a wise investment for a new, high-performance caching server.
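
To see when bandwidth starts to matter, here is a hedged back-of-envelope sketch. The request rate and value size are illustrative assumptions, and the per-channel figures are theoretical peaks (transfer rate times 8 bytes per transfer); real workloads also pay for copies, metadata walks, and protocol handling, so effective demand is higher than the raw payload number.

```python
# Back-of-envelope check of memory bandwidth headroom. Request rate and value
# size are assumed; per-channel figures are theoretical peaks, not achievable
# throughput, so treat the comparison as directional only.
requests_per_sec = 2_000_000        # assumed peak GET rate
value_size_bytes = 4 * 1024         # assumed average value size (4 KB)

payload_gbs = requests_per_sec * value_size_bytes / 1e9
print(f"Raw payload traffic: {payload_gbs:.1f} GB/s")

ddr4_3200_per_channel = 3200e6 * 8 / 1e9   # ~25.6 GB/s theoretical peak
ddr5_4800_per_channel = 4800e6 * 8 / 1e9   # ~38.4 GB/s theoretical peak
for name, per_channel in [("DDR4-3200", ddr4_3200_per_channel),
                          ("DDR5-4800", ddr5_4800_per_channel)]:
    for channels in (2, 8):
        print(f"{name}, {channels} channels: ~{per_channel * channels:.0f} GB/s peak")
```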

ECC Memory: The Non-Negotiable Feature

Standard consumer-grade RAM is susceptible to random, single-bit flips caused by background radiation or electrical interference. This can silently corrupt data. A "0" might become a "1", or vice versa. In a desktop PC, this might cause a rare, inexplicable crash. In a server acting as a source of truth for session data or application state, this is a catastrophic failure. Error-Correcting Code (ECC) memory is a type of RAM that can detect and correct single-bit errors in real-time. For any production caching server, ECC RAM is not optional; it is a mandatory requirement for data integrity and system stability. The small additional cost is negligible compared to the cost of debugging data corruption issues.

Example: Memory Sizing for a Session Store

Let's calculate the memory required for a Redis instance to store user sessions. Assume 100,000 concurrent sessions, an average session payload of 4 KB, and roughly 64 bytes of per-key overhead for the key name and metadata.

Calculation:

  1. Memory for Session Data: 100,000 users * 4 KB/user = 400,000 KB = 400 MB
  2. Memory for Key Overhead: 100,000 keys * 64 bytes/key ≈ 6.4 MB
  3. Total Data Memory: 400 MB + 6.4 MB ≈ 407 MB
  4. Add Replication & Buffer Headroom (e.g., 25%): 407 MB * 1.25 ≈ 509 MB
  5. Add OS & System Headroom (e.g., 1 GB): 509 MB + 1024 MB ≈ 1.5 GB

Based on this, a server with 2 GB of RAM would be a safe minimum, but choosing a server with 4 GB or 8 GB of RAM provides comfortable headroom for growth and unexpected spikes in usage. Notice how the session data itself is only a fraction of the total required RAM.
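
For reference, the same arithmetic can be packaged as a small script. The function name and default parameters below are just one way to express the worked example, and the rounding follows the example (1 MB = 1000 KB for the data, 1 GB = 1024 MB for the OS headroom).

```python
# The sizing arithmetic from the example above, expressed as a reusable sketch.
# Parameter names and defaults are illustrative; adjust them to your workload.
def size_cache_mb(num_keys: int,
                  avg_value_kb: float,
                  key_overhead_bytes: int = 64,
                  buffer_factor: float = 1.25,
                  os_headroom_mb: int = 1024) -> float:
    data_mb = num_keys * avg_value_kb / 1000
    key_overhead_mb = num_keys * key_overhead_bytes / 1_000_000
    with_buffers_mb = (data_mb + key_overhead_mb) * buffer_factor
    return with_buffers_mb + os_headroom_mb

total_mb = size_cache_mb(num_keys=100_000, avg_value_kb=4)
print(f"Estimated requirement: {total_mb:.0f} MB (~{total_mb / 1024:.1f} GB)")
```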

Did You Know?

Early versions of Memcached were intentionally designed to be single-threaded. The creator, Brad Fitzpatrick, reasoned that for a simple in-memory hash table, the overhead of locking and context switching required for multi-threading would be greater than the benefit. The philosophy was to keep the server code simple and extremely fast on a single core, and to achieve scale by running many independent Memcached instances on many simple servers—a concept known as horizontal scaling or "scaling out." This design choice had a profound influence on how large-scale caching architectures were built for many years.

Section 1 Summary

Reflection Questions

  1. Why might a high clock speed be more beneficial than a high core count for a single-instance, single-threaded caching application like an older version of Redis?
  2. How would you justify the extra cost of ECC RAM to a project manager for a critical caching server that will store financial transaction data temporarily?
  3. Your team notices that your caching server's performance is highly variable, with some requests being much slower than others. The server has two CPUs. What architectural feature might be responsible, and what is your first step to diagnose and fix it?

Section 2: Storage and Network Design

The Lifelines: Designing Storage and Network Infrastructure

While the CPU and memory form the high-speed core of a caching server, they do not operate in a vacuum. The storage and network subsystems are the critical lifelines that connect this core to the rest of the world. A poorly designed storage system can jeopardize data durability and slow down server restarts, while an inadequate network can become the primary bottleneck that renders your fast CPU and memory useless. In this section, we will explore the roles of storage and networking, not as afterthoughts, but as co-equal partners in a high-performance caching architecture.

The Foundation: Storage Design

It might seem counterintuitive to focus on storage for "in-memory" caching, but storage plays several vital roles that are essential for a robust and manageable system.

The Roles of Storage in a Caching Server

Even in an "in-memory" system, local disks host the operating system and log files, hold persistence files for durable caches (such as Redis snapshot and append-only files) so that data survives a restart, and serve as the cache medium itself for disk-backed caches like NGINX.

Choosing the Right Storage Technology

The performance characteristics of different storage technologies vary dramatically. Selecting the right one is a trade-off between performance, cost, and endurance.

Understanding SSD Endurance (DWPD)

SSDs have a finite lifespan, determined by the number of write cycles their NAND flash cells can endure. This is measured in Drive Writes Per Day (DWPD). A 1 TB SSD with a 1 DWPD rating can be fully written once per day, every day, for its warranty period (typically 5 years). Caching workloads with persistence, such as a Redis append-only file (AOF), are extremely write-heavy. Using a consumer-grade SSD (often rated below 0.3 DWPD) in such a role will lead to premature failure. Enterprise SSDs come in different classes: "Read-Intensive," "Mixed-Use," and "Write-Intensive," with DWPD ratings ranging from under 1 to 10 or more. It is critical to analyze your expected write workload and choose an SSD with an appropriate endurance rating to ensure reliability.
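
As a rough way to apply this, the sketch below converts an assumed sustained write rate into a required DWPD figure. The write rate and drive size are placeholders for measured values, and SSD-internal write amplification is ignored, so the estimate is optimistic.

```python
# Rough endurance check: does a given SSD's DWPD rating cover an AOF-style
# write stream? The inputs are assumptions to be replaced with measured data;
# write amplification inside the SSD is ignored, so the result is optimistic.
sustained_write_mb_s = 20          # assumed average append rate
drive_capacity_gb = 960            # assumed drive size
rated_dwpd = 1.0                   # endurance class of the candidate drive

written_gb_per_day = sustained_write_mb_s * 86_400 / 1024
required_dwpd = written_gb_per_day / drive_capacity_gb
print(f"Writes per day: {written_gb_per_day:.0f} GB "
      f"-> required DWPD: {required_dwpd:.2f} (drive rated {rated_dwpd})")
if required_dwpd > rated_dwpd:
    print("Choose a higher-endurance (Mixed-Use or Write-Intensive) drive.")
```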

The Conduit: Network Design

A caching server is only as fast as its connection to the applications that use it. In many well-tuned systems, after optimizing the CPU and memory, the network becomes the final and most significant bottleneck. Every request and every response travels over the network, and its characteristics—latency and bandwidth—define the user-perceived performance.

Latency vs. Bandwidth: The Critical Distinction

These two terms are often used interchangeably, but they measure different things, and for caching, latency is usually the more important metric.

A typical cache operation (e.g., `GET mykey`) involves a very small amount of data. The performance of this operation is dominated by latency, not bandwidth. You could have a 100 Gbps network link, but if the latency is high (e.g., because the server is in a different continent), the request will still be slow. The goal in network design for caching is to minimize latency at every step.
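
A quick calculation makes the point concrete: a synchronous client can complete at most one request per round trip per connection, so the round-trip time alone caps throughput regardless of link speed. The RTT figures below are illustrative assumptions.

```python
# Why latency dominates small cache operations: one synchronous request per
# round trip per connection, no matter how fast the link is. RTTs are assumed.
scenarios = {
    "same rack (~0.1 ms RTT)": 0.0001,
    "same region (~1 ms RTT)": 0.001,
    "cross-continent (~80 ms RTT)": 0.080,
}
for label, rtt_s in scenarios.items():
    ops_per_conn = 1 / rtt_s
    print(f"{label}: ~{ops_per_conn:,.0f} synchronous GETs/sec per connection")
```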

The Network Interface Card (NIC)

The NIC is the server's gateway to the network. Its capabilities are a primary determinant of network performance.

Switching and Physical Topology

The server's NIC is only one part of the equation. The network switches and the physical layout of the network also play a huge role in minimizing latency.

Example: Storage and Network for Different Caches

Scenario A: Memcached Cluster

Memcached is a pure in-memory cache with no persistence. Its hardware needs reflect this simplicity.

  • Storage: A pair of small, inexpensive SATA SSDs in a RAID 1 mirror for the OS. Since no data is written to disk, storage performance and endurance are not concerns.
  • Network: 10 GbE NIC. Latency is key. Since Memcached scales by adding more nodes, low-latency communication between the application servers and the many cache nodes is critical for overall application performance.

Scenario B: Redis with AOF Persistence

This server needs to provide in-memory speed plus durability, which places heavy demands on storage.

  • Storage: A pair of high-endurance, "Mixed-Use" or "Write-Intensive" NVMe SSDs in a RAID 1 mirror. Every write command is appended to the AOF, so the storage must sustain a high rate of small, random writes with very low latency.
  • Network: 10 GbE or 25 GbE NIC. The network needs to handle the client request traffic while also supporting replication traffic to a secondary server, which can be substantial.

Did You Know?

To achieve massive scale, Facebook developed a custom Memcached proxy called `mcrouter`. When a web server needs data, it sends a request to its local `mcrouter` instance. This proxy, based on a complex and constantly updated map of the entire caching infrastructure, intelligently routes the request to the correct Memcached server out of tens of thousands of servers spread across multiple data centers. This architecture demonstrates that at scale, the network routing and topology are just as important as the individual cache servers themselves.

Section 2 Summary

Reflection Questions

  1. You have a budget for either an NVMe SSD upgrade (from SATA SSD) or a 10 GbE NIC upgrade (from 1 GbE), but not both. For a write-heavy Redis cache with AOF persistence, which upgrade would you prioritize and why?
  2. Explain the concept of RDMA to a non-technical manager. Why would it be a worthwhile investment for a company building a large-scale, real-time financial analytics platform using Apache Ignite?
  3. Your NGINX cache, which is disk-backed, is performing poorly. Users complain of slow load times for assets that should be cached. Monitoring shows low CPU and RAM usage but high disk I/O wait times. What is the likely hardware bottleneck, and what specific storage technology would you recommend to fix it?

Section 3: Comparative Hardware Analysis

Tying It All Together: Hardware Profiles for Real-World Solutions

We have deconstructed the individual components—CPU, memory, storage, and network. Now, we will synthesize this knowledge to build complete hardware profiles for different categories of caching software. As noted by Jainandunsing (2025), caching solutions can be broadly categorized by their resource requirements, from lightweight key-value stores to heavyweight distributed data grids. The key takeaway is that there is no one-size-fits-all server; the optimal hardware configuration is a direct reflection of the software's architecture and the intended use case. This section will provide a comparative analysis, equipping you to make informed decisions when architecting or purchasing hardware for your caching needs.

Categorizing Caching Solutions by Hardware Weight

We can classify caching solutions into three broad tiers based on their typical hardware footprint and architectural complexity.

Tier 1: Lightweight In-Memory Caches (e.g., Redis, Memcached)

These solutions are the sprinters of the caching world. They are designed to do one thing—store and retrieve key-value pairs in memory—and do it exceptionally fast. Their hardware profiles are optimized for low latency and high throughput on simple operations.

Tier 2: Web & Proxy Caches (e.g., NGINX, Varnish Cache)

These systems sit in front of web applications, caching HTTP responses to reduce load on the backend servers. They can operate in memory, on disk, or a hybrid of both. Their performance is tied to I/O in all its forms: network I/O, memory I/O, and disk I/O.

Tier 3: Heavyweight Distributed Data Grids (e.g., Apache Ignite, Couchbase Server)

These are far more than simple caches. They are distributed, in-memory platforms that can offer database-like features, including SQL querying, transactions, and distributed computations, all while maintaining the speed of an in-memory system. Their hardware requirements are the most substantial.

Cloud vs. On-Premise Hardware Decisions

The choice of where to deploy your caching server—in a public cloud (like AWS, Azure, Google Cloud) or on-premise in your own data center—has significant hardware implications.

Comparative Table of Minimum Hardware Requirements

This table summarizes the minimum requirements for various caching solutions for a small-scale workload, based on data from Jainandunsing (2025). This illustrates the relative "weight" of each solution.

| Component | Memcached | Redis (User Sessions) | NGINX (Caching) | Apache Ignite | Couchbase Server |
|-----------|-----------|-----------------------|-----------------|---------------|------------------|
| CPU | 1 core @ 1.5+ GHz | 1 core @ 1.5+ GHz | 1 core @ 1.5+ GHz | 2 cores @ 2.0+ GHz | 2 cores @ 2.0+ GHz |
| RAM | 256-512 MB | 256-512 MB | 512 MB | 2 GB | 4 GB |
| Storage | 2 GB SSD (logs/OS) | 2-5 GB SSD | 5-10 GB SSD | 5-10 GB SSD | 20 GB SSD |
| Network | 100 Mbps-1 Gbps | 100 Mbps-1 Gbps | 100 Mbps-1 Gbps | 1 Gbps | 1 Gbps |

Note: These are absolute minimums for basic functionality. Production systems require significantly more resources for performance, redundancy, and scale.

Did You Know?

Netflix, one of the world's largest users of caching, built a sophisticated system called EVCache based on Memcached. It runs on thousands of Amazon EC2 instances. Their engineering team continuously benchmarks and analyzes the performance of different EC2 instance types. A change in their caching access patterns might lead them to switch from a memory-optimized instance to a compute-optimized one, or vice-versa, to achieve the best performance-to-cost ratio. This demonstrates that hardware selection is not a one-time decision but a continuous process of optimization, even in the cloud (Netflix Technology Blog, 2017).

Section 3 Summary

Reflection Questions

  1. Your team is building a new application that will require a distributed cache for both simple key-value lookups and complex SQL-like queries on the cached data. Based on hardware profiles, would you start with Redis or Apache Ignite? Justify your answer in terms of initial hardware cost and future scalability.
  2. A startup wants to use Varnish to cache their website content. They have a very limited budget. Would you recommend they deploy it on a cheap cloud VM or a repurposed, older on-premise server? Discuss the pros and cons of each approach in terms of performance, reliability, and cost.
  3. Looking at the comparative table, why do you think Apache Ignite and Couchbase Server have a minimum RAM requirement that is 4-8 times higher than Redis or Memcached? What architectural differences does this imply?

Glossary

CPU Cache
A small, extremely fast memory built into the CPU (L1, L2, L3) used to store frequently accessed data from main RAM, reducing access latency.
DWPD (Drive Writes Per Day)
An endurance rating for SSDs indicating how many times the drive's total capacity can be written per day for its warranty period.
ECC RAM (Error-Correcting Code RAM)
A type of system memory that can detect and correct common kinds of internal data corruption, essential for server stability.
IOPS (Input/Output Operations Per Second)
A performance metric for storage devices measuring the number of read and write operations it can perform per second.
Latency
The time delay in data communication. In networking, it's the time for a packet to travel from source to destination (often measured as Round-Trip Time).
NIC (Network Interface Card)
The hardware component that connects a computer to a computer network.
NUMA (Non-Uniform Memory Access)
A memory architecture for multi-CPU systems where the access time depends on the memory location relative to the processor. Access to local memory is faster than access to remote memory (memory connected to another CPU).
NVMe (Non-Volatile Memory Express)
A high-performance storage protocol and interface that connects SSDs directly to the PCIe bus, offering significantly lower latency and higher throughput than SATA.
RDMA (Remote Direct Memory Access)
A technology that allows network adapters to transfer data directly to or from application memory on another computer, bypassing the CPU and OS, which dramatically reduces latency.

References

DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., & Vogels, W. (2007). Dynamo: Amazon's highly available key-value store. SIGOPS Operating Systems Review, 41(6), 205–220. https://doi.org/10.1145/1323293.1294281

Jainandunsing, K. (2025). Caching servers hardware requirements & software configurations (Version 1.0). [Internal Document].

Leibiusky, J., & Josiah, C. (2011). Redis in action. Manning Publications.

Netflix Technology Blog. (2017). EVCache: The tail at scale. Retrieved from https://netflixtechblog.com/evcache-the-tail-at-scale-1-45f06b853535

Tanenbaum, A. S., & Austin, T. (2012). Structured computer organization (6th ed.). Pearson.
