14

Hard drives use onboard RAM as a cache, up to a few hundred megabytes on high-capacity drives. I can see how a small buffer would be necessary to match the raw read speed to the speed of the SATA/SAS interface, but what is the purpose of having a large cache? The operating system's page cache will almost always be significantly larger, and the OS already handles read-ahead.

Given that the disk cache is so much smaller than the page cache, I would think that it would have a miss rate of 100%, considering the page cache would be inclusive of the on-board cache.
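
To illustrate that intuition, here is a toy two-level LRU simulation (a minimal sketch; the cache sizes, uniform access pattern, and LRU policy are all assumptions, not a model of any real drive). The page cache absorbs all re-reads, so the lower cache only ever sees cold misses:

```python
import random
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache that counts hits and misses."""
    def __init__(self, size):
        self.size = size
        self.data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, block):
        if block in self.data:
            self.hits += 1
            self.data.move_to_end(block)
            return True
        self.misses += 1
        self.data[block] = None
        if len(self.data) > self.size:
            self.data.popitem(last=False)  # evict the least recently used
        return False

page_cache = LRUCache(10_000)  # host page cache (larger)
disk_cache = LRUCache(100)     # drive's on-board cache (smaller)

for _ in range(200_000):
    block = random.randrange(20_000)  # working set exceeds both caches
    if not page_cache.access(block):
        disk_cache.access(block)      # only page-cache misses reach the drive

hit_rate = disk_cache.hits / (disk_cache.hits + disk_cache.misses)
print(f"drive cache hit rate: {hit_rate:.4f}")
```

With re-reads filtered out by the larger, effectively inclusive page cache, the drive cache's hit rate stays near zero unless the drive populates it with data the host never asked for, which is exactly what the white paper below describes.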

There is a white paper which claims that a major benefit is the ability to populate the buffer with "free" read-ahead and read-behind that is obtained by the head after it has finished seeking, while it is merely waiting out the rotational latency:

That being said, based on traces we know that the cache hit rates in the disks are relatively high, despite the fact that we already have caching in multiple higher layers. This is surprising and still needs more exploration, but we suspect the main reason for this is the effectiveness of read ahead (and read behind). If so, most of the RAM actually in the disk should be used for this purpose, in addition to write buffering. Furthermore, the disk can do read ahead and read behind when it’s free – even when there are pending requests, since those requests might require some rotational time to service. *The host does not have visibility into when it is free to extend a read.*

Emphasis mine. Is this the primary purpose? If not, what is?

Obviously there is a benefit in writeback mode, but I'm talking about cached reads.

  • It's not necessarily a cache, but the memory of the controller and firmware. The command queue at least needs to be stored somewhere, as does the address translation table. Commented Oct 15 at 5:47
  • @user3528438 Hence why I ask about the purpose of a large disk buffer. Surely the firmware does not require, say, 512 MiB. Commented Oct 15 at 5:48
  • One thing that will really help here is this: if there's an OS crash, the disk can still write out its cache, which could combat file corruption, whereas the OS page cache will stop in a severe crash. Commented Oct 15 at 7:49
  • "What is the purpose of a large disk buffer if the page cache is always larger?" - BTW your use of "disk buffer" is ambiguous, and misleading if refering specifically to the drive's on-board cache. The size of the page cache (on the host) is irrelevant in comparison to the drive cache. What has not yet been "read" (i.e. transferred to the host) can benefit from a drive cache. You overlook the total access time (seek time + head switch time + rotational latency) of HDDs that penalize almost every operation. "Is this the primary purpose?" - Yes. Commented Oct 15 at 9:53
  • What makes you think reads are important in sizing the disk cache in the first place? I don't think you can just say "obviously this is good for writes but I don't care about them, please answer as if writes didn't exist". For example, it is very common (at least among consumer drives) that SMR-based models have many times bigger caches than the equivalent CMR model, purely to make drive-managed SMR work reasonably well. So at least for these, it's the writes that matter, caching reads is just a free bonus that might or might not matter at all. Commented Oct 15 at 15:11

4 Answers

6

You seem to focus primarily on the sizes of the host's page cache and an HDD's onboard cache. But the benefit of cache hits has to be weighed against the (operational) cost of acquiring the cache's contents.

The page cache could simply retain the requested blocks that have been read from the HDD, and hope that retaining them will result in cache hits if any of those blocks are requested again. The acquisition cost of such cache content is essentially zero, whereas when "the OS ... handles read-ahead", the costs incurred are the read operations from the HDD. This cost becomes a net performance penalty if the read-aheads are excessive and have a low cache hit rate.
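
As a concrete aside, Linux exposes that read-ahead window per block device in sysfs, so the cost/benefit trade-off above is directly tunable (a minimal sketch; the path assumes Linux and a device named sda):

```python
# Read the kernel's read-ahead window for one disk. Oversizing this value
# produces exactly the "excessive read-ahead" penalty described above.
with open("/sys/block/sda/queue/read_ahead_kb") as f:
    print("kernel read-ahead window:", f.read().strip(), "KiB")
```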

In similar fashion, the HDD could retain already-read blocks in its cache, but the hit rate on those blocks should be close to nil with a reasonably effective and efficient OS. The HDD can, however, also populate its cache by acquiring read-ahead blocks at no cost, or perhaps a small time penalty (i.e. deferring the next operation).

A disk operation requires an access time before the drive can perform the actual read/write operation on a sector. The three primary phases are (1) seek time, (2) head switch time, and (3) rotational latency. A sequential operation typically means that seek time and head switch time are zero. But where on the track the disk controller actually begins reading is effectively random.

The worst-case scenario (for rotational latency) is the controller just missing the start of the target sector, so that a complete revolution is required for the target sector to come back around. Normally this waiting is dead time, wasted doing nothing, lengthening the access time. But if the sectors following the target sector are read (instead of ignored) by the controller and saved in a cache, then a read-ahead can be accomplished while performing a normal operation. This read-ahead can be performed at absolutely no time cost to the host, i.e. essentially for free.
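
To put a number on that dead time (the 7200 RPM figure is an assumption; drives range from roughly 5400 to 15000 RPM):

```python
rpm = 7200                 # assumed spindle speed
rev_ms = 60_000 / rpm      # milliseconds per full revolution
print(f"worst-case rotational wait: {rev_ms:.2f} ms")    # ~8.33 ms
print(f"average rotational wait:    {rev_ms/2:.2f} ms")  # ~4.17 ms
```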

The number of possible sectors for a read-ahead depends on (a) the position of the target sector on the track (i.e. relative to the start/end indicated by the index mark), and (b) where the controller begins reading the track. Note that with zone bit recording, the number of sectors per track is variable, and the CHS address of an LBA is unknown (and typically irrelevant) to the host.

Given the previous worst case scenario of the controller just missing the start of the target sector, the worst sub-case scenario (i.e. the worst of the worst) is when the target sector is the last sector of the track. That means that there are zero sectors for a read-ahead. A best sub-case scenario (i.e. the best of the worst) is when the target sector is the first sector of the track. That means that almost a full track's worth of sectors is available for a read-ahead. But for a random LBA and an arbitrary read start, the number of sectors that can be read ahead at zero performance penalty falls between those two extremes (with a non-uniform probability distribution skewed toward zero).
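
A quick Monte Carlo sketch of that distribution (the sectors-per-track count is illustrative, since real counts vary by zone; the single-sector read and uniformly random positions are simplifying assumptions):

```python
import random

S = 400           # assumed sectors per track (varies by zone in reality)
TRIALS = 200_000

counts = []
for _ in range(TRIALS):
    target = random.randrange(S)    # track position of the requested sector
    landing = random.randrange(S)   # where the head starts reading the track
    if landing > target:
        # Head landed past the target: every sector from `landing` to the end
        # of the track (all following the target in LBA order) passes under
        # the head before the target comes back around -> free read-ahead.
        counts.append(S - landing)
    else:
        # Head landed at or before the target: only lower-LBA sectors pass
        # under the head while waiting (read-behind); no free read-ahead.
        counts.append(0)

print("mean free read-ahead sectors:", sum(counts) / TRIALS)       # ~ S/6
print("fraction with zero read-ahead:", counts.count(0) / TRIALS)  # ~ 1/2
```

Under these assumptions, roughly half of all random reads yield no free read-ahead at all, and the mean is about a sixth of a track, which matches the skew toward zero described above.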


If the disk controller is intent on populating its cache with read-ahead blocks, then a read operation could be extended past the target sector whenever the controller begins reading the track before the target sector (rather than after). Such an extended/prolonged read would only be "free" when there are no other pending disk operations queued up.

Bottom line: a disk controller could perform read-aheads and maintain an on-board cache with no performance cost to the host.



There is a white paper which claims that a major benefit is the ability to populate the buffer with "free" read-ahead and read-behind that is obtained by the head while it is still seeking the requested data:

• Yes, such reads can be "free", i.e. no additional time to perform.
• The utility/benefit of "read-behind" is questionable (but of course possible).
• Your use of "seeking" is a poor word choice, because that is not the access phase that is active at that point.

The host does not have visibility into when it is free to extend a read.

Emphasis mine. Is this the primary purpose?

I would assume so.

  • Your use of "seeking" is a poor word choice – Thank you. I've edited the question to clarify. You're right; it's not reading while it's seeking, but after it has finished seeking, while waiting out the rotational latency. Commented Oct 16 at 15:30
5

As in the question, I'm limiting this answer to read access. As you mention, the purpose of the cache is more obvious for writes; one might go as far as to say that write caching is the purpose and read caching is a side benefit.

The host does not have visibility into when it is free to extend a read.

Emphasis mine. Is this the primary purpose? If not, what is?

I'm not sure what level of evidence would be sufficient to say this definitively, but I'd say that yes, this is the important factor during reads. For the hard drive, it is always free (if power usage is not considered) to read something when there is no active request.

For the operating system, the equation is different. If the operating system starts a read request for read-ahead, the drive will complete it, even if that delays other requests. The NCQ command set defined in ACS-3 helps a bit by letting the drive reorder requests, but the supported priorities are only "normal priority", "deadline-dependent priority" and "high priority". There is no way for the operating system to issue a request of "do this only if the drive is otherwise idle". It could abort the read-ahead request, but it might end up aborting a request that is already almost complete.
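
The closest thing the host has is an advisory hint such as posix_fadvise, and even that turns into an ordinary-priority read once the kernel acts on it (a minimal Linux sketch; the file name is a placeholder):

```python
import os

fd = os.open("some_file.dat", os.O_RDONLY)  # placeholder path
try:
    # Ask the kernel to read ahead 1 MiB starting at offset 0. The hint is
    # only advisory, and any I/O it triggers is issued at normal priority:
    # there is no "only if the drive is otherwise idle" flag to pass down.
    os.posix_fadvise(fd, 0, 1024 * 1024, os.POSIX_FADV_WILLNEED)
    data = os.read(fd, 4096)  # likely served from the page cache afterwards
finally:
    os.close(fd)
```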

Considering the relative sizes of the page cache vs. hard drive cache, the hard drive can speculatively read any nearby data until a new request arrives. It does not have to care about holding onto old data that might be accessed again, as the operating system cache will handle that.

Manufacturers usually increase cache size with increasing drive capacity. This could be due to the number of read heads in the drive: the read speed increases, and therefore you get more free read-ahead during the same window of operating-system latency. On the other hand, it could be a result of market analysis of price vs. performance optimization for lower- and higher-end drives. For SMR drives, the lower write performance probably dominates the need for a large cache.

2

It is not only the size of the cache. The internal disk cache is much "closer" to the source of information and is (usually) much faster than the OS cache. It also stores information from nearby sectors (related to the target read), which can be helpful in some cases.

Moreover, the controller/firmware knows much more about the disk internals and can manage the cache more efficiently for the disk itself.

In the end, this is a typical multilayered storage/cache hierarchy: going to upper layers, you lose the "connection" with the source and get a more abstract view of the information.

  • It may be "closer", but it has to pass through the page cache anyway... Commented Oct 15 at 6:56
  • "... but it has to pass through the page cache anyway.." - Irrelevant. If the host has requested a read of an LBA from the drive, would you rather have (a) the drive transfer the block quickly from its cache, or (b) wait for the drive to access the sector before it can transfer the block? Commented Oct 15 at 9:58
  • @sawdust If both caches are cold, the drive will have to access the sector like normal. Once it does that, it'll be resident in both the drive's cache and the page cache. Then the next request will read it from the page cache. Would there ever be a common circumstance where it would get evicted from the page cache, yet would still remain on the drive cache? Commented Oct 15 at 10:09
  • "Internal disk cache [...] is much faster than OS cache." - Not from the perspective of applications. When a process performs a read that is serviced from the OS cache, this only involves the CPU making a memory copy in RAM. In certain situations, even the copy isn't necessary. A read request that is serviced from the internal disk cache on the other hand has to go through an interface like SATA or NVME, which has both higher latency and lower bandwidth than system RAM. Commented Oct 15 at 17:36
  • @Forest The hard drive controller's cache being closer matters for writes. Its rotation/battery/capacitor can't supply power to the operating system when power is lost. So with the cache being powered by the hard drive controller, the application can spend less time waiting for a COMMIT. Commented Oct 15 at 18:31
-2

Frame Challenge.

I can see how a small buffer would be necessary to match the raw read speed to the speed of the SATA/SAS interface.

"a few hundred megabytes" is just 3% of 1% of a (relatively ancient 2007) 1TB drive. It's just 0.3% of a 10TB drive. That's a pretty darned tiny buffer cache.

Even the original SATA and SAS interfaces could drain that "few hundred megabytes" in just a second or two. Current versions can drain it in a fraction of a second.
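
The arithmetic, as a quick sketch (nominal payload rates after 8b/10b encoding overhead; 300 MB stands in for "a few hundred megabytes"):

```python
cache_mb = 300  # "a few hundred megabytes"

# Nominal payload rates (MB/s) after 8b/10b encoding overhead.
rates_mb_s = {
    "SATA 1.5 Gb/s (original)": 150,
    "SAS 3 Gb/s (original)": 300,
    "SATA 6 Gb/s (current)": 600,
    "SAS 12 Gb/s (current)": 1200,
}

for name, rate in rates_mb_s.items():
    print(f"{name}: {cache_mb / rate:.2f} s to drain {cache_mb} MB")
```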

Thus, I think the whole premise of your question is flawed, because "a few hundred megabytes" is what you say you can understand: a "small buffer (that is) necessary to match the raw read speed to the speed of the SATA/SAS interface."

  • By "large" I mean relative to smaller disk caches, not relative to the disk capacity or page cache. Commented Oct 16 at 15:11
  • @forest I've edited my answer for clarity. Commented Oct 16 at 15:52
