The NT kernel’s virtual memory subsystem is absolutely terrifying. It’s over 300,000 lines of code (a similar size to a cut-down but still usable Linux kernel, including filesystem and network stacks and VirtIO drivers). A lot of the complexity comes from two early decisions that rarely make sense today:
Everything must be pageable.
Don’t make promises you can’t keep.
The first of these was vital on early NT systems. NT 3.51 required 12 MiB of RAM. NT 4 ran quite well with 32 MiB. Windows 95 (released slightly before NT 4) allegedly needed 4 but really wanted 8 or 16. Moving people from the DOS-based line to NT was a priority. If the kernel has 1 MiB of data that can’t be swapped out, that’s a problem. If the kernel can’t swap out an entire process, that’s a problem.
NT wires a few things that are necessary for swapping (e.g. the disk driver) but very little else. When a page is swapped out, all of the metadata to find it has to fit in the not-present page table entry for the page. Supported systems all have a valid bit in PTEs and ignore all other bits in hardware if the valid bit is not set, so this gives 31 or 63 bits of state per swapped out page. This decision makes page-table pages swappable as well. You can swap out all of a process’s memory, and all of its page tables, right up to the root. From there, you can demand page everything back in. First you’ll hit page faults during page-table walk, then you’ll hit page faults for the pages that those represent, but an idle process can be completely swapped out, as can most kernel state.
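To make that concrete, here’s a rough sketch of the idea in C. This is not NT’s real software-PTE layout (the field names and widths are invented for illustration); the point is just that once the hardware valid bit is clear, the OS owns the remaining bits and can use them to record where the page went:

```c
#include <stdint.h>

/* Illustrative only: invented field names and widths, not NT's layout.
 * When 'valid' is set the MMU interprets the entry; when it is clear
 * the MMU ignores everything else, so the OS reuses those bits to
 * describe where the swapped-out page lives. */
typedef union {
    uint64_t raw;
    struct {                       /* hardware view, valid == 1 */
        uint64_t valid      : 1;
        uint64_t writable   : 1;
        uint64_t user       : 1;
        uint64_t reserved   : 9;
        uint64_t pfn        : 40;  /* physical frame number */
        uint64_t more_flags : 12;
    } present;
    struct {                       /* software view, valid == 0 */
        uint64_t valid       : 1;  /* must remain 0 */
        uint64_t in_pagefile : 1;  /* 0 = demand-zero, 1 = paged out */
        uint64_t pagefile    : 4;  /* which pagefile it went to */
        uint64_t protection  : 5;  /* protection to restore on fault */
        uint64_t offset      : 53; /* page-sized offset in that pagefile */
    } swapped;
} pte_t;

_Static_assert(sizeof(pte_t) == sizeof(uint64_t), "one PTE is one machine word");
```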
This stopped being a big win by the time memory was measured in GiBs. Wasting a few MiBs of memory in exchange for performance was typically a win by then. Most *NIX systems don’t page out kernel memory (except the buffer cache, which is the vast majority of kernel-managed state) and keep look-aside structures for swapped out memory. This significantly simplifies a lot of the kernel code (not just in the VM subsystem but in other things that hold locks and can guarantee that they won’t trigger swapping when they call other kernel routines).
The second choice also made sense at the time. When memory was scarce, you needed to handle exhaustion gracefully on common code paths. With 8 MiBs of RAM, exhausting memory was common, and so the kernel made a guarantee: if it promised memory, then the memory would definitely be there when you tried to use it. This led to a second requirement: all memory is fungible. NT keeps a commit charge for accounting. Every time a process (or bit of the kernel) wants a page, the commit charge is incremented. This must not exceed the total of swap plus memory.
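As a sketch of that accounting rule (hypothetical names and numbers, nothing like the real implementation), every commit request has to fit under RAM plus swap or it fails up front:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical counters standing in for the system-wide state. */
static size_t total_physical_pages = 2 * 1024;   /* pretend: 8 MiB of RAM   */
static size_t total_pagefile_pages = 4 * 1024;   /* pretend: 16 MiB of swap */
static size_t committed_pages;                   /* the commit charge       */

/* Commit never exceeds RAM + swap, so every committed page is
 * guaranteed to have somewhere to live when it is first touched. */
static bool charge_commit(size_t pages)
{
    if (committed_pages + pages > total_physical_pages + total_pagefile_pages)
        return false;               /* the allocation fails up front */
    committed_pages += pages;
    return true;
}

int main(void)
{
    printf("commit 4096 pages: %s\n", charge_commit(4096) ? "ok" : "refused");
    printf("commit 4096 more:  %s\n", charge_commit(4096) ? "ok" : "refused");
    return 0;
}
```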
When you touch a page, it may be allocated for the first time, but there’s always either free RAM to allocate it from or there’s swap space that can be used to page out something else to make space. This works only if all pages are the same size. This assumption does not hold with MTE or CHERI (where a swapped-out page stores metadata and so is slightly larger, but a page not using these features is smaller) and definitely doesn’t hold with compressed swap (where the size of a swapped-out page is data dependent). *NIX systems typically do overcommit instead: calls to mmap will succeed, but may fail to actually provide the memory when it’s accessed (or may require another process to be killed).
In theory, overcommit would lead to significantly worse reliability. On memory-constrained systems, it definitely does but on large systems (modern mobile phones on up), it doesn’t. Most code doesn’t actually check for allocation failure (or, if it does, doesn’t have reliable error-handling paths). Allocation failure is sufficiently rare that it’s hard to handle. On Windows, it throws an SEH exception, which nothing expects and it percolates up the stack until you find a catchall. Then you discard one run loop iteration’s work and often end up with things in an undefined state (sorry, you dropped one outbound packet, but I’m sure nothing will be confused by the following one arriving). If you can’t gracefully recover from allocation failure then you get better reliability by making memory allocation failure rarer. The few things that do gracefully recover can happily pre-fault on systems with overcommit and use more RAM in exchange for better determinism.
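For a concrete picture of what that looks like (Windows-specific C; HEAP_GENERATE_EXCEPTIONS is what turns heap exhaustion into an exception instead of a NULL return), the ‘catchall’ pattern is roughly this, and everything the loop body had half-done is simply abandoned:

```c
#include <windows.h>
#include <stdio.h>

/* Hypothetical per-iteration work that allocates along the way. */
static void process_one_event(void)
{
    /* With HEAP_GENERATE_EXCEPTIONS, failure raises an exception
     * instead of returning NULL, so callers rarely check anything. */
    char *buf = HeapAlloc(GetProcessHeap(), HEAP_GENERATE_EXCEPTIONS, 64 * 1024);
    /* ... partially update shared state, send half a packet, ... */
    HeapFree(GetProcessHeap(), 0, buf);
}

int main(void)
{
    for (;;) {
        __try {
            process_one_event();
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            /* The catchall: drop this iteration's work and hope the
             * state it left behind doesn't confuse the next one. */
            fprintf(stderr, "event dropped: 0x%lx\n", GetExceptionCode());
        }
    }
}
```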
The accounting mechanism means that you often exhaust commit long before you exhaust memory. My work machine had 128 GiBs of RAM; when I initially provisioned it with ‘only’ 128 GiBs of swap space, I would typically see allocation failures with 50 GiBs still free. Lots of processes allocate memory that they don’t touch, which consumes commit charge but not memory. Eventually there isn’t enough available commit charge to allow an allocation and things start failing. Garbage-collected languages often handle this by running the GC more aggressively, which hurts performance. Oh, and for extra fun, NT can deliver an event when memory is low, but it fires when physical memory is low, not when commit charge is nearly exhausted, so you can have allocations failing without the event ever being triggered.
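If you want to see the mismatch yourself: the low-memory event is the memory resource notification object, which tracks physical memory, while the commit numbers have to be polled separately with GlobalMemoryStatusEx, something like:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Signalled when available *physical* memory is low -- it says
     * nothing about how close the system is to the commit limit. */
    HANDLE low = CreateMemoryResourceNotification(LowMemoryResourceNotification);

    BOOL is_low = FALSE;
    QueryMemoryResourceNotification(low, &is_low);

    /* The commit situation has to be checked separately. */
    MEMORYSTATUSEX ms;
    ms.dwLength = sizeof ms;
    GlobalMemoryStatusEx(&ms);

    printf("physical memory low: %s\n", is_low ? "yes" : "no");
    printf("available physical:  %llu MiB\n", ms.ullAvailPhys / (1024 * 1024));
    printf("available commit:    %llu MiB\n", ms.ullAvailPageFile / (1024 * 1024));

    CloseHandle(low);
    return 0;
}
```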
These decisions made perfect sense in the early ’90s but haven’t for at least 15 years. Unfortunately, things like SQL Server make strong assumptions about the behaviour of the NT VM subsystem and so changing anything is likely to cause huge performance regressions. They also build their reliability guarantees on top of it (hint: if your reliability guarantees come from processes never crashing, you’ve done something wrong), so you can’t optimise for 99% of Windows programs without hurting the ones that make MS a lot of money.
Thanks for the fantastic, detailed comment.
Why did the early NT versions require so much more RAM than Windows 95? Was it due to the inevitable overhead of being fully 32-bit, fully preemptively multitasked, and having strong memory protection and security? Or was there some overengineering as well?
Some of it was just page tables. NT moved a bunch of things into userspace services that each needed their own address space. That’s only a few tens of KiBs per process, but a dozen of those add up. Some of it came from being fully 32-bit (no 16-bit thunks). Some came from features that weren’t present in the 9x series, such as NTFS with ACL support, multi-user support, and access control on all kernel objects. I am not intimately familiar with NT of that era, but I don’t think it was particularly over-engineered. Its requirements were fairly similar to contemporary UNIX systems (its direct competitors).
NT is the first modern kernel that’s still similar to what we have today. Yep, those features come with a price, and the tradeoff is worth the cost.
Potentially solvable by porting the Linux port of SQL Server back onto Windows.
I don’t think it’ll actually happen but it would be very funny.
Edit: actually, thinking about it more, how do those assumptions affect performance of SQL server now that you actually can have it running on Linux? :)
There is no port of SQL Server to Linux. There’s a port of a big chunk of the NT kernel and Win32 libraries to Linux (‘SQL PAL’, formerly known as Drawbridge); SQL Server runs on top of this.
Because SQL Server on Linux is SQL Server on Windows, with the NT kernel running as a user-mode Linux process. See Drawbridge.
It’s articles like these that sometimes make me wonder just how many applications out there are running 10x-100x slower than they need to because of some weird process priority minutia they never thought to dig into.
It would be nice if there was a feature to limit the amount of memory the application could allocate, not just how much of the allocated data can be in RAM at once. “Hey, Teams - you get 500 MB of RAM and no more. And Slack, 200 MB should be quite enough for chatting”. Maybe there is such a feature too?
You don’t really want to do it on Linux because malloc implementations typically make heavy use of overcommit. They’ll often ask for memory in 2-8 MiB chunks to amortise system call costs, knowing that the kernel won’t provide physical pages for all of it unless it’s used. Similarly, applications often allocate large buffers and rely on the same behaviour to allocate them when they’re used (allocating 1 MiB and reading 17 KiB into it is cheaper than reading 4 KiB at a time, resizing the underlying allocation, and stopping at the end). Address space is basically free on 64-bit platforms, only physical memory is expensive.
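A sketch of that second pattern (the file name and sizes are arbitrary): reserve a generously sized buffer and let the kernel back only the pages that actually get written:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Ask for 1 MiB up front; with overcommit the kernel typically
     * provides physical pages only for the parts we actually write. */
    size_t cap = 1024 * 1024;
    char *buf = malloc(cap);
    if (!buf)
        return 1;

    FILE *f = fopen("input.dat", "rb");   /* hypothetical input file */
    if (!f)
        return 1;

    /* One big read into the large buffer: if only 17 KiB arrives, only
     * a handful of pages ever get faulted in and backed by RAM. */
    size_t got = fread(buf, 1, cap, f);
    printf("read %zu bytes into a %zu-byte buffer\n", got, cap);

    fclose(f);
    free(buf);
    return 0;
}
```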
On Windows, you can limit the amount of commit that a process can charge, which has a similar effect, but Windows libraries tend to be somewhat more conservative about using up commit charge, because it’s a scarce resource.
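One way to apply that kind of cap (not the only one) is a job object with a per-process memory limit; the 512 MiB figure below is just an example:

```c
#include <windows.h>

int main(void)
{
    /* Create an anonymous job and cap the commit each process in it
     * may charge; 512 MiB here is an arbitrary example. */
    HANDLE job = CreateJobObjectW(NULL, NULL);

    JOBOBJECT_EXTENDED_LIMIT_INFORMATION info = {0};
    info.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
    info.ProcessMemoryLimit = 512ull * 1024 * 1024;

    SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                            &info, sizeof info);

    /* Put the current process (or a freshly created one) in the job;
     * commits beyond the limit will then fail. */
    AssignProcessToJobObject(job, GetCurrentProcess());
    return 0;
}
```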
I’m pretty sure with cgroups and containers this can be done. https://unix.stackexchange.com/questions/555080/using-cgroup-to-limit-program-memory-as-its-running
Can’t it be done with just ulimit?
I think the difficulty here is that ulimit only works on a single process.
Also I don’t think any of its tunables corresponds to commit charge but I’m not sure about that.
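The nearest ulimit knob is probably ulimit -v, which maps to RLIMIT_AS: it caps address space rather than commit charge, so it also counts untouched reservations and interacts badly with the overcommit-heavy allocation patterns described above. A sketch:

```c
#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Cap the process's *address space* at 256 MiB (arbitrary figure).
     * This is what `ulimit -v` sets; it is not the same thing as a
     * commit-charge limit, and it also counts untouched reservations. */
    struct rlimit rl = { .rlim_cur = 256ul * 1024 * 1024,
                         .rlim_max = 256ul * 1024 * 1024 };
    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }

    /* Large mappings now fail even if they would never be touched. */
    void *p = malloc(512ul * 1024 * 1024);
    printf("512 MiB malloc %s\n", p ? "succeeded" : "failed");
    free(p);
    return 0;
}
```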
Sounds like classic Mac OS.
Yes, with a minor difference being that I’m pretty sure any modern application will crash and burn if it’s denied all the RAM it wants…
Looks like it’s possible to limit RAM on Windows, at least with the help of a tool: https://github.com/lowleveldesign/process-governor#limit-memory-of-a-process
That’s basically how it worked on the classic Mac OS too. Constantly having to adjust the application’s allocation if it didn’t get enough.
The funny thing is, I’m not even sure why applications had to have a fixed allocation. The OS allocates all memory with a void** scheme, so it could relocate chunks anywhere, instead of having to have all app memory be contiguous.
That’s a very good question. Classic MacOS didn’t have memory protection so it couldn’t page unused stuff out of memory, afaik, but the double-pointer to all allocations should make it possible to shuffle stuff around…
Perhaps it aided fragmentation? With a double-pointer you can move memory easily, but not split or coalesce it without the application having to help…
Edit: I have been informed that classic MacOS (and win3.x) did indeed have swapping, you have to lock a handle to operate on it and while it is unlocked its backing memory can be swapped out.
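A toy model of that double-pointer scheme (nothing like the real Toolbox Memory Manager, just to show why relocation is easy as long as blocks are unlocked):

```c
#include <stdlib.h>
#include <string.h>

/* A 'handle' is a pointer to a master pointer owned by the allocator.
 * The allocator may move the underlying block and update the master
 * pointer, as long as the handle is not locked. */
typedef struct {
    void *ptr;      /* master pointer: current location of the block */
    size_t size;
    int locked;     /* while locked, the block must not move */
} master_t;

typedef master_t *handle_t;

static handle_t new_handle(size_t size)
{
    handle_t h = malloc(sizeof *h);
    h->ptr = malloc(size);
    h->size = size;
    h->locked = 0;
    return h;
}

static void *lock_handle(handle_t h)   { h->locked = 1; return h->ptr; }
static void  unlock_handle(handle_t h) { h->locked = 0; }

/* 'Compaction': move an unlocked block somewhere else. Every user that
 * goes through the handle sees the new location automatically. */
static void relocate(handle_t h)
{
    if (h->locked)
        return;                     /* locked blocks stay put */
    void *fresh = malloc(h->size);
    memcpy(fresh, h->ptr, h->size);
    free(h->ptr);
    h->ptr = fresh;
}

int main(void)
{
    handle_t h = new_handle(1024);

    char *p = lock_handle(h);       /* pin it while we use the memory */
    strcpy(p, "hello");
    unlock_handle(h);               /* now the allocator may move it */

    relocate(h);                    /* simulated compaction */

    p = lock_handle(h);             /* re-lock to get the (new) address */
    /* p still points at "hello", wherever it lives now */
    unlock_handle(h);
    return 0;
}
```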
“Probably because of some creative scheme to conserve memory on the earliest Mac models, or maybe to make diagnosing problems easier” I guessed, and found some interesting history of Classic memory management: https://en.wikipedia.org/wiki/Classic_Mac_OS_memory_management
I’d like to read a response from someone who worked on this feature or was there when it was implemented. I assume the team that implemented this in Windows 7 (or Vista?) wasn’t incompetent.
Vista. Vista was RAM-constrained. Trimming working sets does mean sending memory from that process to the front of the queue in the event of paging, and will reduce paging of a foreground process.
In Vista, the thing that sets this bit is the search indexer.
Presumably 32 MB was a good choice for the search indexer. It might have been a good choice for other background tasks too - when the whole system needs to run in 448 MB of RAM, a background process just can’t have a 64 MB working set. Unfortunately, one bit of state that controls a pile of values is not going to work well for arbitrary processes with arbitrary memory requirements, particularly 16 years later.
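I don’t know which flag the indexer actually sets, but a hard working-set cap of the kind being described can be imposed with SetProcessWorkingSetSizeEx; a process under such a cap starts paging against itself as soon as it touches more than the cap, no matter how much RAM is free:

```c
#include <windows.h>

int main(void)
{
    /* Illustrative only: cap this process's working set at 32 MiB and
     * make the maximum a hard limit, so pages beyond that are trimmed
     * even when plenty of physical memory is free. */
    SetProcessWorkingSetSizeEx(GetCurrentProcess(),
                               4  * 1024 * 1024,   /* minimum */
                               32 * 1024 * 1024,   /* maximum */
                               QUOTA_LIMITS_HARDWS_MIN_DISABLE |
                               QUOTA_LIMITS_HARDWS_MAX_ENABLE);

    /* Any loop that sweeps more than ~32 MiB now page-faults
     * continuously, which is the slowdown being discussed. */
    return 0;
}
```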
It’s normally a good idea to express this kind of limit as a fraction of total RAM. Most *NIX systems compute things like the default for the maximum number of file descriptors like this, for example. 10 years after you introduce them, RAM is so big that users don’t see the limits.
I think Bruce is trying to challenge this notion. If a program sweeps 64 MB of RAM, it doesn’t matter how much physical RAM is present, the program will degrade if it gets less. Whether a 4 GB system should allow a background process to run well in 64 MB or force it to run badly in 32 MB depends on what the rest of the system is doing, which is why he’s suggesting leaving the balance set manager to do its job.
If the system were actually paging, it might make sense to nudge background processes to page more to have foreground processes page less. But even that should be based on the memory needs of the programs, not the physical memory in the machine.
If you’re able to dynamically tune these things, that will always perform better than static heuristics (unless computing the dynamic properties costs so much that it slows the system down more than a bad heuristic). I suspect that having background processes go slowly on Vista with under 512 MiB of RAM was less bad than the alternatives. If this limit had grown with RAM size, it would probably be 128 MiB or more now and no one would notice it. If you have more cores now and have some idle cycles to spare computing better memory pressure metrics then you can tune it even better, but adding anything to the NT kernel’s virtual memory subsystem is definitely ‘here be dragons’ territory. I think there are approximately three people who understand that code, of whom two will claim that they don’t and only Landy really does.