Overcoming the Windows 2GB Caching Limit

Mike Richardson

Mar. 11, 2006 11:34 PM


For years, Windows developers have struggled with the 2GB per-process memory limit, especially when architecting large-scale caching systems. The .NET CLR and the Java VM are subject to the same limit when running on 32-bit systems. Commercial off-the-shelf (COTS) products such as TimesTen or NCache do not remove the limit either: on a 32-bit system, a single process simply cannot hold a very large amount of data. However, a number of options are available, including a nice addition to the .NET 2.0 Remoting API, that make designing such a system more feasible. I will first describe the background of the problem and then walk through various solutions. The goal of this post is to present an architectural pattern for working around the 2GB limit in Windows and other runtime environments.

Typically, a 32-bit process running on Windows Server 2003 can access up to 2GB of virtual address space; the other 2GB of the 4GB that 32-bit addressing provides is reserved for the operating system. The pages in that address space are backed either by physical memory or by the paging file. The more processes that are running and committing memory, the more of that memory has to be backed by the paging file rather than by physical RAM.

When memory consumption approaches the 2GB limit, paging increases and performance begins to degrade. To improve performance and memory utilization, the Windows memory manager can use PAE (Physical Address Extension) on Intel chips, which lets the operating system address more than 4GB of physical memory and so reduces the need to swap memory out to the paging file. The client program is not aware of the actual memory size; all management and allocation of the memory addressed by PAE is handled independently of the program accessing it. To enable extended memory support and use PAE, the /PAE switch must be added to the boot.ini file, as illustrated below:

[boot loader]
[operating systems]
multi(0)disk(0)rdisk(0)partition(2)\WINDOWS="Windows Server 2003, Enterprise" /fastdetect /PAE

Even with PAE enabled, each process is still based on 32-bit linear addressing. However, multiple processes can benefit from the increased memory because they are less likely to run into physical memory restrictions and begin paging. Additionally, Windows applications can be modified to use the AWE (Address Windowing Extensions) API to allocate memory outside of the application's process space, essentially bypassing the 2GB constraint.

AWE is a set of application programming interfaces (APIs) to the memory manager that enables programs to address more memory than the 4GB available through standard 32-bit addressing. AWE lets a program reserve physical memory as non-paged memory and then dynamically map portions of it into the program's working set. This lets memory-intensive programs, such as large database systems, reserve large amounts of physical memory for data without that data being paged in and out of a paging file. Instead, the data is mapped in and out of the working set, and the reserved memory can lie above the 4GB range. The memory above 4GB is exposed to the memory manager and the AWE functions by PAE; without PAE, AWE cannot reserve memory in excess of 4GB.
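The sizes quoted so far follow from simple address arithmetic. The sketch below is only an illustration in Python, not part of any Windows API; the helper name addressable_gib is made up, and the 36-bit width assumes the physical address size of the original Intel PAE implementation.

```python
GIB = 2**30  # bytes in one gigabyte (binary)

def addressable_gib(address_bits):
    """Return how many gigabytes can be addressed with the given number of bits."""
    return 2**address_bits // GIB

linear_32 = addressable_gib(32)   # 32-bit linear addressing: the 4GB ceiling
user_default = linear_32 // 2     # default user-mode half available to a process
pae_36 = addressable_gib(36)      # assumed 36-bit physical addresses under PAE

print(linear_32, user_default, pae_36)
```

The gap between the per-process figure and the PAE figure is the memory that AWE lets a single process reach by mapping windows of it into its own address space.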

Now let’s state the problem in terms of developing a large, highly scalable caching solution…

The purpose of caching in a Windows application is to make infrequently changing data readily available to the application. The cached data is usually co-located on the application server to increase performance. In addition, letting application-tier components access data directly, without having to make a database connection, increases reliability and eliminates a single point of failure. Enabling caching on the application tier essentially creates a two-tier distributed application. Once data is retrieved from the database or another source, it can be cached so that excessive calls are avoided. Most popular caching APIs (including .NET's) also support sliding expiration of data items and callback functions that respond to underlying changes in the data store, and provide cache “controllers” for implementing common LRU and MRU algorithms.
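To make sliding expiration and LRU eviction concrete, here is a minimal sketch in Python. It is not the .NET caching API; the class name SlidingLruCache, its parameters, and the injectable clock are all assumptions chosen for illustration.

```python
import time
from collections import OrderedDict

class SlidingLruCache:
    """Hypothetical cache: LRU eviction plus sliding expiration."""

    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.clock = clock            # injectable so expiry can be tested
        self._items = OrderedDict()   # key -> (value, expires_at), oldest first

    def put(self, key, value):
        self._items[key] = (value, self.clock() + self.ttl)
        self._items.move_to_end(key)              # mark as most recently used
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)       # evict least recently used

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._items[key]                  # expired: drop and report a miss
            return default
        # Sliding expiration: every hit pushes the expiry time forward.
        self._items[key] = (value, self.clock() + self.ttl)
        self._items.move_to_end(key)
        return value
```

A caller would wrap its database lookup with get and put: on a miss, fetch from the data store and put the result; on a hit, the entry's lifetime is extended automatically.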

If your application must cache more than 2GB of data in a single process, that data can be accessed from .NET by using unmanaged code and the AWE API. However, one major issue still exists and must be dealt with: performance. For example, if you use PAE and AWE to cache a table that is approximately 6GB in size, then accessing data of that size in any caching system will degrade performance. Paging will slow access, and the disk seek times needed to locate the data can also be quite long. Note that when I say you must cache more than 2GB of data, I am assuming that the data itself is already optimized. For example, I am assuming that you are using efficient lookup structures in your cache, like StringDictionary, and not just a bunch of ArrayList objects, and that you have already performed string interning to decrease the overall memory footprint.
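As a rough illustration of those two optimizations, the Python sketch below interns a repeated column value and indexes rows in a hash-based dictionary instead of scanning a list. Python's sys.intern stands in for .NET string interning here; the function load_rows and the sample data are made up.

```python
import sys

def load_rows(raw_rows):
    """Intern the repeated 'state' column so equal values share one string object."""
    return [(name, sys.intern(state)) for name, state in raw_rows]

# Build the state strings at runtime, as they would arrive from a data reader.
raw = [
    ("Ada", "".join(["Vir", "ginia"])),
    ("Bob", "".join(["Vir", "ginia"])),
]
rows = load_rows(raw)

# Hash-based lookup by key rather than a linear scan over a list.
by_name = {name: state for name, state in rows}
```

With millions of rows and only a handful of distinct values in a column, sharing one object per distinct value can shrink the footprint considerably before any of the heavier techniques are needed.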

The next step is to reduce the amount of data any one process has to hold by mimicking the functionality of popular caching systems like NCache. For example, you can split the data up into logical buckets, provided that you create some form of controller which “understands” where to locate the appropriate data. The trickery comes in when the need arises to access data between buckets. In .NET, the built-in way to aggregate data between processes is .NET Remoting. If you are using the .NET 1.1 API, you may be out of luck if you are seeking the ultimate in performance; even with the binary formatter, the fastest channel available is TCP, which is not quite fast enough. The sheer number of CPU cycles needed to cross process boundaries would grind the caching system to a halt. Fortunately, the .NET 2.0 Remoting API introduced a new channel for inter-process communication (IPC), IpcChannel, which is built on named pipes.
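The bucket idea can be sketched as follows. This is a simplified Python illustration, not NCache's design: the class BucketController is hypothetical, plain dictionaries stand in for what would be separate processes, and CRC32 is just one possible way to pick a bucket.

```python
import zlib

class BucketController:
    """Hypothetical controller that knows which bucket owns each key."""

    def __init__(self, bucket_count):
        # Each dict stands in for a separate cache process of under 2GB.
        self.buckets = [{} for _ in range(bucket_count)]

    def bucket_index(self, key):
        # CRC32 is stable across runs, unlike Python's salted built-in hash().
        return zlib.crc32(key.encode("utf-8")) % len(self.buckets)

    def put(self, key, value):
        self.buckets[self.bucket_index(key)][key] = value

    def get(self, key, default=None):
        return self.buckets[self.bucket_index(key)].get(key, default)

    def get_many(self, keys):
        """Aggregate across buckets; in a real system this crosses process boundaries."""
        return {key: self.get(key) for key in keys}
```

The get_many call is where the cost discussed above shows up: each key may live in a different bucket, so every lookup is potentially a cross-process round trip.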

By combining IPC over a named pipe with a “smart” cache controller and some LRU and MRU algorithms, it is possible to achieve high performance in systems which need to access more than 2GB of memory.
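The request and reply exchange between a controller and one bucket might look like the Python sketch below. It uses Python's multiprocessing.connection module as a stand-in for the .NET Remoting IPC channel: with no address given, it picks the fastest local transport, which is a named pipe on Windows and a Unix domain socket elsewhere. The function serve_bucket, the message format, and the use of a thread in place of a real bucket process are all assumptions made to keep the example short.

```python
import threading
from multiprocessing.connection import Client, Listener

def serve_bucket(listener, data):
    """Answer ("get", key) requests until a ("stop", None) message arrives."""
    with listener.accept() as conn:
        while True:
            op, key = conn.recv()
            if op == "stop":
                break
            conn.send(data.get(key))

# No address given: the default local transport is used.
listener = Listener()
server = threading.Thread(target=serve_bucket, args=(listener, {"42": "Ada"}))
server.start()

# The controller side: connect to the bucket and ask for two keys.
with Client(listener.address) as conn:
    conn.send(("get", "42"))
    hit = conn.recv()
    conn.send(("get", "missing"))
    miss = conn.recv()
    conn.send(("stop", None))

server.join()
listener.close()
```

In a real deployment each bucket would run in its own process, and the controller would keep one open connection per bucket rather than reconnecting for every lookup.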

Of course there is one last point that I forgot to mention. If your project has a liberal budget, you can solve all your caching problems simply by buying a nice, high-end 64-bit machine.

Mike Richardson is a software architect currently specializing in developing highly scalable Microsoft .NET and J2EE applications.