by Xinfeng Liu, July 2009
For application developers, memory management is a key topic because memory is a critical resource for running programs. Each operating system has its own memory management mechanism, so understanding how Solaris manages memory helps application developers use memory resources well and avoid or resolve performance problems.
This article discusses how to observe memory usage on Solaris, how application memory is laid out on Solaris and how to use the libumem memory allocator to improve memory allocation efficiency. These are the basics for managing application memory on Solaris and resolving common memory usage problems.
Note: Debugging core dumps from incorrectly referenced memory and debugging memory leaks are beyond the scope of this article.
On several occasions, application developers have asked how Solaris manages memory and how to interpret various memory-related statistics: for example, how to tell whether the system is running low on physical memory, or how much memory an application can actually use.
Addressing such questions requires detailed information about application memory management on Solaris. This article describes those details, first at the system level and then at the application level.
A basic knowledge of the Solaris Operating System is assumed.
As a developer, you need to understand memory usage at the system level, including physical memory, swap, and tmpfs. Solaris provides several commands for observing memory usage at the system level.
To find how much physical memory is installed on the system, use the prtconf
command in Solaris. For example:
-bash-3.00# prtconf|grep Memory
Memory size: 1024 Megabytes
To find how much free memory is currently available in the system, use the vmstat
command. Look at the free
column (the unit is KB). For example:
-bash-3.00# vmstat 2
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr dd s1 -- -- in sy cs us sy id
0 0 0 2265520 262208 1 8 1 1 1 0 0 1 -1 0 0 309 193691 310 41 31 28
0 0 0 2175632 176072 3 8 0 0 0 0 0 0 0 0 0 326 278310 348 58 42 0
0 0 0 2175632 176120 0 0 0 0 0 0 0 0 0 0 0 305 263986 408 56 44 0
The vmstat
output shows that the system has about 176 Mbytes of free memory. In fact, on Solaris 8 or later, the free memory shown in the vmstat
output includes both the free list and the cache list. The free list is memory that is truly free: it has no association with any file or process. The cache list is also free memory; it holds the majority of the file system cache. The cache list is linked to the free list, and if the free list is exhausted, memory pages are taken from the head of the cache list.
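If you want a more detailed breakdown of physical memory, including the free list and the cache list, the ::memstat dcmd of the mdb kernel debugger can provide it (an illustrative addition; it requires root privileges, and the exact categories vary by Solaris release):
-bash-3.00# echo "::memstat" | mdb -k
The output reports how many pages are used by the kernel, by anonymous memory, by executable and library text, and by the page cache, and how many pages are on the cache list and the free list.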
To determine if the system is running low on physical memory, look at the sr
column in the vmstat
output, where sr
means scan rate. Under low memory, Solaris begins to scan for memory pages that have not been accessed recently and moves them to the free list. On Solaris 8 or later, a non-zero value of sr
means the system is running low on physical memory.
You can also use vmstat -p
to observe page in, page out, and page free activities. For example:
-bash-3.2# vmstat -p 2
     memory           page          executable      anonymous      filesystem
   swap  free  re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
1954016 1174028 2 19 4 2036 24 0 0 0 2 4 4 0 0 0
395300 32860 25 242 1089 1656 596 0 0 0 557 1089 1089 0 0 0
394752 32468 10 211 560 1344 380 0 0 0 572 560 560 0 0 0
394184 32820 18 241 1176 1092 1002 0 0 0 642 1176 1176 0 0 0
393712 33760 15 207 1806 888 4256 0 0 4 570 1792 1792 0 0 10
These page activities are categorized as executable ( epi, epo, epf
), anonymous ( api, apo, apf
), and filesystem ( fpi, fpo, fpf
). Executable memory means memory pages used for program and library text. Anonymous memory means memory pages not associated with files; for example, anonymous memory is used for process heaps and stacks. Filesystem memory means memory pages used for file I/O. When process pages are being swapped in or out, you will see large numbers in the api and apo columns. When the system is busy reading files from or writing files to the file system, you will see large numbers in the fpi and fpo columns. Paging activity is not necessarily bad, but constantly paging pages out and bringing new pages in, especially when free memory is low, is bad for performance.
To find what swap devices are configured in the system, use the swap -l
command. For example:
-bash-3.00# swap -l
swapfile dev swaplo blocks free
/dev/dsk/c0t0d0s1 136,9 16 4194224 4194224
To observe swap space usage, use the swap -s
command. For example:
-bash-3.00# swap -s
total: 487152k bytes allocated + 104576k reserved = 591728k used, 2175608k available
Note that the available swap space in the output of swap -s
command includes the amount of free physical memory.
On Solaris, you can, on the fly, remove swap devices with the swap -d
command or add them with the swap -a
command. Such changes do not persist across reboots. To make the changes persistent, modify /etc/vfstab
by adding or removing entries for the swap devices.
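For example, the following adds a swap slice on the fly and shows the /etc/vfstab entry that would make it persistent across reboots (the device name c0t0d0s3 is only illustrative):
-bash-3.00# swap -a /dev/dsk/c0t0d0s3

#device            device     mount   FS      fsck    mount   mount
#to mount          to fsck    point   type    pass    at boot options
/dev/dsk/c0t0d0s3  -          -       swap    -       no      -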
High swap-space usage does not necessarily mean the system needs additional physical memory or that such usage is the reason for bad performance. High swapping in and out activities (observable with vmstat -p
) can lead to performance problems: processes must wait for the swapping activity to finish before they can continue running. Moreover, swapping is a single-threaded activity.
In some cases, you must also pay attention to the available swap space. For example, if the system runs hundreds or even thousands of Oracle session processes or Apache processes, each process needs to reserve or allocate some swap space. In such cases, you must configure an adequately sized swap device or add multiple swap devices.
One difference between Solaris and other operating systems is /tmp
, which is a nonpersistent, memory-based file system on Solaris (tmpfs). Tmpfs is designed for the situation in which a large number of short-lived files (like PHP sessions) need to be written and accessed on a fast file system. You can also create your own tmpfs file system and specify the size. See the man page for mount_tmpfs
(1M).
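For example, a size-limited tmpfs file system could be created like this (the mount point and size are only illustrative):
-bash-3.00# mkdir /scratch
-bash-3.00# mount -F tmpfs -o size=512m swap /scratch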
Solaris also provides a ramdisk facility. You can create a ramdisk with ramdiskadm
(1M) as a block device. The ramdisk uses physical memory only. By default, at most 25 percent of available physical memory can be allocated to ramdisks. The tmpfs file system uses virtual memory resources that include physical memory and swap space.
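For example, a small ramdisk could be created and given a UFS file system as follows (the name and size are only illustrative):
-bash-3.00# ramdiskadm -a rd1 64m
-bash-3.00# newfs /dev/rramdisk/rd1
-bash-3.00# mount /dev/ramdisk/rd1 /mnt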
Large-sized files placed in tmpfs can affect the amount of memory space left over for program execution. Likewise, programs requiring large amounts of memory use up the space available to tmpfs. If you encounter this constraint (for example, running out of space on tmpfs), you can allocate more swap space by using the swap
(1M) command. Note, however, that adding swap space only relieves the space constraint; actual swapping activity still indicates a shortage of physical memory and hurts performance even when swap space is plentiful.
Before attempting to resolve application memory problems, developers should first learn about address spaces. We briefly discuss that subject next.
Each operating system has its own definition and break-up of the address spaces. Address space information for Solaris can be found from Solaris source code comments; since Solaris is an open-source operating system, this information is publicly available.
Note: All addresses mentioned in this article refer to virtual addresses, not physical addresses. Mapping virtual addresses to physical addresses is the responsibility of the operating system and the processor MMU (Memory Management Unit).
Solaris x86:
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/os/startup.c
......
              32-bit Kernel's Virtual Memory Layout.
              +-----------------------+
              |                       |
 0xFFC00000  -|-----------------------|- ARGSBASE
              |        debugger       |
 0xFF800000  -|-----------------------|- SEGDEBUGBASE
              |      Kernel Data      |
 0xFEC00000  -|-----------------------|
              |      Kernel Text      |
 0xFE800000  -|-----------------------|- KERNEL_TEXT (0xFB400000 on Xen)
              |---       GDT       ---|- GDT page (GDT_VA)
              |---    debug info   ---|- debug info (DEBUG_INFO_VA)
              |                       |
              |   page_t structures   |
              |   memsegs, memlists,  |
              |   page hash, etc.     |
        ---  -|-----------------------|- ekernelheap, valloc_base (floating)
              |                       |  (segkp is just an arena in the heap)
              |                       |
              |         kvseg         |
              |                       |
              |                       |
        ---  -|-----------------------|- kernelheap (floating)
              |        Segkmap        |
 0xC3002000  -|-----------------------|- segmap_start (floating)
              |        Red Zone       |
 0xC3000000  -|-----------------------|- kernelbase / userlimit (floating)
              |                       |   ||
              |     Shared objects    |   \/
              |                       |
              :                       :
              |       user data       |
              |-----------------------|
              |       user text       |
 0x08048000  -|-----------------------|
              |       user stack      |
              :                       :
              |        invalid        |
 0x00000000   +-----------------------+
......
                      64-bit Kernel's Virtual Memory Layout. (assuming 64 bit app)
                      +-----------------------+
                      |                       |
 0xFFFFFFFF.FFC00000  |-----------------------|- ARGSBASE
                      |     debugger (?)      |
 0xFFFFFFFF.FF800000  |-----------------------|- SEGDEBUGBASE
                      |        unused         |
                      +-----------------------+
                      |      Kernel Data      |
 0xFFFFFFFF.FBC00000  |-----------------------|
                      |      Kernel Text      |
 0xFFFFFFFF.FB800000  |-----------------------|- KERNEL_TEXT
                      |---       GDT       ---|- GDT page (GDT_VA)
                      |---    debug info   ---|- debug info (DEBUG_INFO_VA)
                      |                       |
                      |       Core heap       | (used for loadable modules)
 0xFFFFFFFF.C0000000  |-----------------------|- core_base / ekernelheap
                      |        Kernel         |
                      |         heap          |
 0xFFFFFXXX.XXX00000  |-----------------------|- kernelheap (floating)
                      |        segmap         |
 0xFFFFFXXX.XXX00000  |-----------------------|- segmap_start (floating)
                      |    device mappings    |
 0xFFFFFXXX.XXX00000  |-----------------------|- toxic_addr (floating)
                      |        segzio         |
 0xFFFFFXXX.XXX00000  |-----------------------|- segzio_base (floating)
                      |         segkp         |
                 ---  |-----------------------|- segkp_base (floating)
                      |   page_t structures   |  valloc_base + valloc_sz
                      |   memsegs, memlists,  |
                      |   page hash, etc.     |
 0xFFFFFF00.00000000  |-----------------------|- valloc_base (lower if > 1TB)
                      |        segkpm         |
 0xFFFFFE00.00000000  |-----------------------|
                      |        Red Zone       |
 0xFFFFFD80.00000000  |-----------------------|- KERNELBASE (lower if > 1TB)
                      |       User stack      |- User space memory
                      |                       |
                      |  shared objects, etc  |  (grows downwards)
                      :                       :
                      |                       |
 0xFFFF8000.00000000  |-----------------------|
                      |                       |
                      |   VA Hole / unused    |
                      |                       |
 0x00008000.00000000  |-----------------------|
                      |                       |
                      |                       |
                      :                       :
                      |       user heap       |  (grows upwards)
                      |                       |
                      |       user data       |
                      |-----------------------|
                      |       user text       |
 0x00000000.04000000  |-----------------------|
                      |        invalid        |
 0x00000000.00000000  +-----------------------+

A 32 bit app on the 64 bit kernel sees the same layout as on the 32 bit kernel, except that userlimit is raised to 0xfe000000.
Solaris SPARC:
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/sun4/os/startup.c
The address space for the user-level part of the code is extracted below:
......
              32-bit User Virtual Memory Layout.
              /-----------------------\
              |                       |
              |        invalid        |
              |                       |
 0xFFC00000  -|-----------------------|- USERLIMIT
              |       user stack      |
              :                       :
              :                       :
              :                       :
              |       user data       |
             -|-----------------------|-
              |       user text       |
 0x00002000  -|-----------------------|-
              |        invalid        |
 0x00000000  _|_______________________|

                       64-bit User Virtual Memory Layout.
                       /-----------------------\
                       |                       |
                       |        invalid        |
                       |                       |
 0xFFFFFFFF.80000000  -|-----------------------|- USERLIMIT
                       |       user stack      |
                       :                       :
                       :                       :
                       :                       :
                       |       user data       |
                      -|-----------------------|-
                       |       user text       |
 0x00000000.01000000  -|-----------------------|-
                       |        invalid        |
 0x00000000.00000000  _|_______________________|
Note that the address space layout differs according to the hardware platform (x86 or SPARC) and whether the kernel is 32-bit or 64-bit. A point worth noting is that under a 32-bit Solaris x86 kernel, the kernel and user space share the same 32-bit address space, which is limited to 4 Gbytes (2^32 bytes is 4 Gbytes, the maximum a 32-bit address can reach). The kernel base (the user limit) is a floating address, so the user-space limit differs on systems with different amounts of physical memory. On a system with more physical memory, the kernel needs more address space for its own data structures, so an application can use less address space (possibly less than 2 Gbytes) under a 32-bit Solaris x86 kernel.
Fortunately, nowadays, even on Intel and AMD platforms, 64-bit processors are dominant. With a 64-bit Solaris x86 kernel (we call it Solaris x64), the 32-bit application address limit is 0xFE000000, which is nearly 4 Gbytes. The 64-bit application address limit is 0xFFFFFD80.00000000, which is large enough for the application.
With a 64-bit Solaris SPARC kernel, the 32-bit application address limit is 0xFFC00000, which is nearly 4 Gbytes. The 64-bit application address limit is 0xFFFFFFFF.80000000, which is also large enough for the application.
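If you are not sure which of these cases applies on a particular machine, two standard commands can tell you (an illustrative addition): isainfo -kv reports whether the running kernel is 32-bit or 64-bit, and file(1) reports whether a given executable is a 32-bit or 64-bit ELF binary.
-bash-3.00$ isainfo -kv
-bash-3.00$ file /usr/bin/ls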
For application developers who need to know how an application uses memory, a useful tool is the pmap
utility on Solaris. The pmap
tool displays information about the address space of a process. Here is an example of a process on Solaris SPARC.
-bash-3.00$ pmap 8283|more
8283: ./shm_test 10
00010000 8K r-x-- /export/home/lxf/work/shm/shm_test
00020000 8K rwx-- /export/home/lxf/work/shm/shm_test
00022000 19536K rwx-- [ heap ]
FD800000 9768K rwxsR [ ism shmid=0xf ]
FE8FA000 8K rwx-R [ stack tid=11 ]
FE9FA000 8K rwx-R [ stack tid=10 ]
FEAFA000 8K rwx-R [ stack tid=9 ]
FEBFA000 8K rwx-R [ stack tid=8 ]
FECFA000 8K rwx-R [ stack tid=7 ]
FEDFA000 8K rwx-R [ stack tid=6 ]
FEEFA000 8K rwx-R [ stack tid=5 ]
FEFFA000 8K rwx-R [ stack tid=4 ]
FF0FA000 8K rwx-R [ stack tid=3 ]
FF1FA000 8K rwx-R [ stack tid=2 ]
FF220000 64K rwx-- [ anon ]
FF240000 64K rw--- [ anon ]
FF260000 64K rw--- [ anon ]
FF280000 888K r-x-- /lib/libc.so.1
FF36E000 32K rwx-- /lib/libc.so.1
FF376000 8K rwx-- /lib/libc.so.1
FF380000 8K rwx-- [ anon ]
FF390000 24K rwx-- [ anon ]
FF3A0000 8K r-x-- /platform/sun4u-us3/lib/libc_psr.so.1
FF3A4000 8K rwxs- [ anon ]
FF3B0000 208K r-x-- /lib/ld.so.1
FF3F0000 8K r--s- dev:136,8 ino:8414
FF3F4000 8K rwx-- /lib/ld.so.1
FF3F6000 8K rwx-- /lib/ld.so.1
FFBFC000 16K rwx-- [ stack ]
total 30816K
You can see that the executable text is at the lowest addresses, followed by the heap, the thread stacks, and the libraries, with the main thread's stack at the highest address. When memory is shared with other processes, the permission flags include the character "s", as in
FD800000 9768K rwxsR [ ism shmid=0xf ]
In addition, pmap -xs
can output more information for each object (library, executable, stack, heap): the segment addresses, size, RSS (physical memory occupied), memory page size, and so forth. Note that the RSS information in the pmap
output includes physical memory shared with other processes, such as shared memory segments and executable and library text.
pmap can help answer many questions about application memory usage. For example, when an application core dumps because of an incorrectly referenced pointer, you can run pmap <corefile> and compare the result with the address space information to see which memory area the wrong pointer pointed to. Another example is choosing the attach address for shared memory with shmat(2): using a fixed value for the shmaddr argument can adversely affect performance on certain platforms because of D-cache aliasing.
One common question is how much memory a thread stack can use. For a Solaris 32-bit application, each thread stack's default address space is 1 Mbyte. That means that, by default, it is impossible to create 4000 threads in a 32-bit Solaris pthread application: the stacks alone would need close to 4 Gbytes of address space, which is the entire limit of a 32-bit application.
For a Solaris 64-bit application, each thread stack's default address space is 2 Mbytes. The default stack size can meet the application's needs in most cases. However, if an application uses large local array variables or has deep function call chains, stack overflow could still occur.
In addition, you should distinguish between address space and the amount of memory actually used. In some cases, pmap or prstat shows that a 32-bit application's process size is less than 4 Gbytes, yet the application core dumps with "out of memory", usually because there is not enough address space left to grow the heap or to create new threads. A thread stack might not use a full megabyte of memory, but it still occupies one megabyte of address space.
You can use pthread_attr_setstacksize
(3C) to change the thread stack size. For Java applications, you can use the -Xss
JVM option to change the stack size, for example, -Xss128K
. You can leave more address space for heaps by reducing the stack size. Of course, you should ensure that no stack overflows.
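For example, a multi-threaded C program could request a smaller stack like this (a minimal sketch added for illustration; the 128-Kbyte value is arbitrary and must be at least PTHREAD_STACK_MIN):

/* stack_size.c: illustrative sketch, not from the original article.
   Create a thread with a 128-Kbyte stack instead of the default. */
#include <limits.h>
#include <pthread.h>
#include <stdio.h>

static void *
worker(void *arg)
{
        /* real thread work would go here */
        return (NULL);
}

int
main(void)
{
        pthread_attr_t attr;
        pthread_t tid;
        size_t stacksize = 128 * 1024;          /* 128 Kbytes */

        if (stacksize < PTHREAD_STACK_MIN)      /* never go below the minimum */
                stacksize = PTHREAD_STACK_MIN;

        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, stacksize);

        pthread_create(&tid, &attr, worker, NULL);
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return (0);
}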
For an application with OpenMP, the default thread stack address space created for a 32-bit application is 4 Mbytes; for a 64-bit application, 8 Mbytes.
C and C++ developers must manage memory allocation and deallocation manually. The default memory allocator is in the libc library.
Libc
Note that after free() is executed, the freed space is made available for further allocation by the application and is not returned to the system. Memory is returned to the system only when the application terminates. That is why an application's process size usually never decreases. For a long-running application, the process size usually stays in a stable state because the freed memory can be reused. If that is not the case, the application is most likely leaking memory: memory is allocated and used, but never freed when it is no longer needed, and the pointer to it is no longer tracked by the application, basically lost.
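The following small test program (an illustrative addition, not from the original article) makes this easy to observe with pmap: the heap segment grows when the block is allocated and touched, but its size does not shrink after free().

/* heap_size.c: illustrative sketch. Run it, then run "pmap <pid>" during
   each sleep() and compare the size of the [ heap ] segment. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SIZE    (64 * 1024 * 1024)      /* 64 Mbytes */

int
main(void)
{
        char *p = malloc(SIZE);

        memset(p, 1, SIZE);             /* touch the pages so they are really allocated */
        printf("pid %ld: block allocated, run pmap now\n", (long)getpid());
        sleep(30);

        free(p);                        /* space becomes reusable, but is not returned */
        printf("block freed, run pmap again: the heap is still the same size\n");
        sleep(30);
        return (0);
}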
The default memory allocator in libc is not good for multi-threaded applications in which concurrent malloc and free operations occur frequently, especially for multi-threaded C++ applications, because creating and destroying objects is part of the C++ development style. With the default libc allocator, the heap is protected by a single heap lock, so the allocator does not scale for multi-threaded applications because of heavy lock contention during malloc and free operations. It is easy to detect this problem with Solaris tools, as follows.
First, use prstat -mL -p
<process id> to see if the application spends much time on locks; look at the LCK
column. For example:
-bash-3.2# prstat -mL -p 14052
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
14052 root 0.6 0.7 0.0 0.0 0.0 35 0.0 64 245 13 841 0 test_vector_/721
14052 root 1.0 0.0 0.0 0.0 0.0 35 0.0 64 287 5 731 0 test_vector_/941
14052 root 1.0 0.0 0.0 0.0 0.0 35 0.0 64 298 3 680 0 test_vector_/181
14052 root 1.0 0.1 0.0 0.0 0.0 35 0.0 64 298 3 1K 0 test_vector_/549
....
The output shows that the application spends about 35 percent of its time waiting for locks.
Then, using the plockstat
(1M) tool, find what locks the application is waiting for. For example, trace the application for 5 seconds with process ID 14052, and then filter the output with the c++filt
utility for demangling C++ symbol names. (The c++filt
utility is provided with the Sun Studio software.) Filtering through c++filt
is not needed if the application is not a C++ application.
-bash-3.2# plockstat -e 5 -p 14052 | c++filt
Mutex block
Count nsec Lock Caller
-------------------------------------------------------------------------------
9678 166540561 libc.so.1`libc_malloc_lock   libCrun.so.1`void operator delete(void*)+0x26
5530 197179848 libc.so.1`libc_malloc_lock   libCrun.so.1`void*operator new(unsigned)+0x38
......
From the preceding, you can see that the heap-lock libc_malloc_lock
is heavily contended and is a likely cause of the scaling issue. The solution to this scaling problem with the libc allocator is to use an improved memory allocator such as the libumem library.
Libumem
Libumem is a user-space port of the Solaris kernel memory allocator and was introduced in Solaris 9 Update 3. Libumem is scalable and can be used in both development and production environments. It provides high-performance concurrent malloc
and free
operations. In addition, libumem includes powerful debugging features that can be used for checking memory leaks and other illegal memory uses.
Some micro-benchmarks compare libumem versus libc with respect to malloc
and free
performance. In the micro-benchmark, the multi-threaded mtmalloc test allocates and frees blocks of memory of varying sizes, repeating the process for a fixed number of iterations. The result shows that with 10 threads calling malloc
or free
, the performance with libumem is about 5 or 6 times better than the performance with the default allocator in libc. In some real-world applications, using libumem can improve performance for C++ multi-threaded applications from 30 percent to 5 times, depending on workload characteristics and the number of CPUs and threads.
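The benchmark itself is not reproduced in this article, but the following minimal sketch (an illustrative addition) shows the general shape of such a test: each thread repeatedly allocates and frees blocks of varying sizes. Timing one run with the libc allocator and one run with libumem preloaded (as described below) makes the difference easy to measure.

/* alloc_bench.c: illustrative sketch of a malloc/free micro-benchmark.
   Build:  cc -mt -O -o alloc_bench alloc_bench.c -lpthread
   Run:    ./alloc_bench 10      (default libc allocator)
           then preload libumem as described below and run it again */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define ITERATIONS      100000
#define NBLOCKS         64

static void *
worker(void *arg)
{
        void *blocks[NBLOCKS];
        int i, j;

        for (i = 0; i < ITERATIONS; i++) {
                for (j = 0; j < NBLOCKS; j++)   /* blocks from 16 bytes to 2 Kbytes */
                        blocks[j] = malloc((size_t)(16 << (j % 8)));
                for (j = 0; j < NBLOCKS; j++)
                        free(blocks[j]);
        }
        return (NULL);
}

int
main(int argc, char **argv)
{
        int nthreads = (argc > 1) ? atoi(argv[1]) : 10;
        pthread_t *tids = malloc(nthreads * sizeof (pthread_t));
        int i;

        for (i = 0; i < nthreads; i++)
                pthread_create(&tids[i], NULL, worker, NULL);
        for (i = 0; i < nthreads; i++)
                pthread_join(tids[i], NULL);
        free(tids);
        return (0);
}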
You can use libumem in two ways:
Set the appropriate LD_PRELOAD environment variable before starting the application.
For 32-bit applications:
LD_PRELOAD_32=/usr/lib/libumem.so; export LD_PRELOAD_32
For 64-bit applications:
LD_PRELOAD_64=/usr/lib/64/libumem.so; export LD_PRELOAD_64
Link the application with -lumem
at compile time.
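For example, the link step might look like this (Sun Studio compiler drivers; the source file names are hypothetical):
-bash-3.2$ cc -mt -o myapp myapp.c -lumem
-bash-3.2$ CC -mt -o myapp myapp.cc -lumem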
A common problem in using libumem is that the application can core-dump under libumem even though it runs well with libc. The reason is that libumem has an internal audit mechanism by design; an application that runs well under libumem is managing its memory correctly. The unexpected core dumps are typically caused by incorrect use of free(), such as freeing memory that was not allocated by malloc() or freeing the same memory twice.
Debugging memory leaks and illegal memory uses is beyond the scope of this article. Refer to the man page for libumem
(3LIB) for details.
In addition, note that you should not mix memory allocators in the same process, for example, allocating memory with libc's malloc() and freeing it with libumem or another allocator. The behavior is unpredictable.
Another observation when using libumem is that the process size is slightly larger than when libc is used. This is normal because libumem has sophisticated internal cache mechanisms. It should not be a real problem; in fact, by design libumem is very efficient and controls memory fragmentation well, both for small allocations and when freeing memory.
Memory management has many advanced topics. This article discussed one of them: the basics of application memory management on Solaris, such as observing system memory and swap usage, understanding tmpfs and ramdisk, learning about application memory layout and address limits, and using libumem to improve memory allocation performance. This knowledge can help developers determine when the system is running low on memory, how much memory the application can use, how to adjust stack size, how to use libumem to improve C++ application performance, why the application could possibly core-dump with libumem, and more.
For a deeper insight into Solaris memory management, read the book Solaris Internals (2nd Edition).
Many thanks to Sun principal engineer Pallab Bhattacharya for reviewing the article and giving valuable comments.