by Xinfeng Liu, July 2009
For application developers, memory management is a key topic because memory is a critical resource for running programs. Each operating system has its own memory management mechanism, so understanding how Solaris manages memory helps application developers use memory resources well and avoid or resolve performance problems.
This article discusses how to observe memory usage on Solaris, how application memory is laid out on Solaris and how to use the libumem memory allocator to improve memory allocation efficiency. These are the basics for managing application memory on Solaris and resolving common memory usage problems.
Note: Debugging core dumps from incorrectly referenced memory and debugging memory leaks are beyond the scope of this article.
On several occasions, application developers have asked how Solaris manages memory and how to interpret various memory-related statistics: for example, how to tell whether the system is running low on physical memory, or how much memory an application can actually use.
Addressing such questions requires detailed information about application memory management on Solaris. This article describes those details, first at the system level and then at the application level.
A basic knowledge of the Solaris Operating System is assumed.
As a developer, you need to understand memory usage at the system level, including physical memory, swap, and tmpfs. Solaris provides several commands for observing memory usage at the system level.
To find how much physical memory is installed on the system, use the prtconf
command in Solaris. For example:
-bash-3.00# prtconf|grep Memory
Memory size: 1024 Megabytes
To find how much free memory is currently available in the system, use the vmstat
command. Look at the free
column (the unit is KB). For example:
-bash-3.00# vmstat 2
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr dd s1 -- -- in sy cs us sy id
0 0 0 2265520 262208 1 8 1 1 1 0 0 1 -1 0 0 309 193691 310 41 31 28
0 0 0 2175632 176072 3 8 0 0 0 0 0 0 0 0 0 326 278310 348 58 42 0
0 0 0 2175632 176120 0 0 0 0 0 0 0 0 0 0 0 305 263986 408 56 44 0
The vmstat
output shows that the system has about 176 Mbytes of free memory. In fact, on Solaris 8 or later, the free memory shown in the vmstat
output includes both the free list and the cache list. The free list is memory that is truly free: it has no association with any file or process. The cache list is also free memory; it holds the majority of the file system cache. The cache list is linked to the free list, and if the free list is exhausted, memory pages are taken from the head of the cache list.
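If you want a more detailed breakdown of physical memory, including the free list and the cache list, the ::memstat dcmd of the mdb kernel debugger can provide it (an illustrative addition; it requires root privileges, and the exact categories vary by Solaris release):
-bash-3.00# echo "::memstat" | mdb -k
The output reports how many pages are used by the kernel, by anonymous memory, by executable and library text, and by the page cache, and how many pages are on the cache list and the free list.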
To determine if the system is running low on physical memory, look at the sr
column in the vmstat
output, where sr
means scan rate. Under low memory, Solaris begins to scan for memory pages that have not been accessed recently and moves them to the free list. On Solaris 8 or later, a non-zero value of sr
means the system is running low on physical memory.
You can also use vmstat -p
to observe page in, page out, and page free activities. For example:
-bash-3.2# vmstat -p 2
     memory           page          executable      anonymous      filesystem
   swap  free  re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
1954016 1174028 2 19 4 2036 24 0 0 0 2 4 4 0 0 0
395300 32860 25 242 1089 1656 596 0 0 0 557 1089 1089 0 0 0
394752 32468 10 211 560 1344 380 0 0 0 572 560 560 0 0 0
394184 32820 18 241 1176 1092 1002 0 0 0 642 1176 1176 0 0 0
393712 33760 15 207 1806 888 4256 0 0 4 570 1792 1792 0 0 10
These page activities are categorized as executable ( epi, epo, epf
), anonymous ( api, apo, apf
), and filesystem ( fpi, fpo, fpf
). Executable memory means memory pages used for program and library text. Anonymous memory means memory pages not associated with files; for example, anonymous memory is used for process heaps and stacks. Filesystem memory means memory pages used for file I/O. When process pages are being swapped in or out, you will see large numbers in the api and apo columns. When the system is busy reading files from or writing files to the file system, you will see large numbers in the fpi and fpo columns. Paging activity is not necessarily bad, but constantly paging pages out and bringing new pages in, especially when free memory is low, is bad for performance.
To find what swap devices are configured in the system, use the swap -l
command. For example:
-bash-3.00# swap -l
swapfile dev swaplo blocks free
/dev/dsk/c0t0d0s1 136,9 16 4194224 4194224
To observe swap space usage, use the swap -s
command. For example:
-bash-3.00# swap -s
total: 487152k bytes allocated + 104576k reserved = 591728k used, 2175608k available
Note that the available swap space in the output of swap -s
command includes the amount of free physical memory.
On Solaris, you can, on the fly, remove swap devices with the swap -d
command or add them with the swap -a
command. Such changes do not persist across reboots. To make the changes persistent, modify /etc/vfstab
by adding or removing entries for the swap devices.
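For example, the following adds a swap slice on the fly and shows the /etc/vfstab entry that would make it persistent across reboots (the device name c0t0d0s3 is only illustrative):
-bash-3.00# swap -a /dev/dsk/c0t0d0s3

#device            device     mount   FS      fsck    mount   mount
#to mount          to fsck    point   type    pass    at boot options
/dev/dsk/c0t0d0s3  -          -       swap    -       no      -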
High swap-space usage does not necessarily mean the system needs additional physical memory or that such usage is the reason for bad performance. High swapping in and out activities (observable with vmstat -p
) can lead to performance problems: processes must wait for the swapping activity to finish before they can continue running. Moreover, swapping is a single-threaded activity.
In some cases, you must also pay attention to the available swap space. For example, if the system runs hundreds or even thousands of Oracle session processes or Apache processes, each process needs to reserve or allocate some swap space. In such cases, you must configure an adequately sized swap device or add multiple swap devices.
One difference between Solaris and other operating systems is /tmp
, which is a nonpersistent, memory-based file system on Solaris (tmpfs). Tmpfs is designed for the situation in which a large number of short-lived files (like PHP sessions) need to be written and accessed on a fast file system. You can also create your own tmpfs file system and specify the size. See the man page for mount_tmpfs
(1M).
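For example, a size-limited tmpfs file system could be created like this (the mount point and size are only illustrative):
-bash-3.00# mkdir /scratch
-bash-3.00# mount -F tmpfs -o size=512m swap /scratch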
Solaris also provides a ramdisk facility. You can create a ramdisk with ramdiskadm
(1M) as a block device. The ramdisk uses physical memory only. By default, at most 25 percent of available physical memory can be allocated to ramdisks. The tmpfs file system uses virtual memory resources that include physical memory and swap space.
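For example, a small ramdisk could be created and given a UFS file system as follows (the name and size are only illustrative):
-bash-3.00# ramdiskadm -a rd1 64m
-bash-3.00# newfs /dev/rramdisk/rd1
-bash-3.00# mount /dev/ramdisk/rd1 /mnt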
Large-sized files placed in tmpfs can affect the amount of memory space left over for program execution. Likewise, programs requiring large amounts of memory use up the space available to tmpfs. If you encounter this constraint (for example, running out of space on tmpfs), you can allocate more swap space by using the swap
(1M) command. Note, however, that adding swap space only relieves the space constraint; actual swapping activity still indicates a shortage of physical memory and hurts performance even when swap space is plentiful.
Before attempting to resolve application memory problems, developers should first learn about address spaces. We briefly discuss that subject next.
Each operating system has its own definition and break-up of the address spaces. Address space information for Solaris can be found from Solaris source code comments; since Solaris is an open-source operating system, this information is publicly available.
Note: All addresses mentioned in this article refer to virtual addresses, not physical addresses. Mapping virtual addresses to physical addresses is the responsibility of the operating system and the processor MMU (Memory Management Unit).
Solaris x86:
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/os/startup.c
......
              32-bit Kernel's Virtual Memory Layout.
              +-----------------------+
              |                       |
 0xFFC00000  -|-----------------------|- ARGSBASE
              |        debugger       |
 0xFF800000  -|-----------------------|- SEGDEBUGBASE
              |      Kernel Data      |
 0xFEC00000  -|-----------------------|
              |      Kernel Text      |
 0xFE800000  -|-----------------------|- KERNEL_TEXT (0xFB400000 on Xen)
              |---       GDT       ---|- GDT page (GDT_VA)
              |---    debug info   ---|- debug info (DEBUG_INFO_VA)
              |                       |
              |   page_t structures   |
              |   memsegs, memlists,  |
              |   page hash, etc.     |
        ---  -|-----------------------|- ekernelheap, valloc_base (floating)
              |                       |  (segkp is just an arena in the heap)
              |                       |
              |         kvseg         |
              |                       |
              |                       |
        ---  -|-----------------------|- kernelheap (floating)
              |        Segkmap        |
 0xC3002000  -|-----------------------|- segmap_start (floating)
              |        Red Zone       |
 0xC3000000  -|-----------------------|- kernelbase / userlimit (floating)
              |                       |   ||
              |     Shared objects    |   \/
              |                       |
              :                       :
              |       user data       |
              |-----------------------|
              |       user text       |
 0x08048000  -|-----------------------|
              |       user stack      |
              :                       :
              |        invalid        |
 0x00000000   +-----------------------+
......
                      64-bit Kernel's Virtual Memory Layout. (assuming 64 bit app)
                      +-----------------------+
                      |                       |
 0xFFFFFFFF.FFC00000  |-----------------------|- ARGSBASE
                      |     debugger (?)      |
 0xFFFFFFFF.FF800000  |-----------------------|- SEGDEBUGBASE
                      |        unused         |
                      +-----------------------+
                      |      Kernel Data      |
 0xFFFFFFFF.FBC00000  |-----------------------|
                      |      Kernel Text      |
 0xFFFFFFFF.FB800000  |-----------------------|- KERNEL_TEXT
                      |---       GDT       ---|- GDT page (GDT_VA)
                      |---    debug info   ---|- debug info (DEBUG_INFO_VA)
                      |                       |
                      |       Core heap       | (used for loadable modules)
 0xFFFFFFFF.C0000000  |-----------------------|- core_base / ekernelheap
                      |        Kernel         |
                      |         heap          |
 0xFFFFFXXX.XXX00000  |-----------------------|- kernelheap (floating)
                      |        segmap         |
 0xFFFFFXXX.XXX00000  |-----------------------|- segmap_start (floating)
                      |    device mappings    |
 0xFFFFFXXX.XXX00000  |-----------------------|- toxic_addr (floating)
                      |        segzio         |
 0xFFFFFXXX.XXX00000  |-----------------------|- segzio_base (floating)
                      |         segkp         |
                 ---  |-----------------------|- segkp_base (floating)
                      |   page_t structures   |  valloc_base + valloc_sz
                      |   memsegs, memlists,  |
                      |   page hash, etc.     |
 0xFFFFFF00.00000000  |-----------------------|- valloc_base (lower if > 1TB)
                      |        segkpm         |
 0xFFFFFE00.00000000  |-----------------------|
                      |        Red Zone       |
 0xFFFFFD80.00000000  |-----------------------|- KERNELBASE (lower if > 1TB)
                      |       User stack      |- User space memory
                      |                       |
                      |  shared objects, etc  |  (grows downwards)
                      :                       :
                      |                       |
 0xFFFF8000.00000000  |-----------------------|
                      |                       |
                      |   VA Hole / unused    |
                      |                       |
 0x00008000.00000000  |-----------------------|
                      |                       |
                      |                       |
                      :                       :
                      |       user heap       |  (grows upwards)
                      |                       |
                      |       user data       |
                      |-----------------------|
                      |       user text       |
 0x00000000.04000000  |-----------------------|
                      |        invalid        |
 0x00000000.00000000  +-----------------------+

A 32 bit app on the 64 bit kernel sees the same layout as on the 32 bit kernel, except that userlimit is raised to 0xfe000000.
Solaris SPARC:
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/sun4/os/startup.c
The address space for the user-level part of the code is extracted below:
......
              32-bit User Virtual Memory Layout.
              /-----------------------\
              |                       |
              |        invalid        |
              |                       |
 0xFFC00000  -|-----------------------|- USERLIMIT
              |       user stack      |
              :                       :
              :                       :
              :                       :
              |       user data       |
             -|-----------------------|-
              |       user text       |
 0x00002000  -|-----------------------|-
              |        invalid        |
 0x00000000  _|_______________________|

                       64-bit User Virtual Memory Layout.
                       /-----------------------\
                       |                       |
                       |        invalid        |
                       |                       |
 0xFFFFFFFF.80000000  -|-----------------------|- USERLIMIT
                       |       user stack      |
                       :                       :
                       :                       :
                       :                       :
                       |       user data       |
                      -|-----------------------|-
                       |       user text       |
 0x00000000.01000000  -|-----------------------|-
                       |        invalid        |
 0x00000000.00000000  _|_______________________|
Note that the address space layout differs according to the hardware platform (x86 or SPARC) and whether the kernel is 32-bit or 64-bit. A point worth noting is that under a 32-bit Solaris x86 kernel, the kernel and user space share the same 32-bit address space, which is limited to 4 Gbytes (2^32 bytes is 4 Gbytes, the maximum a 32-bit address can reach). The kernel base (the user limit) is a floating address, so the user-space limit differs on systems with different amounts of physical memory. On a system with more physical memory, the kernel needs more address space for its own data structures, so an application can use less address space (possibly less than 2 Gbytes) under a 32-bit Solaris x86 kernel.
Fortunately, nowadays, even on Intel and AMD platforms, 64-bit processors are dominant. With a 64-bit Solaris x86 kernel (we call it Solaris x64), the 32-bit application address limit is 0xFE000000, which is nearly 4 Gbytes. The 64-bit application address limit is 0xFFFFFD80.00000000, which is large enough for the application.
With a 64-bit Solaris SPARC kernel, the 32-bit application address limit is 0xFFC00000, which is nearly 4 Gbytes. The 64-bit application address limit is 0xFFFFFFFF.80000000, which is also large enough for the application.
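If you are not sure which of these cases applies on a particular machine, two standard commands can tell you (an illustrative addition): isainfo -kv reports whether the running kernel is 32-bit or 64-bit, and file(1) reports whether a given executable is a 32-bit or 64-bit ELF binary.
-bash-3.00$ isainfo -kv
-bash-3.00$ file /usr/bin/ls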
For application developers who need to know how an application uses memory, a useful tool is the pmap
utility on Solaris. The pmap
tool displays information about the address space of a process. Here is an example of a process on Solaris SPARC.
-bash-3.00$ pmap 8283|more
8283: ./shm_test 10
00010000 8K r-x-- /export/home/lxf/work/shm/shm_test
00020000 8K rwx-- /export/home/lxf/work/shm/shm_test
00022000 19536K rwx-- [ heap ]
FD800000 9768K rwxsR [ ism shmid=0xf ]
FE8FA000 8K rwx-R [ stack tid=11 ]
FE9FA000 8K rwx-R [ stack tid=10 ]
FEAFA000 8K rwx-R [ stack tid=9 ]
FEBFA000 8K rwx-R [ stack tid=8 ]
FECFA000 8K rwx-R [ stack tid=7 ]
FEDFA000 8K rwx-R [ stack tid=6 ]
FEEFA000 8K rwx-R [ stack tid=5 ]
FEFFA000 8K rwx-R [ stack tid=4 ]
FF0FA000 8K rwx-R [ stack tid=3 ]
FF1FA000 8K rwx-R [ stack tid=2 ]
FF220000 64K rwx-- [ anon ]
FF240000 64K rw--- [ anon ]
FF260000 64K rw--- [ anon ]
FF280000 888K r-x-- /lib/libc.so.1
FF36E000 32K rwx-- /lib/libc.so.1
FF376000 8K rwx-- /lib/libc.so.1
FF380000 8K rwx-- [ anon ]
FF390000 24K rwx-- [ anon ]
FF3A0000 8K r-x-- /platform/sun4u-us3/lib/libc_psr.so.1
FF3A4000 8K rwxs- [ anon ]
FF3B0000 208K r-x-- /lib/ld.so.1
FF3F0000 8K r--s- dev:136,8 ino:8414
FF3F4000 8K rwx-- /lib/ld.so.1
FF3F6000 8K rwx-- /lib/ld.so.1
FFBFC000 16K rwx-- [ stack ]
total 30816K
You can see that the executable text is at the lowest addresses, followed by the heap, the thread stacks, and the libraries, with the main thread's stack at the highest address. When memory is shared with other processes, the permission flags include the character "s", as in
FD800000 9768K rwxsR [ ism shmid=0xf ]
In addition, pmap -xs
can output more information for each object (library, executable, stack, heap): the segment addresses, size, RSS (physical memory occupied), memory page size, and so forth. Note that the RSS information in the pmap
output includes physical memory shared with other processes, such as shared memory segments and executable and library text.
pmap can help answer many questions about application memory usage. For example, when an application core dumps because of an incorrectly referenced pointer, you can run pmap <corefile> and compare the result with the address space information to see which memory area the wrong pointer pointed to. Another example is choosing the attach address for shared memory with shmat(2): using a fixed value for the shmaddr argument can adversely affect performance on certain platforms because of D-cache aliasing.
One common question is how much memory a thread stack can use. For a Solaris 32-bit application, each thread stack's default address space is 1 Mbyte. That means that, by default, it is impossible to create 4000 threads in a 32-bit Solaris pthread application: the stacks alone would need close to 4 Gbytes of address space, which is the entire limit of a 32-bit application.
For a Solaris 64-bit application, each thread stack's default address space is 2 Mbytes. The default stack size can meet the application's needs in most cases. However, if an application uses large local array variables or has deep function call chains, stack overflow could still occur.
In addition, you should distinguish between address space and the amount of memory actually used. In some cases, pmap or prstat shows that a 32-bit application's process size is less than 4 Gbytes, yet the application core dumps with "out of memory", usually because there is not enough address space left to grow the heap or to create new threads. A thread stack might not use a full megabyte of memory, but it still occupies one megabyte of address space.
You can use pthread_attr_setstacksize
(3C) to change the thread stack size. For Java applications, you can use the -Xss
JVM option to change the stack size, for example, -Xss128K
. You can leave more address space for heaps by reducing the stack size. Of course, you should ensure that no stack overflows.
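For example, a multi-threaded C program could request a smaller stack like this (a minimal sketch added for illustration; the 128-Kbyte value is arbitrary and must be at least PTHREAD_STACK_MIN):

/* stack_size.c: illustrative sketch, not from the original article.
   Create a thread with a 128-Kbyte stack instead of the default. */
#include <limits.h>
#include <pthread.h>
#include <stdio.h>

static void *
worker(void *arg)
{
        /* real thread work would go here */
        return (NULL);
}

int
main(void)
{
        pthread_attr_t attr;
        pthread_t tid;
        size_t stacksize = 128 * 1024;          /* 128 Kbytes */

        if (stacksize < PTHREAD_STACK_MIN)      /* never go below the minimum */
                stacksize = PTHREAD_STACK_MIN;

        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, stacksize);

        pthread_create(&tid, &attr, worker, NULL);
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return (0);
}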
For an application with OpenMP, the default thread stack address space created for a 32-bit application is 4 Mbytes; for a 64-bit application, 8 Mbytes.
C and C++ developers must manage memory allocation and deallocation manually. The default memory allocator is in the libc library.
Libc
Note that after free() is executed, the freed space is made available for further allocation by the application and is not returned to the system. Memory is returned to the system only when the application terminates. That is why an application's process size usually never decreases. For a long-running application, the process size usually stays in a stable state because the freed memory can be reused. If that is not the case, the application is most likely leaking memory: memory is allocated and used, but never freed when it is no longer needed, and the pointer to it is no longer tracked by the application, basically lost.
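The following small test program (an illustrative addition, not from the original article) makes this easy to observe with pmap: the heap segment grows when the block is allocated and touched, but its size does not shrink after free().

/* heap_size.c: illustrative sketch. Run it, then run "pmap <pid>" during
   each sleep() and compare the size of the [ heap ] segment. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SIZE    (64 * 1024 * 1024)      /* 64 Mbytes */

int
main(void)
{
        char *p = malloc(SIZE);

        memset(p, 1, SIZE);             /* touch the pages so they are really allocated */
        printf("pid %ld: block allocated, run pmap now\n", (long)getpid());
        sleep(30);

        free(p);                        /* space becomes reusable, but is not returned */
        printf("block freed, run pmap again: the heap is still the same size\n");
        sleep(30);
        return (0);
}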
The default memory allocator in libc is not good for multi-threaded applications in which concurrent malloc and free operations occur frequently, especially for multi-threaded C++ applications, because creating and destroying objects is part of the C++ development style. With the default libc allocator, the heap is protected by a single heap lock, so the allocator does not scale for multi-threaded applications because of heavy lock contention during malloc and free operations. It is easy to detect this problem with Solaris tools, as follows.
First, use prstat -mL -p
<process id> to see if the application spends much time on locks; look at the LCK
column. For example:
-bash-3.2# prstat -mL -p 14052
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
14052 root 0.6 0.7 0.0 0.0 0.0 35 0.0 64 245 13 841 0 test_vector_/721
14052 root 1.0 0.0 0.0 0.0 0.0 35 0.0 64 287 5 731 0 test_vector_/941
14052 root 1.0 0.0 0.0 0.0 0.0 35 0.0 64 298 3 680 0 test_vector_/181
14052 root 1.0 0.1 0.0 0.0 0.0 35 0.0 64 298 3 1K 0 test_vector_/549
....
The output shows that the application spends about 35 percent of its time waiting for locks.
Then, using the plockstat
(1M) tool, find what locks the application is waiting for. For example, trace the application for 5 seconds with process ID 14052, and then filter the output with the c++filt
utility for demangling C++ symbol names. (The c++filt
utility is provided with the Sun Studio software.) Filtering through c++filt
is not needed if the application is not a C++ application.
-bash-3.2# plockstat -e 5 -p 14052 | c++filt
Mutex block
Count nsec Lock Caller
-------------------------------------------------------------------------------
9678 166540561 libc.so.1`libc_malloc_lock   libCrun.so.1`void operator delete(void*)+0x26
5530 197179848 libc.so.1`libc_malloc_lock   libCrun.so.1`void*operator new(unsigned)+0x38
......
From the preceding, you can see that the heap-lock libc_malloc_lock
is heavily contended and is a likely cause of the scaling issue. The solution to this scaling problem with the libc allocator is to use an improved memory allocator such as the libumem library.
Libumem
Libumem is a user-space port of the Solaris kernel memory allocator and was introduced in Solaris 9 Update 3. Libumem is scalable and can be used in both development and production environments. It provides high-performance concurrent malloc
and free
operations. In addition, libumem includes powerful debugging features that can be used for checking memory leaks and other illegal memory uses.
Some micro-benchmarks compare libumem versus libc with respect to malloc
and free
performance. In the micro-benchmark, the multi-threaded mtmalloc test allocates and frees blocks of memory of varying sizes, repeating the process for a fixed number of iterations. The result shows that with 10 threads calling malloc
or free
, the performance with libumem is about 5 or 6 times better than the performance with the default allocator in libc. In some real-world applications, using libumem can improve performance for C++ multi-threaded applications from 30 percent to 5 times, depending on workload characteristics and the number of CPUs and threads.
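The benchmark itself is not reproduced in this article, but the following minimal sketch (an illustrative addition) shows the general shape of such a test: each thread repeatedly allocates and frees blocks of varying sizes. Timing one run with the libc allocator and one run with libumem preloaded (as described below) makes the difference easy to measure.

/* alloc_bench.c: illustrative sketch of a malloc/free micro-benchmark.
   Build:  cc -mt -O -o alloc_bench alloc_bench.c -lpthread
   Run:    ./alloc_bench 10      (default libc allocator)
           then preload libumem as described below and run it again */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define ITERATIONS      100000
#define NBLOCKS         64

static void *
worker(void *arg)
{
        void *blocks[NBLOCKS];
        int i, j;

        for (i = 0; i < ITERATIONS; i++) {
                for (j = 0; j < NBLOCKS; j++)   /* blocks from 16 bytes to 2 Kbytes */
                        blocks[j] = malloc((size_t)(16 << (j % 8)));
                for (j = 0; j < NBLOCKS; j++)
                        free(blocks[j]);
        }
        return (NULL);
}

int
main(int argc, char **argv)
{
        int nthreads = (argc > 1) ? atoi(argv[1]) : 10;
        pthread_t *tids = malloc(nthreads * sizeof (pthread_t));
        int i;

        for (i = 0; i < nthreads; i++)
                pthread_create(&tids[i], NULL, worker, NULL);
        for (i = 0; i < nthreads; i++)
                pthread_join(tids[i], NULL);
        free(tids);
        return (0);
}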
You can use libumem in two ways:
Set the appropriate LD_PRELOAD environment variable before starting the application.
For 32-bit applications:
LD_PRELOAD_32=/usr/lib/libumem.so; export LD_PRELOAD_32
For 64-bit applications:
LD_PRELOAD_64=/usr/lib/64/libumem.so; export LD_PRELOAD_64
Link the application with -lumem
at compile time.
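For example, the link step might look like this (Sun Studio compiler drivers; the source file names are hypothetical):
-bash-3.2$ cc -mt -o myapp myapp.c -lumem
-bash-3.2$ CC -mt -o myapp myapp.cc -lumem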
A common problem in using libumem is that the application can core-dump under libumem even though it runs well with libc. The reason is that libumem has an internal audit mechanism by design; an application that runs well under libumem is managing its memory correctly. The unexpected core dumps are typically caused by incorrect use of free(), such as freeing memory that was not allocated by malloc() or freeing the same memory twice.
Debugging memory leaks and illegal memory uses is beyond the scope of this article. Refer to the man page for libumem
(3LIB) for details.
In addition, note that you should not mix memory allocators in the same process, for example, allocating memory with libc's malloc() and freeing it with libumem or another allocator. The behavior is unpredictable.
Another observation when using libumem is that the process size is slightly larger than when libc is used. This is normal because libumem has sophisticated internal cache mechanisms. It should not be a real problem; in fact, by design libumem is very efficient and controls memory fragmentation well, both for small allocations and when freeing memory.
Memory management has many advanced topics. This article discussed one of them: the basics of application memory management on Solaris, such as observing system memory and swap usage, understanding tmpfs and ramdisk, learning about application memory layout and address limits, and using libumem to improve memory allocation performance. This knowledge can help developers determine when the system is running low on memory, how much memory the application can use, how to adjust stack size, how to use libumem to improve C++ application performance, why the application could possibly core-dump with libumem, and more.
For a deeper insight into Solaris memory management, read the book Solaris Internals (2nd Edition).
Many thanks to Sun principal engineer Pallab Bhattacharya for reviewing the article and giving valuable comments.