cgmemtime measures the high-water RSS+CACHE memory usage of a process and its descendant processes.
To be able to do so it puts the process into its own cgroup.
For example, process A allocates 10 MiB and forks a child B that allocates 20 MiB, which in turn forks a child C that allocates 30 MiB. All three processes share a time window in which their allocations are resident, i.e. contribute to RSS (resident set size) at the same time.
The question now is: How much memory is actually used as a result of running A?
Answer: 60 MiB
cgmemtime is the tool to answer such questions.
(It also measures the runtime.)
Date: 2013-08-23
Before running cgmemtime the first time one has to set up a hierarchy under /sys/fs/cgroup:
$ sudo ./cgmemtime --setup -g myusergroup --perm 775
Which creates by default:
/sys/fs/cgroup/memory/cgmemtime
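Under the hood this roughly corresponds to creating a directory under the memory controller and handing it over to the specified group - a sketch, assuming the default mount point; the exact operations performed by --setup may differ in detail:
# mkdir -p /sys/fs/cgroup/memory/cgmemtime
# chgrp -R myusergroup /sys/fs/cgroup/memory/cgmemtime
# chmod -R 775 /sys/fs/cgroup/memory/cgmemtime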
Now you can use cgmemtime like this:
$ ./cgmemtime ./testa x 10 20 30
[..]
Child high-water RSS : 10720 KiB
Recursive and accumulated high-water RSS: 61824 KiB
Or to produce machine-readable output:
$ ./cgmemtime -t ./testa x 10 20 30
It also has some options (cf. -h).
cgmemtime runs on a Linux system that comes with cgroups support. For example Fedora 17 comes with cgroups (Control Groups) enabled by default. Every system using systemd has cgroups support.
Ubuntu 10.04 LTS, for example, does not have cgroups, but 12.04 should have them. RHEL/CentOS should provide cgroups support since version 6.
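A quick way to check whether the kernel and distribution provide the cgroup memory controller (independent of cgmemtime; either command should mention 'memory'):
$ grep memory /proc/cgroups
$ mount -t cgroup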
Other than that you need a C compiler, GNU make and the usual development headers.
Just:
$ make
Which creates cgmemtime and testa. testa is a small forking allocation test program.
There is a shell script that contains some test cases. After setting up the cgroup hierarchy via
$ sudo ./cgmemtime --setup -g myusergroup --perm 775
you can run the test suite:
$ bash test.sh
The child number and the accumulated number come from different subsystems in the kernel, which apparently make slightly different trade-offs/approximations when accounting the RSS of a process.
A simple test case:
$ ./cgmemtime python -c 'import time; import os; print os.getpid(); time.sleep(300)'
24131
Child high-water RSS : 6296 KiB
Recursive and acc. high-water RSS+CACHE : 2724 KiB
The first number is consistent with what GNU time (/usr/bin/time) reports. With both GNU time and cgmemtime this number does not come from the cgroups subsystem.
You can also approximate it with something like:
$ awk '/Rss:/{ sum += $2 } END { print sum }' /proc/24131/smaps
6388
The 2nd number comes from the cgroups subsystem. You can approximate it by excluding some shared library mappings, e.g.:
$ grep '^[0-9a-f]\|Rss:' /proc/24131/smaps | tr -d '\n' \
| sed 's/ kB/ kB\n/g' | grep -v '.so' | sed 's/^.*Rss://' \
| awk '{a+=$1} END {print a}'
2760
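The same exclusion can also be expressed as a single awk program (just a sketch of the same idea - skip every mapping whose header line mentions a .so file - which should yield roughly the same number as the pipeline above):
$ awk '/^[0-9a-f].*\.so/ { skip = 1; next }
       /^[0-9a-f]/ { skip = 0 }
       /^Rss:/ && !skip { sum += $2 }
       END { print sum }' /proc/24131/smaps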
Hypothesis: cgroups doesn't account for the shared library mappings and the effect is easy to demonstrate with Python because it loads such a large number of shared libraries.
Don't hesitate to mail feedback (comments, questions, ...) to:
Georg Sauthoff <[email protected]>
The reported high-water RSS+CACHE usage values are as accurate as the
usage_in_bytes
value exported by the cgroup memory resource
controller.
The kernel documentation states:
For efficiency, as other kernel components, memory cgroup uses some optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the method and doesn't show 'exact' value of memory(and swap) usage, it's an fuzz value for efficient access. (Of course, when necessary, it's synchronized.) If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) value in memory.stat(see 5.2).
We can't use memory.stat because it does not include high-water memory usage information.
Doing some tests with e.g. ./testa, the reported values seem to be accurate enough, though.
The usage_in_bytes measure reports the sum of RSS and CACHE usage. Thus, you can't measure the high-water RSS-without-CACHE usage. In a program that does a lot of I/O the CACHE part then dominates the high-water RSS+CACHE value.
For example:
$ cgmemtime dd if=test.img | dd of=out
# vs.
$ cgmemtime dd if=test.img of=out
(for a large test.img the 2nd command has a large RSS+CACHE high-water value)
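To reproduce this, test.img just has to be a sufficiently large file, e.g. (size picked arbitrarily):
$ dd if=/dev/zero of=test.img bs=1M count=1024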
Currently, I am not aware of a cgroup way to just derive the RSS-only high-water mark.
Setup new cgroup (as root):
# cgcreate -t juser:juser -g memory:/juser-cgroup
No task should be part of that cgroup in the beginning:
$ cat /sys/fs/cgroup/memory/juser-cgroup/tasks
High-water RSS+CACHE usage - should be 0:
$ cat /sys/fs/cgroup/memory/juser-cgroup/memory.max_usage_in_bytes
Should report more accurate measurements - but does not include high-water marks:
$ cat /sys/fs/cgroup/memory/juser-cgroup/memory.stat
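The RSS+CACHE(+SWAP) value the kernel documentation recommends can be derived from it, for example like this (a sketch; the swap line only appears with swap accounting enabled, and the values are in bytes):
$ awk '/^(rss|cache|swap) / { sum += $2 } END { print sum }' \
    /sys/fs/cgroup/memory/juser-cgroup/memory.stat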
Current RSS+CACHE usage in that group:
$ cat /sys/fs/cgroup/memory/juser-cgroup/memory.usage_in_bytes
Add new task to the group:
$ cgexec -g memory:/juser-cgroup ./testa c 10 20 30 40
Should report about 100 MiB (because ./testa forks 3 times and the processes allocate different amounts of memory, i.e. 10, 20, 30 and 40 MiB, at the same time):
$ cat /sys/fs/cgroup/memory/juser-cgroup/memory.max_usage_in_bytes
Resets the high-water mark:
# echo 0 > /sys/fs/cgroup/memory/juser-cgroup/memory.max_usage_in_bytes
New reset value. It is not reset to exactly 0 - the kernel documentation mentions fuzz due to optimizations of memory.usage_in_bytes.
$ cat /sys/fs/cgroup/memory/juser-cgroup/memory.max_usage_in_bytes
And indeed, the above value should now equal this one:
$ cat /sys/fs/cgroup/memory/juser-cgroup/memory.usage_in_bytes
Since the cgroup does not have any tasks now, we can use:
# echo 0 > /sys/fs/cgroup/memory/juser-cgroup/memory.force_empty
Now both values should be 0:
$ cat /sys/fs/cgroup/memory/juser-cgroup/memory.max_usage_in_bytes
$ cat /sys/fs/cgroup/memory/juser-cgroup/memory.usage_in_bytes
To remove the cgroup again:
# cgdelete memory:/juser-cgroup
Note that all the cg* commands can be replaced with combinations of mkdir/chmod/chown/echo commands that manipulate the filesystem under /sys/fs/cgroup/memory/.
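For example, the walkthrough above could also be done roughly like this (a sketch using the same names):
# mkdir /sys/fs/cgroup/memory/juser-cgroup
# chown -R juser:juser /sys/fs/cgroup/memory/juser-cgroup
$ echo $$ > /sys/fs/cgroup/memory/juser-cgroup/tasks    # instead of cgexec: move the current shell (and thus its children) into the group
$ ./testa c 10 20 30 40
$ cat /sys/fs/cgroup/memory/juser-cgroup/memory.max_usage_in_bytes
# rmdir /sys/fs/cgroup/memory/juser-cgroup              # instead of cgdelete - works once no task is left in the group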
There are also other tools available which measure memory usage of processes. One way to categorize them is a two-fold classification: tools that use polling and tools that don't.
In that context - when you are only interested in the high-water usage - polling is the inferior approach. As described in previous sections, cgmemtime does not use polling. At the time of writing, I am not aware of any other tool that uses Linux Control Groups for memory measurements.
- GNU time - uses something like wait4() or waitpid() and getrusage(), thus, on systems where these are available, it is able to display the high-water RSS usage of a single child process when using the verbose mode.
- tstime - uses the taskstats API of the Linux kernel to get the high-water RSS and the high-water VMEM usage of a child. It does not follow descendant processes. It also provides a process monitor mode that displays stats for all exiting processes. But the taskstats API is kind of cumbersome to use and on current kernels only accessible as root.
- smem - Tool written in Python that analyses proc files like /proc/$$/smaps and generates a memory usage report of one or multiple processes for one point in time. It is designed to provide a system-wide view, but one can also filter processes (or even loaded libraries) by various criteria. Smem distributes shared memory between all dependent processes (the result is called the proportional set size - PSS - of a process). It does not take swapped-out memory into account.
- memtime - Uses polling of /proc/$PID/stat to measure high-water RSS/VMEM usage of a child. It supports Linux and Solaris styles of /proc. Polling is in general a sub-optimal solution (e.g. short-running processes are not accurately measured, it wastes resources etc.; see the sketch after this list). memtime is not maintained and has 64 bit issues (last release 2002).
- tmem - Polls /proc/$PID/status, thus has access to more detailed memory measures, e.g. VmPeak, VmSize, VmLck, VmPin, VmHWM, VmRSS, VmData, VmStk, VmExe, VmLib, VmPTE and VmSwap.
- memusg - Python script that polls the VmSize values of a group of processes via the command ps and displays its high-water mark. That means that it forks/execs ps and parses its output 10 times a second. For a given command line it creates a new session (via setsid()) and executes it in that session. Thus, children of the watched process are likely part of that session, too. Memusg then sums up the VmSize value of each process of that session and returns the maximum when the session leader exits. Note that this method is not reliable, because child processes may still be alive after the session leader has exited and they may also create new sessions during their runtime, thus escaping the measurement via memusg.
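For illustration only, a minimal polling loop in the spirit of the last three tools - a sketch, not part of cgmemtime; it samples VmRSS of a single process, so it misses descendants and any allocation spike that happens between two samples:
pid=$1; max=0
while kill -0 "$pid" 2>/dev/null; do
    # read the current RSS (in KiB) from the status file
    rss=$(awk '/^VmRSS:/ { print $2 }' /proc/"$pid"/status 2>/dev/null)
    [ -n "$rss" ] && [ "$rss" -gt "$max" ] && max=$rss
    sleep 0.1
done
echo "polled high-water RSS (single process only): $max KiB"
Saved as e.g. poll.sh (a hypothetical name) it can be run like:
$ ./testa x 10 20 30 & sh poll.sh $!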