Three GCC Flags for Analyzing Memory Usage

I learned about three new GCC compiler/linker flags this week that are helpful for monitoring a system’s memory usage. I was surprised that I hadn’t previously run across them, and it seems like other developers in my network didn’t know about these flags either! Perhaps this isn’t too out of the ordinary considering that two of the three flags don’t seem to be listed in the GCC manual’s compendium of options.

The three flags I want to highlight are:

  • --print-memory-usage, which gives us a breakdown of the memory used in each memory region defined in the linker file. This is especially useful for embedded systems that define multiple memory regions with different space constraints.
  • -fstack-usage, which generates .su files that can be used for worst case stack analysis.
  • -Wstack-usage, which can be used to generate compiler warnings if a function’s stack frame size exceeds a specified limit.

--print-memory-usage

To get a general sense of firmware size, I typically rely on the size program, which gives us a view of the different program segment sizes:

    arm-none-eabi-size src/applications/blinky/blinky_dongle
       text    data     bss     dec     hex filename
      13512     392     500   14404    3844 src/applications/blinky/blinky_dongle

Many embedded systems that we’ve worked on define multiple memory regions with different space constraints. In its simplest form, this is typically two regions: FLASH and RAM:

    MEMORY
    {
        FLASH (rx) : ORIGIN = 0x1000, LENGTH = 0xff000
        RAM (rwx) : ORIGIN = 0x20000008, LENGTH = 0x3fff8
    }

Using the --print-memory-usage linker flag, we can see the actual size breakdown of each of these regions when the binary is linked:

    [3/3] Linking target src/applications/blinky/blinky_dongle
    Memory region         Used Size  Region Size  %age Used
               FLASH:       13512 B      1020 KB      1.29%
                 RAM:         892 B     262136 B      0.34%

Note
You need to pass -Wl,--print-memory-usage if you’re using the compiler as a front-end for the link step.

On more complex projects, you might have even more memory regions. For example, your flash memory may be split into separate regions for the program and for static files (such as a device tree). You might also have multiple RAM regions, such as a processor’s internal SRAM and external DRAM. Getting a granular breakdown of how much of each region is used at link time can be extremely helpful.

    Memory region         Used Size  Region Size  %age Used
                 rom:       31284 B         2 MB      1.49%
            data_tcm:       26224 B        32 KB     80.03%
            prog_tcm:           0 B        32 KB      0.00%
                 ram:      146744 B       320 KB     44.78%
               sdram:          2 MB         2 MB    100.00%

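If you want to act on this table automatically (for example, to fail a build when a region gets too full), the output is simple enough to parse. Here is a minimal Python sketch, not an official tool. The helper names parse_memory_usage and regions_over and the 80% limit are my own choices for illustration, and the regular expression assumes the row layout shown above, with sizes reported in B, KB, MB, or GB using 1024-based multipliers (consistent with the 0xff000-byte FLASH region being reported as 1020 KB).

```python
import re

# Multipliers for the size suffixes seen in the linker output (1024-based).
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}

# One data row looks like: "data_tcm:  26224 B  32 KB  80.03%"
ROW = re.compile(
    r"^\s*(?P<name>\S+):\s+(?P<used>\d+)\s+(?P<used_unit>[KMG]?B)"
    r"\s+(?P<size>\d+)\s+(?P<size_unit>[KMG]?B)\s+(?P<pct>[\d.]+)%\s*$"
)


def parse_memory_usage(text):
    """Return {region: (used_bytes, region_bytes, percent_used)}."""
    regions = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header line, build-system chatter, blank lines
        used = int(match["used"]) * UNITS[match["used_unit"]]
        size = int(match["size"]) * UNITS[match["size_unit"]]
        regions[match["name"]] = (used, size, float(match["pct"]))
    return regions


def regions_over(regions, limit_percent):
    """Return the sorted names of regions at or above limit_percent."""
    return sorted(
        name for name, (_, _, pct) in regions.items() if pct >= limit_percent
    )


SAMPLE = """\
Memory region         Used Size  Region Size  %age Used
             rom:       31284 B         2 MB      1.49%
        data_tcm:       26224 B        32 KB     80.03%
        prog_tcm:           0 B        32 KB      0.00%
             ram:      146744 B       320 KB     44.78%
           sdram:          2 MB         2 MB    100.00%
"""

# Report every region that is at least 80% full (the limit is arbitrary).
for region in regions_over(parse_memory_usage(SAMPLE), 80.0):
    print("region nearly full:", region)
```

In a real build you would feed the captured linker output to parse_memory_usage instead of the SAMPLE string.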
Warning
For --print-memory-usage to properly account for memory stored in two places (e.g., the contents of .data are stored in FLASH as part of the binary and copied to RAM on startup), you need to use the pattern .data : { ... } > RAM AT > FLASH in your linker script. The pattern of .data: AT(some_location) { ... } > RAM will produce incorrect results, with .data only being accounted for in RAM.

For more information, see: Configuring a Linker Script for Accurate Linker Memory Usage Reporting

-fstack-usage

The -fstack-usage flag is useful for analyzing the stack usage of individual functions. You can use this data to calculate a worst-case stack usage for your program. Knowing your worst-case stack usage is important, because it helps you properly configure the stack sizes for a system. Undersized stacks will overflow, while oversized stacks steal precious memory resources that could be used by other areas of the program.

This compiler flag should be added to each compilation unit you want to generate data for. For every object file that is built with this flag, a corresponding .su file will be generated. Inside that file you will find a list of functions, the known number of bytes added by each function’s stack frame, and a qualifier: static, dynamic, or dynamic bounded.

The static qualifier indicates that the frame size is known, and all local variables have a static size. The size reported for the function is accurate.

The dynamic qualifier indicates that the frame size is not static, which occurs primarily when local variables have a dynamic size. If dynamic is also qualified by bounded, then it means the number of bytes reported is a reliable maximum of the function stack utilization. If it is not bounded, do not rely on the reported number in your worst-case stack analysis.

Here’s example output from a simple nRF52 blinky application:

    driver.hpp:154:22:virtual embvm::DriverBase& embvm::DriverBase::operator++() 0 static
    led.hpp:207:7:void embvm::led::gpio<TActiveHigh>::off() [with bool TActiveHigh = false] 8 static
    led.hpp:201:7:void embvm::led::gpio<TActiveHigh>::on() [with bool TActiveHigh = false] 8 static
    led.hpp:190:7:void embvm::led::gpio<TActiveHigh>::start_() [with bool TActiveHigh = false] 16 static
    led.hpp:196:7:void embvm::led::gpio<TActiveHigh>::stop_() [with bool TActiveHigh = false] 8 static
    virtual_platform.hpp:36:20:void __tcf_0() 0 static
    nrf_gpio.hpp:53:7:void nRFGPIOOutput<TPort, TPin>::set(bool) [with unsigned char TPort = 0; unsigned char TPin = 6] 0 static
    nrf_gpio.hpp:53:7:void nRFGPIOOutput<TPort, TPin>::set(bool) [with unsigned char TPort = 0; unsigned char TPin = 8] 0 static
    nrf_gpio.hpp:53:7:void nRFGPIOOutput<TPort, TPin>::set(bool) [with unsigned char TPort = 1; unsigned char TPin = 9] 0 static
    nrf_gpio.hpp:53:7:void nRFGPIOOutput<TPort, TPin>::set(bool) [with unsigned char TPort = 0; unsigned char TPin = 12] 0 static
    nrf_gpio.hpp:71:7:void nRFGPIOOutput<TPort, TPin>::stop_() [with unsigned char TPort = 0; unsigned char TPin = 6] 0 static
    nrf_gpio.hpp:71:7:void nRFGPIOOutput<TPort, TPin>::stop_() [with unsigned char TPort = 0; unsigned char TPin = 8] 0 static
    nrf_gpio.hpp:71:7:void nRFGPIOOutput<TPort, TPin>::stop_() [with unsigned char TPort = 1; unsigned char TPin = 9] 0 static
    nrf_gpio.hpp:71:7:void nRFGPIOOutput<TPort, TPin>::stop_() [with unsigned char TPort = 0; unsigned char TPin = 12] 0 static
    nrf_gpio.hpp:66:7:void nRFGPIOOutput<TPort, TPin>::start_() [with unsigned char TPort = 0; unsigned char TPin = 6] 0 static
    nrf_gpio.hpp:66:7:void nRFGPIOOutput<TPort, TPin>::start_() [with unsigned char TPort = 0; unsigned char TPin = 8] 0 static
    nrf_gpio.hpp:66:7:void nRFGPIOOutput<TPort, TPin>::start_() [with unsigned char TPort = 1; unsigned char TPin = 9] 0 static
    nrf_gpio.hpp:66:7:void nRFGPIOOutput<TPort, TPin>::start_() [with unsigned char TPort = 0; unsigned char TPin = 12] 0 static
    nrf_gpio.hpp:51:2:nRFGPIOOutput<TPort, TPin>::~nRFGPIOOutput() [with unsigned char TPort = 0; unsigned char TPin = 6] 8 static
    nrf_gpio.hpp:51:2:nRFGPIOOutput<TPort, TPin>::~nRFGPIOOutput() [with unsigned char TPort = 0; unsigned char TPin = 8] 8 static
    nrf_gpio.hpp:51:2:nRFGPIOOutput<TPort, TPin>::~nRFGPIOOutput() [with unsigned char TPort = 1; unsigned char TPin = 9] 8 static
    nrf_gpio.hpp:51:2:nRFGPIOOutput<TPort, TPin>::~nRFGPIOOutput() [with unsigned char TPort = 0; unsigned char TPin = 12] 8 static
    led.hpp:188:2:embvm::led::gpio<TActiveHigh>::~gpio() [with bool TActiveHigh = false] 8 static
    led.hpp:188:2:embvm::led::gpio<TActiveHigh>::~gpio() [with bool TActiveHigh = false] 8 static
    led.hpp:213:7:void embvm::led::gpio<TActiveHigh>::toggle() [with bool TActiveHigh = false] 8 static
    nrf52_dongle_hw_platform.hpp:46:7:void NRF52DongleHWPlatform::startBlink() 72 static
    blinky.cpp:5:5:int main() 48 static
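Since .su files are plain text, they are easy to post-process. The following Python sketch shows one way to turn entries into structured data. It is an illustration under assumptions rather than a reference parser: the names StackEntry, parse_su, and is_reliable are hypothetical, and the code assumes each entry sits on its own line and ends with the byte count followed by the qualifier as whitespace-separated fields. It splits from the right because C++ signatures contain spaces.

```python
from typing import NamedTuple


class StackEntry(NamedTuple):
    function: str   # "file:line:col:signature", kept exactly as written
    size: int       # number of bytes reported for the stack frame
    qualifier: str  # "static", "dynamic", or a dynamic bounded variant


def parse_su(text):
    """Parse the contents of a .su file into a list of StackEntry items."""
    entries = []
    for line in text.splitlines():
        # The last two whitespace-separated fields are size and qualifier.
        parts = line.rsplit(None, 2)
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # skip blank or unrecognized lines
        entries.append(StackEntry(parts[0], int(parts[1]), parts[2]))
    return entries


def is_reliable(entry):
    """Static sizes are exact; dynamic sizes are a safe maximum only if bounded."""
    return entry.qualifier == "static" or "bounded" in entry.qualifier


# A few entries taken from the blinky output in this article.
SAMPLE = """\
led.hpp:190:7:void embvm::led::gpio<TActiveHigh>::start_() [with bool TActiveHigh = false] 16 static
nrf52_dongle_hw_platform.hpp:46:7:void NRF52DongleHWPlatform::startBlink() 72 static
blinky.cpp:5:5:int main() 48 static
"""

entries = parse_su(SAMPLE)
largest = max(entries, key=lambda entry: entry.size)
print(largest.size, largest.function)

# Hypothetical entry (not from the article) showing an unbounded dynamic frame.
unbounded = StackEntry("example.c:1:6:fill_buffer", 32, "dynamic")
print(is_reliable(unbounded))
```

Entries for which is_reliable returns False should be reviewed by hand rather than fed into a worst-case calculation.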

Now, we can’t easily generate a worst-case analysis without having a corresponding call graph that we can use to calculate stack depth and total frame size. The original paper outlining -fstack-usage mentions an -fcallgraph-info flag that can be used alongside -fstack-usage to generate worst-case analysis, but this doesn’t appear to be available in any of our installed versions of GCC.

Two potential pre-built solutions that you can leverage are avstack.pl (doesn’t work well for C++) and WorstCaseStack (currently doesn’t run on my system).

You could also generate the call graph data yourself, either with a tool like cflow or with compiler flags like -fdump-tree-optimized/-fdump-rtl-dfinish, and then parse the .su files to determine the stack usage for each function.
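Once you have both pieces of data, the worst-case calculation itself is a walk over the call graph: a function’s worst case is its own frame size plus the largest worst case among its callees. Here is a minimal Python sketch of that idea. The frame sizes are taken from the blinky output above, but the call graph edges and the shortened function names are hypothetical, chosen only to make the example runnable. It also ignores things a real analysis must handle, such as interrupt handlers, calls through function pointers or virtual functions, and functions with no .su entry.

```python
def worst_case_stack(func, frame_sizes, call_graph, _path=()):
    """Return the largest cumulative stack usage, in bytes, starting at func.

    frame_sizes maps a function name to its frame size in bytes.
    call_graph maps a function name to the functions it calls directly.
    Recursion cannot be bounded statically, so it is reported as an error.
    """
    if func in _path:
        cycle = " -> ".join(_path + (func,))
        raise ValueError("recursion detected: " + cycle)
    callees = call_graph.get(func, ())
    deepest_callee = max(
        (
            worst_case_stack(callee, frame_sizes, call_graph, _path + (func,))
            for callee in callees
        ),
        default=0,  # leaf function: nothing below it on the stack
    )
    return frame_sizes[func] + deepest_callee


# Frame sizes as reported in the .su output (names shortened for readability).
FRAME_SIZES = {
    "main": 48,
    "startBlink": 72,
    "gpio::start_": 16,
    "gpio::toggle": 8,
    "nRFGPIOOutput::set": 0,
}

# Hypothetical call graph: these edges are assumed for illustration only.
CALL_GRAPH = {
    "main": ["startBlink"],
    "startBlink": ["gpio::start_", "gpio::toggle"],
    "gpio::toggle": ["nRFGPIOOutput::set"],
}

print(worst_case_stack("main", FRAME_SIZES, CALL_GRAPH))
```

With these assumed edges, the deepest path is main, startBlink, gpio::start_, for a total of 48 + 72 + 16 = 136 bytes.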

If you’re aware of any tools that can be used alongside -fstack-usage, let us know in the comments!

-Wstack-usage

You can generate warnings for stack usage using the -Wstack-usage flag, which takes a byte size as an argument (-Wstack-usage=byte-size). If the stack usage of a function exceeds (or might exceed, for dynamic frames) the specified limit, you will see a compiler warning. This flag can be used separately from -fstack-usage.

-Wstack-usage=PTRDIFF_MAX is enabled by default. That limit is far too large to be valuable on an embedded system, so we typically use a smaller value that is tuned relative to our system’s stack sizes, such as 128 bytes. A function consuming 128 bytes on the stack may not be a problem in itself, but it does indicate that we need to take a look at the function.

For example, if I compile the program above with -Wstack-usage=48, the compiler helpfully warns me about startBlink(), which has a stack frame size of 72 bytes:

    In file included from ../src/platforms/blinky/blinky_nrf52dongle/platform.hpp:6,
                     from ../src/applications/blinky/blinky.cpp:1:
    ../src/hw_platforms/nordic/nrf52840_dongle/nrf52_dongle_hw_platform.hpp: In member function 'void NRF52DongleHWPlatform::startBlink()':
    ../src/hw_platforms/nordic/nrf52840_dongle/nrf52_dongle_hw_platform.hpp:46:7: warning: stack usage is 72 bytes [-Wstack-usage=]
       46 |  void startBlink() noexcept
          |       ^~~~~~~~~~

Note
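The warning message is worded differently for dynamic stack frames: GCC reports “stack usage might be N bytes” for a dynamic bounded frame and “stack usage might be unbounded” for a dynamic frame with no bound.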

You can disable the warning by specifying a byte-size of SIZE_MAX or more, or by using -Wno-stack-usage.

Putting it All Together

Keeping an eye on memory usage is essential when building memory-constrained embedded systems. It’s much easier to notice potential memory problems with tooling than it is to debug strange behavior at runtime.

While these tools aren’t perfect, they do help us identify potential memory problems before we try running our programs on the target hardware.

4 Replies to “Three GCC Flags for Analyzing Memory Usage”

  1. Yes. The -fcallgraph-info flag doesn’t seem to work, at least in gcc 9.3.1. It looks like ARM provides this option in armcc 6 and not in arm-none-eabi-gcc.

  2. If you have a Segger debugger you can use Ozone’s “Call Graph” feature. It gives a lot of useful information about stack usage, even from library functions. I recently made great use of this when designing stacks for an RTOS.

    Anyone who doesn’t have a Segger debugger should get one 😉

  3. Doxygen is capable of generating call graphs, although that’s not its primary function. I know at least one of the output forms is the .dot format used by Graphviz, for which libraries exist that let you work with the graphs.

  4. I’m continuing work on avstack.pl from Windows, so I’ve started by porting it to a Python module here https://github.com/JPHutchins/avstack/tree/main/port

    So far there is a unit test that shows parity with avstack.pl as well as a helper script to generate all the .o/.obj args you’d need. Windows/Linux/Mac compatible. Also considering looking into:

    https://github.com/HBehrens/puncover
    and
    https://github.com/PeterMcKinnis/WorstCaseStack
    to expand features.

    Tantalizing lead here https://developer.arm.com/documentation/ka004928/1-0 but the GCC linker doesn’t support this. It’s unfortunate, because the call graph is only “real” after it’s linked. For that reason I’m considering looking at the map files, but I don’t want to reinvent the wheel and I don’t want to go down a path that becomes obsolete or unsupported!
