Analyzing a SIGFPE in C, modern core dumps

The short version:

  1. Compile with debug symbols using -g (and make sure the build doesn’t strip them).
  2. Make sure core dumps are being written.
  3. If you’re on Ubuntu, unpack them with apport-unpack.
  4. Have gdb tell you where the fault is.

Tutorial:

Recently during my live coding show, as part of a larger project, I decided to work with ffmpeg and libav to extract raw YUV keyframes from a video, in an effort to get better OCR results. My theory was that the color components (U and V), plus the transform to the RGB color space (and quantization error), probably weren’t helping the letter recognition.

During the show, one of my usual viewers was inspired, and decided to write their own direct video processing utility. I tried searching for a helpful link on how to analyze where the math trouble was, but didn’t find much that I was willing to send them, so here we go with a quick tutorial in the hope that it may be useful to others as well!

The chat member said they were getting a SIGFPE. My first reaction was, “it’s probably a divide by zero or a bad log or something”, but given how much code was involved, the chatter wasn’t sure where to look. Back in the day, a Unix system would just drop a core file which would tell you, but that isn’t the default anymore on Ubuntu.

So, let’s start with a program that causes the same symptoms. Save this as testfpe.c:

#include <stdio.h>

int main(int argc, char **argv) {
    printf("%d\n", 1 / 0);
    return 0;
}

Sure, it’s easy to see where the divide by zero is here, but when the program is 10,000 lines long, it quickly becomes a pain to spot the issue.

Step 0 - Know there’s a problem

Compile it up (including symbols with -g) and run it:

$ gcc -g testfpe.c -o testfpe
testfpe.c: In function ‘main’:
testfpe.c:5:22: warning: division by zero [-Wdiv-by-zero]
    5 |     printf("%d\n", 1 / 0);
      |
# alright, gcc spotted this one, but in the chat member's case, it didn't:
$ gcc -w -g testfpe.c -o testfpe
$
# no warning this time!
$ ./testfpe 
Floating point exception (core dumped)
$ ls
testfpe  testfpe.c
# ok, maybe core dumped, but where is it?
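
If the program runs from a script rather than an interactive terminal, the "Floating point exception" message is easy to miss, but the exit status still gives it away: shells report a process killed by signal N as 128 + N, and SIGFPE is signal 8 on Linux. A minimal sketch of that check follows. The `sh -c 'kill ...'` line is only a stand-in I made up so the snippet runs without testfpe; swap in the real program.

```shell
# Stand-in for a crashing program: a child shell that sends itself SIGFPE.
# In real use, replace the sh -c line with the program under test.
status=0
sh -c 'kill -s FPE $$' || status=$?

# Shells report "killed by signal N" as exit status 128 + N.
if [ "$status" -gt 128 ]; then
    signame=$(kill -l "$status")
    echo "killed by SIG$signame (exit status $status)"
else
    echo "exited normally with status $status"
fi
```

With the stand-in above, the status should come out as 136 and the reported name as SIGFPE.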

Step 1 - Set core dump sizes

Well, there’s no core dump file in the current directory. It’s probably just the default Ubuntu / bash limit on core dump sizes being set to zero!

$ ulimit -c
0
# ah ha! all we need to do is set the core dump limit to something big!
$ ulimit -c unlimited
$
# did that work? let's look:
$ ulimit -c
unlimited
# ok, let's generate that core file:
$ ./testfpe
Floating point exception (core dumped)
$ ls
testfpe  testfpe.c
# what? where is it?
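
One thing worth knowing about the ulimit change: it applies to the current shell and anything started from it afterwards, not to other terminals. If you would rather not change the limit for the whole session, a subshell keeps it local. A small sketch, assuming a hypothetical helper name `run_with_core` (not something from the steps above):

```shell
# Run one command with core dumps enabled. The parentheses start a
# subshell, so the raised limit does not leak into the calling shell.
run_with_core() {
    (
        # Raise the soft limit as far as the hard limit allows.
        ulimit -S -c "$(ulimit -H -c)" || true
        "$@"
    )
}

# Example: the limit outside the subshell is left untouched.
before=$(ulimit -c)
run_with_core sh -c 'echo "inside: $(ulimit -c)"'
echo "outside: $(ulimit -c) (was $before)"
```

In this walkthrough that would be `run_with_core ./testfpe`. It does not change where the core ends up, which is the next problem.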

Step 2 - Check how the kernel dumps core

No core file present here. Let’s see how the kernel is configured to drop a core dump - I’ll use the sysctl interface, but the same info is in /proc/sys/kernel/core_pattern:

$ sysctl kernel.core_pattern
kernel.core_pattern = |/usr/share/apport/apport %p %s %c %d %P %E
# interesting - /usr/share/apport/apport is handling the core dump
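
The leading "|" in that value is the important part: it means the kernel pipes the dump to a helper program instead of writing a file itself. As a rough sketch, a script can read the same value from /proc and tell the two cases apart. The function name `describe_core_pattern` is my own invention for illustration.

```shell
# Classify a core_pattern value: a leading "|" means the kernel pipes
# the dump to a helper program instead of writing a file itself.
describe_core_pattern() {
    case "$1" in
        "|"*) printf '%s\n' "piped to helper: ${1#"|"}" ;;
        "")   printf '%s\n' "empty pattern" ;;
        *)    printf '%s\n' "written to file: $1" ;;
    esac
}

# The same value sysctl prints, read directly from /proc.
if [ -r /proc/sys/kernel/core_pattern ]; then
    describe_core_pattern "$(cat /proc/sys/kernel/core_pattern)"
fi
```

On the system shown above this would report the apport helper. On a machine with a plain pattern such as `core`, the file lands in the crashing process's working directory.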

Ubuntu probably decided at some point to take crashes in packaged programs more seriously than just leaving core files to collect dust on people’s computers. They built apport to alert the user in a graphical environment that a program crashed, collect the relevant core and runtime info needed to debug it, and give the user the option to ship that off to Canonical for analysis, similar to abrt on other distros. By default, this doesn’t drop core files for user compiled programs, at least on my install.

Step 3 - Let apport know you want the core file

To get the core file from apport (assuming you don’t want to change the kernel core pattern for your system), you need to first configure it to drop core dumps for unpackaged binaries like testfpe, so we can pick it up later:

# assuming you don't have an apport config,
#  (likely if you are reading this), let's create one:
$ mkdir -p ~/.config/apport
# now vim / ed / joe / bb / emacs / gedit / nano / whatever
#  to build the settings, or just write them with printf:
$ printf "[main]\nunpackaged=true\n" > ~/.config/apport/settings
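
Note that the `>` redirect replaces any settings file that is already there. If you are not sure whether you have one, a guarded version is safer. This is only a sketch: the function name and the optional directory argument are hypothetical, added so it can be tried in a scratch directory first.

```shell
# Write the apport settings file only if one is not already present.
# The directory argument is hypothetical; it defaults to ~/.config/apport.
write_apport_settings() {
    dir=${1:-"$HOME/.config/apport"}
    mkdir -p "$dir" || return 1
    if [ -e "$dir/settings" ]; then
        echo "keeping existing $dir/settings" >&2
        return 0
    fi
    printf '[main]\nunpackaged=true\n' > "$dir/settings"
}

# Dry run in a scratch directory, leaving the real config alone.
scratch=$(mktemp -d)
write_apport_settings "$scratch"
cat "$scratch/settings"
```

Calling `write_apport_settings` with no argument targets the real location used above.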

Step 4 - Re-run the program and pick up core

Apport will now drop crash reports in /var/crash, named after the executable’s path. We’ll need to unpack the packaged data (including the core) before we can use it with a debugger like gdb:

$ ./testfpe
Floating point exception (core dumped)
# if you're in X11, you'll likely have a popup that may be hidden
#  such as: "Sorry, the application testfpe has stopped unexpectedly"
#  "Send problem report to the developers?"
#  since that's us in this case, click "Don't send"
# look for the package including the core dump:
$ ls /var/crash
_apps_testfpe_testfpe.1000.crash
# you may have other stuff in there as well

Great, now we need to extract the core file from the package:

# this will unpack it in the sub-directory "corepackage":
$ apport-unpack /var/crash/_apps_testfpe_testfpe.1000.crash corepackage
$ ls -F
corepackage/  testfpe*  testfpe.c
# now we should have corepackage/CoreDump
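
If /var/crash is crowded, it helps to work out which report belongs to your binary. Judging from the listing above, the name looks like the executable's absolute path with "/" replaced by "_", followed by the user id and ".crash". That rule is my assumption from the output shown here, not something I checked against apport's documentation, and `crash_report_name` is a made-up helper name.

```shell
# Guess the report name apport uses for an executable: the absolute
# path with "/" replaced by "_", then ".<uid>.crash".
# This naming rule is an assumption inferred from the listing above.
crash_report_name() {
    printf '%s.%s.crash\n' "$(printf '%s' "$1" | tr '/' '_')" "$2"
}

exe=$(readlink -f ./testfpe)   # absolute path of the binary
report=/var/crash/$(crash_report_name "$exe" "$(id -u)")
if [ -r "$report" ]; then
    apport-unpack "$report" corepackage
else
    echo "no readable report at $report" >&2
fi
```

If the guess misses, fall back to `ls /var/crash` as shown above and pick the file by hand.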

Step 5 - Load it into gdb

Tell the debugger where to find the executable and the core dump:

$ gdb ./testfpe corepackage/CoreDump
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./testfpe...
[New LWP 239439]
Core was generated by `./testfpe'.
Program terminated with signal SIGFPE, Arithmetic exception.
#0  0x000056383906a167 in main (argc=1, argv=0x7ffef1308a58) at testfpe.c:5
5           printf("%d\n", 1 / 0);
(gdb)

Or, if we ignore the banner and the extra info:

Core was generated by `./testfpe'.
Program terminated with signal SIGFPE, Arithmetic exception.
#0  0x000056383906a167 in main (argc=1, argv=0x7ffef1308a58) at testfpe.c:5
5           printf("%d\n", 1 / 0);

If the exact line of C doesn’t show up, you probably need to recompile with -g to generate symbols, or if you’re in a larger project, make sure that strip is not stripping the symbols away as a build step (common in release software).
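
If you only want the crash location and not an interactive session, gdb can run a few commands and exit. A minimal sketch using gdb's `-batch` and `-ex` options; the wrapper name `core_backtrace` and its error messages are my own, the gdb options and commands are standard.

```shell
# Print a backtrace from a core file without an interactive session.
# Returns non-zero if gdb or either file is missing.
core_backtrace() {
    exe=$1
    core=$2
    command -v gdb >/dev/null 2>&1 || { echo "gdb not found" >&2; return 1; }
    [ -r "$exe" ] && [ -r "$core" ] || { echo "missing $exe or $core" >&2; return 1; }
    # -batch exits after running the commands; each -ex is one gdb command.
    gdb -batch -ex "bt" -ex "info locals" "$exe" "$core"
}
```

For the example here that would be `core_backtrace ./testfpe corepackage/CoreDump`, which should show the same frame #0 as the session above plus any local variables in that frame.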

There’s our SIGFPE, with the exact line that produced it. Certainly does make debugging a lot easier in a large program!