PY-MA An Embedded Error Recovery and Debugging Mechanism for Scripting Language Extensions David M. Beazley Department of Computer Science The University of Chicago beazley@cs.uchicago.edu June 29, 2001
Outline Scripting language extensions and debugging Wrapped Application Debugger (WAD) Discussion
Scripting Language Extensions Scripting languages • Tcl, Perl, Python, Ruby, Guile, PHP etc. , Scripting extensions • Extend scripting interpreters with compiled functions and types (in C/C++) • Provide access to popular libraries and existing software • Performance optimization • A variety of automated tools can help (e.g., SWIG) Benefits • Scripting languages make a lot of things easier • User interface development • Web programming • Rapid prototyping, debugging, etc. • Plus, scripting languages are fun to use
Example Scripting Interpreter Dynamic Loading foo.so bar.so ... blah.so Extension building • Extensions packaged as shared libraries • Dynamic loading used to import modules • Contents of extension modules add new commands or types • Everything lives in the same process • Interpreter responsible for high-level control Examples • GUI built with Tcl/Tk • Scientific/engineering software (e.g., MATLAB)
A More Complex Example Apache libphp3.so mod_perl.so ... mod_python.so Embedded Python Interpreter Compiled Extensions MySQLdb.so blah.so .py .py .py .py .py scripts • May have multiple interpreters and a mix of C code and scripts • Also may include IPC, sockets, subprocesses, threads, etc. • Bottom line: a much more complicated runtime environment
Scripting Language Errors If you make a mistake in a script, you get an error % python spam.py Traceback (most recent call last): File "spam.py", line 15, in ? blah() File "spam.py", line 12, in blah bar() File "spam.py", line 9, in bar foo() File "spam.py", line 6, in foo spam(3) File "spam.py", line 3, in spam doh(n) NameError: There is no variable named 'doh' To fix: • Just look at the error message and make a change • Run a script-level debugger
Errors in Extension Code Uh oh... % python spam.py Segmentation Fault (core dumped) or % python spam.py Bus Error (core dumped) or % python spam.py Assertion failed: n > 0, file debug.c, line 54 Abort (core dumped) Obviously something “bad” has happened • Even worse: May get no error message at all (mod_python) • May not get a core file either
Common Failure Modes Uninitialized Data • Improper initialization of libraries (forgot to call an initialization function?) • Calling functions in the wrong order? Improper argument checking • Passing of NULL pointers • Improper conversion of scripting objects to C Failed assertions • Extensive use of assert(). Weird stuff • Illegal instructions • Bus error. Memory alignment problems Math errors • Floating point exception (divide by zero?). To fix: • Trial and error with print statements? • Cursing?
GDB Traceback (gdb) where #31 0xff165544 in Tcl_UplevelObjCmd (dummy=0x1, interp=0x20e658, objc=1, #0 0xff1d9bf0 in __sigprocmask () from /usr/lib/libthread.so.1 objv=0x24ec80) #1 0xff1ce628 in _resetsig () from /usr/lib/libthread.so.1 at ./../generic/tclProc.c:614 #2 0xff1cdd18 in _sigon () from /usr/lib/libthread.so.1 #32 0xff13e98c in TclExecuteByteCode (interp=0x20e658, codePtr=0x2a0b70) #3 0xff1d0e8c in _thrp_kill () from /usr/lib/libthread.so.1 at ./../generic/tclExecute.c:845 #4 0xfee49b10 in raise () from /usr/lib/libc.so.1 #33 0xff122bf8 in Tcl_EvalObjEx (interp=0x20e658, objPtr=0x274d50, flags=0) #5 0xfee3512c in abort () from /usr/lib/libc.so.1 at ./../generic/tclBasic.c:2723 #6 0xfee353d0 in _assert () from /usr/lib/libc.so.1 #34 0xff165afc in TclObjInterpProc (clientData=0x1, interp=0x20e658, objc=0, #7 0xfeee13ec in abort_crash () from /u0/beazley/Projects/WAD/WAD/Test/./ objv=0xffbeebd8) at ./../generic/tclProc.c:1001 debugmodule.so #35 0xff15e18c in EvalObjv (interp=0x20e658, objc=2, objv=0xffbeebd8, #8 0xfeee28ec in _wrap_abort_crash () command=0xffbef024 "n tkButtonUp .1907556n", length=25, flags=0) from /u0/beazley/Projects/WAD/WAD/Test/./debugmodule.so at ./../generic/tclParse.c:932 #9 0x281c8 in call_builtin (func=0x1cc4f0, arg=0x1f9424, kw=0x0) at ceval.c:2650 #36 0xff15e7d0 in Tcl_EvalEx (interp=0x20e658, #10 0x28094 in PyEval_CallObjectWithKeywords (func=0x1cc4f0, arg=0x1f9424, kw=0x0) script=0xffbef024 "n tkButtonUp .1907556n", numBytes=25, flags=-4264800) at ceval.c:2618 at ./../generic/tclParse.c:1393 #11 0x26764 in eval_code2 (co=0x1d37e0, globals=0x0, locals=0x1d37cf, args=0x1cc4f0, #37 0xff15e9c0 in Tcl_Eval (interp=0x20e658, argcount=1762552, kws=0x0, kwcount=0, defs=0x0, defcount=0, owner=0x0) at string=0xffbef024 "n tkButtonUp .1907556n") at ./../generic/tclParse.c:1512 ceval.c:1951 #38 0xff1243d0 in Tcl_GlobalEval (interp=0x20e658, #12 0x263a0 in eval_code2 (co=0x1d3858, globals=0x0, locals=0x1cc4f0, args=0x19b1a4, command=0xffbef024 "n tkButtonUp .1907556n") at ./../generic/tclBasic.c:4139 argcount=1883008, kws=0x1d7318, kwcount=0, defs=0x0, defcount=0, owner=0x0) #39 0xff221a40 in Tk_BindEvent (bindingTable=0xffbef024, eventPtr=0x29ffa0, at ceval.c:1850 tkwin=0x2790a8, numObjects=2045728, objectPtr=0xffbef170) at ./../generic/ #13 0x263a0 in eval_code2 (co=0x1d3e50, globals=0x0, locals=0x19b1a4, args=0x1a7374, tkBind.c:1784 argcount=1883128, kws=0x0, kwcount=0, defs=0x0, defcount=0, owner=0x0) at #40 0xff226450 in TkBindEventProc (winPtr=0x2790a8, eventPtr=0x29ffa0) ceval.c:1850 at ./../generic/tkCmds.c:244 #14 0x285e0 in call_function (func=0x1a73a4, arg=0x18f114, kw=0x0) at ceval.c:2772 #41 0xff22c218 in Tk_HandleEvent (eventPtr=0x29ffa0) at ./../generic/tkEvent.c:737 #15 0x28080 in PyEval_CallObjectWithKeywords (func=0x1a73a4, arg=0x18f114, kw=0x0) #42 0xff22c61c in WindowEventProc (evPtr=0x29ff98, flags=-1) at ./../generic/ at ceval.c:2616 tkEvent.c:1072 #16 0x680b0 in builtin_apply (self=0x0, args=0x0) at bltinmodule.c:88 #43 0xff15bb54 in Tcl_ServiceEvent (flags=-1) at ./../generic/tclNotify.c:607 #17 0x281c8 in call_builtin (func=0x1910c8, arg=0x1f9b54, kw=0x0) at ceval.c:2650 #44 0xff15beec in Tcl_DoOneEvent (flags=-1) at ./../generic/tclNotify.c:846 #18 0x28094 in PyEval_CallObjectWithKeywords (func=0x1910c8, arg=0x1f9b54, kw=0x0) #45 0x99314 in EventHook () at ./_tkinter.c:2020 at ceval.c:2618 #46 0xbaf30 in rl_read_key () at input.c:374 #19 0x26764 in eval_code2 (co=0x1f3948, globals=0x0, locals=0x1f38f0, args=0x1910c8, #47 0xac920 in readline_internal_char () at readline.c:454 argcount=1733540, kws=0x0, kwcount=0, defs=0x0, defcount=0, owner=0x2436e4) #48 0xaca64 in readline_internal_charloop () at readline.c:507 at ceval.c:1951 #49 0xaca94 in readline_internal () at readline.c:521 #20 0x285e0 in call_function (func=0x24374c, arg=0x1a606c, kw=0x0) at ceval.c:2772 #50 0xac704 in readline (prompt=0x1cbd9c ">>> ") at readline.c:349 #21 0x28080 in PyEval_CallObjectWithKeywords (func=0x261414, arg=0x18f114, kw=0x0) #51 0x8249c in call_readline (prompt=0x1cbd9c ">>> ") at ./readline.c:462 at ceval.c:2616 #52 0x21ae0 in PyOS_Readline (prompt=0x1cbd9c ">>> ") at myreadline.c:118 #22 0x98064 in PythonCmd (clientData=0x1cc8e0, interp=0x20e658, argc=0, #53 0x205a0 in tok_nextc (tok=0x27abd0) at tokenizer.c:192 argv=0xffbee060) #54 0x20fb4 in PyTokenizer_Get (tok=0x27abd0, p_start=0xffbef8c4, p_end=0xffbef8c0) at ./_tkinter.c:1274 at tokenizer.c:516 #23 0xff122064 in TclInvokeStringCommand (clientData=0x278538, interp=0x20e658, #55 0x20274 in parsetok (tok=0x27abd0, g=0x17026c, start=256, err_ret=0xffbef9b0) objc=1, at parsetok.c:128 objv=0x24ec84) at ./../generic/tclBasic.c:1752 #56 0x20158 in PyParser_ParseFile (fp=0x18ebe8, filename=0xbf628 "<stdin>", #24 0xff13e98c in TclExecuteByteCode (interp=0x20e658, codePtr=0x2a0cd0) g=0x17026c, at ./../generic/tclExecute.c:845 start=256, ps1=0x1cbd9c ">>> ", ps2=0x25a7e4 "... ", err_ret=0xffbef9b0) #25 0xff122bf8 in Tcl_EvalObjEx (interp=0x20e658, objPtr=0x2370c8, flags=0) at parsetok.c:75 at ./../generic/tclBasic.c:2723 #57 0x3a9c0 in PyRun_InteractiveOne (fp=0x18ebe8, filename=0xbf628 "<stdin>") #26 0xff258220 in TkInvokeButton (butPtr=0x279188) at ./../generic/tkButton.c:1457 at pythonrun.c:514 #27 0xff257698 in ButtonWidgetObjCmd (clientData=0x279188, interp=0x20e658, objc=2, #58 0x3a8bc in PyRun_InteractiveLoop (fp=0x18ebe8, filename=0xbf628 "<stdin>") objv=0x295e00) at ./../generic/tkButton.c:835 at pythonrun.c:478 #28 0xff15e18c in EvalObjv (interp=0x20e658, objc=2, objv=0x295e00, #59 0x3a7ac in PyRun_AnyFileEx (fp=0x18ebe8, filename=0xbf628 "<stdin>", closeit=0) command=0xff182128 "", at pythonrun.c:453 length=0, flags=262144) at ./../generic/tclParse.c:932 #60 0x3a76c in PyRun_AnyFile (fp=0x18ebe8, filename=0xbf628 "<stdin>") at #29 0xff15e2b8 in Tcl_EvalObjv (interp=0x20e658, objc=2, objv=0x295e00, pythonrun.c:444 flags=262144) #61 0x1ff20 in Py_Main (argc=3, argv=0xffbefc74) at main.c:297 at ./../generic/tclParse.c:1019 #62 0x1f90c in main (argc=3, argv=0xffbefc74) at python.c:10 #30 0xff122928 in Tcl_EvalObjEx (interp=0x20e658, objPtr=0x2370e0, flags=262144) (gdb) at ./../generic/tclBasic.c:2565
GDB Traceback (gdb) where #31 0xff165544 in Tcl_UplevelObjCmd (dummy=0x1, interp=0x20e658, objc=1, #0 0xff1d9bf0 in __sigprocmask () from /usr/lib/libthread.so.1 objv=0x24ec80) #1 0xff1ce628 in _resetsig () from /usr/lib/libthread.so.1 at ./../generic/tclProc.c:614 #2 0xff1cdd18 in _sigon () from /usr/lib/libthread.so.1 #32 0xff13e98c in TclExecuteByteCode (interp=0x20e658, codePtr=0x2a0b70) #3 0xff1d0e8c in _thrp_kill () from /usr/lib/libthread.so.1 at ./../generic/tclExecute.c:845 #4 0xfee49b10 in raise () from /usr/lib/libc.so.1 #33 0xff122bf8 in Tcl_EvalObjEx (interp=0x20e658, objPtr=0x274d50, flags=0) #5 0xfee3512c in abort () from /usr/lib/libc.so.1 at ./../generic/tclBasic.c:2723 #6 0xfee353d0 in _assert () from /usr/lib/libc.so.1 #34 0xff165afc in TclObjInterpProc (clientData=0x1, interp=0x20e658, objc=0, #7 0xfeee13ec in abort_crash () from /u0/beazley/Projects/WAD/WAD/Test/./ objv=0xffbeebd8) at ./../generic/tclProc.c:1001 debugmodule.so #35 0xff15e18c in EvalObjv (interp=0x20e658, objc=2, objv=0xffbeebd8, #8 0xfeee28ec in _wrap_abort_crash () command=0xffbef024 "n tkButtonUp .1907556n", length=25, flags=0) from /u0/beazley/Projects/WAD/WAD/Test/./debugmodule.so at ./../generic/tclParse.c:932 #9 0x281c8 in call_builtin (func=0x1cc4f0, arg=0x1f9424, kw=0x0) at ceval.c:2650 #36 0xff15e7d0 in Tcl_EvalEx (interp=0x20e658, #10 0x28094 in PyEval_CallObjectWithKeywords (func=0x1cc4f0, arg=0x1f9424, kw=0x0) script=0xffbef024 "n tkButtonUp .1907556n", numBytes=25, flags=-4264800) at ceval.c:2618 at ./../generic/tclParse.c:1393 #11 0x26764 in eval_code2 (co=0x1d37e0, globals=0x0, locals=0x1d37cf, args=0x1cc4f0, #37 0xff15e9c0 in Tcl_Eval (interp=0x20e658, argcount=1762552, kws=0x0, kwcount=0, defs=0x0, defcount=0, owner=0x0) at string=0xffbef024 "n tkButtonUp .1907556n") at ./../generic/tclParse.c:1512 ceval.c:1951 #38 0xff1243d0 in Tcl_GlobalEval (interp=0x20e658, (gdb) where #12 0x263a0 in eval_code2 (co=0x1d3858, globals=0x0, locals=0x1cc4f0, args=0x19b1a4, command=0xffbef024 "n tkButtonUp .1907556n") at ./../generic/tclBasic.c:4139 #0 0xff1d9bf0 in __sigprocmask () from /usr/lib/libthread.so.1 argcount=1883008, kws=0x1d7318, kwcount=0, defs=0x0, defcount=0, owner=0x0) at ceval.c:1850 #39 0xff221a40 in Tk_BindEvent (bindingTable=0xffbef024, eventPtr=0x29ffa0, tkwin=0x2790a8, numObjects=2045728, objectPtr=0xffbef170) at ./../generic/ #1 0xff1ce628 in _resetsig () from /usr/lib/libthread.so.1 #13 0x263a0 in eval_code2 (co=0x1d3e50, globals=0x0, locals=0x19b1a4, args=0x1a7374, argcount=1883128, kws=0x0, kwcount=0, defs=0x0, defcount=0, owner=0x0) at tkBind.c:1784 #40 0xff226450 in TkBindEventProc (winPtr=0x2790a8, eventPtr=0x29ffa0) ceval.c:1850 #2 0xff1cdd18 in _sigon () from /usr/lib/libthread.so.1 at ./../generic/tkCmds.c:244 #14 0x285e0 in call_function (func=0x1a73a4, arg=0x18f114, kw=0x0) at ceval.c:2772 #41 0xff22c218 in Tk_HandleEvent (eventPtr=0x29ffa0) at ./../generic/tkEvent.c:737 #3 0xff1d0e8c in _thrp_kill () from /usr/lib/libthread.so.1 #15 0x28080 in PyEval_CallObjectWithKeywords (func=0x1a73a4, arg=0x18f114, kw=0x0) #42 0xff22c61c in WindowEventProc (evPtr=0x29ff98, flags=-1) at ./../generic/ at ceval.c:2616 #4 0xfee49b10 in raise () from /usr/lib/libc.so.1 #16 0x680b0 in builtin_apply (self=0x0, args=0x0) at bltinmodule.c:88 tkEvent.c:1072 #43 0xff15bb54 in Tcl_ServiceEvent (flags=-1) at ./../generic/tclNotify.c:607 #5 0xfee3512c in abort () from /usr/lib/libc.so.1 #17 0x281c8 in call_builtin (func=0x1910c8, arg=0x1f9b54, kw=0x0) at ceval.c:2650 #18 0x28094 in PyEval_CallObjectWithKeywords (func=0x1910c8, arg=0x1f9b54, kw=0x0) #44 0xff15beec in Tcl_DoOneEvent (flags=-1) at ./../generic/tclNotify.c:846 #45 0x99314 in EventHook () at ./_tkinter.c:2020 at ceval.c:2618#6 0xfee353d0 in _assert () from /usr/lib/libc.so.1 #46 0xbaf30 in rl_read_key () at input.c:374 #19 0x26764 in eval_code2 (co=0x1f3948, globals=0x0, locals=0x1f38f0, args=0x1910c8, #47 0xac920 in readline_internal_char () at readline.c:454 #7 0xfeee13ec in abort_crash () from /u0/beazley/Projects/WAD/WAD/Test/ argcount=1733540, kws=0x0, kwcount=0, defs=0x0, defcount=0, owner=0x2436e4) #48 0xaca64 in readline_internal_charloop () at readline.c:507 at ceval.c:1951 ./debugmodule.so #20 0x285e0 in call_function (func=0x24374c, arg=0x1a606c, kw=0x0) at ceval.c:2772 #49 0xaca94 in readline_internal () at readline.c:521 #50 0xac704 in readline (prompt=0x1cbd9c ">>> ") at readline.c:349 #8 0xfeee28ec in _wrap_abort_crash () #21 0x28080 in PyEval_CallObjectWithKeywords (func=0x261414, arg=0x18f114, kw=0x0) at ceval.c:2616 #51 0x8249c in call_readline (prompt=0x1cbd9c ">>> ") at ./readline.c:462 #52 0x21ae0 in PyOS_Readline (prompt=0x1cbd9c ">>> ") at myreadline.c:118 from /u0/beazley/Projects/WAD/WAD/Test/./debugmodule.so #22 0x98064 in PythonCmd (clientData=0x1cc8e0, interp=0x20e658, argc=0, #53 0x205a0 in tok_nextc (tok=0x27abd0) at tokenizer.c:192 argv=0xffbee060) #54 0x20fb4 in PyTokenizer_Get (tok=0x27abd0, p_start=0xffbef8c4, p_end=0xffbef8c0) #9 0x281c8 in call_builtin (func=0x1cc4f0, arg=0x1f9424, kw=0x0) at at ./_tkinter.c:1274 at tokenizer.c:516 ceval.c:2650 #23 0xff122064 in TclInvokeStringCommand (clientData=0x278538, interp=0x20e658, objc=1, #55 0x20274 in parsetok (tok=0x27abd0, g=0x17026c, start=256, err_ret=0xffbef9b0) at parsetok.c:128 #10 0x28094 in PyEval_CallObjectWithKeywords (func=0x1cc4f0, objv=0x24ec84) at ./../generic/tclBasic.c:1752 #24 0xff13e98c in TclExecuteByteCode (interp=0x20e658, codePtr=0x2a0cd0) #56 0x20158 in PyParser_ParseFile (fp=0x18ebe8, filename=0xbf628 "<stdin>", g=0x17026c, arg=0x1f9424, kw=0x0) at ./../generic/tclExecute.c:845 start=256, ps1=0x1cbd9c ">>> ", ps2=0x25a7e4 "... ", err_ret=0xffbef9b0) #25 0xff122bf8 in Tcl_EvalObjEx (interp=0x20e658, objPtr=0x2370c8, flags=0) at parsetok.c:75 at ceval.c:2618 at ./../generic/tclBasic.c:2723 #57 0x3a9c0 in PyRun_InteractiveOne (fp=0x18ebe8, filename=0xbf628 "<stdin>") #26 0xff258220 in TkInvokeButton (butPtr=0x279188) at ./../generic/tkButton.c:1457 at pythonrun.c:514 #27 0xff257698 in ButtonWidgetObjCmd (clientData=0x279188, interp=0x20e658, objc=2, #58 0x3a8bc in PyRun_InteractiveLoop (fp=0x18ebe8, filename=0xbf628 "<stdin>") objv=0x295e00) at ./../generic/tkButton.c:835 at pythonrun.c:478 #28 0xff15e18c in EvalObjv (interp=0x20e658, objc=2, objv=0x295e00, #59 0x3a7ac in PyRun_AnyFileEx (fp=0x18ebe8, filename=0xbf628 "<stdin>", closeit=0) command=0xff182128 "", at pythonrun.c:453 length=0, flags=262144) at ./../generic/tclParse.c:932 #60 0x3a76c in PyRun_AnyFile (fp=0x18ebe8, filename=0xbf628 "<stdin>") at #29 0xff15e2b8 in Tcl_EvalObjv (interp=0x20e658, objc=2, objv=0x295e00, pythonrun.c:444 flags=262144) #61 0x1ff20 in Py_Main (argc=3, argv=0xffbefc74) at main.c:297 at ./../generic/tclParse.c:1019 #62 0x1f90c in main (argc=3, argv=0xffbefc74) at python.c:10 #30 0xff122928 in Tcl_EvalObjEx (interp=0x20e658, objPtr=0x2370e0, flags=262144) (gdb) at ./../generic/tclBasic.c:2565
The Debugging Problem Problem • Debugger doesn’t know anything about script code • Mostly provides information about the implementation of the interpreter • Can’t fully answer question of “how did I get here?” Sometimes it is hard to reproduce a problem • Run-time environment may be complex • Problems may be due to timing or unusual event sequences • Problem may only occur after a long period of time • May not get any debugging information after a crash (no core file, no messages) Other issues • Requires users to run a separate debugger • Requires users to have a C development environment • Assumes users know how to use the debugger • Makes it difficult for an extension developer to get useful feedback Question: Can you do any better?
Embedded Error Recovery Idea: • Most interpreters have exception/error handling. • Maybe fatal extension errors could be handled as exceptions Interpreter Interpreter DeathException Extension Extension Code Code Death (core dumped) Why this approach? • Allows interpreter to unwind its internal call stack and provide details • Might work better in a very complex execution environment • Self contained
WAD Wrapped Application Debugger • A proof of concept implementation of embedded error recovery • A self-contained scripting language extension module • Requires no modifications or recompilation of any code Demo • Tcl/Tk • Mixed language debugging in Python • Apache + mod_python
WAD Demo % python >>> import debug >>> debug.seg_crash() Segmentation fault (core dumped) % % python >>> import debug >>> import libwadpy WAD Enabled >>> debug.seg_crash() Traceback (most recent call last): File "<stdin>", line 1, in ? SegFault: [ C stack trace ] #2 0x000281c0 in call_builtin(func=0x1cbaf0,arg=0x18f114,kw=0x0) in 'ceval.c', line 2650 #1 0xfeee26b8 in _wrap_seg_crash(self=0x0,args=0x18f114) in 'pydebug.c', line 510 #0 0xfeee1258 in seg_crash(0x1,0xfeef2d48,0x19a9f8,0x0,0x7365675f,0x5f5f6469) in 'debug.c', line 18 /u0/beazley/Projects/WAD/WAD/Test/debug.c, line 18 int seg_crash() { int *a = 0; => *a = 3; return 1; }
Big Picture WAD • WAD is a dynamically loadable extension module • Converts catastrophic extension errors to scripting exceptions Key features • No modifications to interpreter or extension code • No recompilation • No relinking • No separate debugger required (gdb, dbx, etc.) • No C, C++, Fortran development environment needed • No added performance penalty (no instrumentation, tracing, etc.) The rest of this talk • How it actually works • Limitations • Discussion
Calling Extension Code Scripting languages provide a foreign function interface Scripting Interpreter call_builtin() NULL, args (result,status) TCL_OK, TCL_ERROR, ... Wrappers args result C/C++ Extension • A small number of exit points from the interpreter (e.g., call_builtin above) • External functions are often wrappers that call the real function • Success indicated by returning a special status code (e.g., NULL, TCL_OK, ...) • Returning an error status generates an interpreter exception/error
Extension Errors Fatal errors result in signals Scripting Interpreter call_func() args Wrappers args C/C++ Extension signal core dumped • SIGSEGV, SIGBUS, SIGABRT, SIGILL, SIGFPE, etc.
Catching Extension Errors Install a special signal handler Scripting Interpreter call_func() args error return from signal Wrappers WAD 1. Inspect stack args 2. Collect info C/C++ Extension signal 3. Raise exception 4. Return • On signal, inspect process stack and collect debugging information • Force a return to the interpreter with a suitable error status • Interpreter doesn’t know anything "bad" happened (looks like a normal error)
WAD Implementation - Signals signal get unwind stack context symbols/debug raise exception signal modify context return Signal handling • Use sigaction() with special options for receiving process context • Handler gets PC, SP and other CPU registers , • Modifications to context take effect on return from handler • Commonly used to implement user threads • We’re going to use it to back out of extension code
WAD Implementation - Unwinding signal get 0x0001f7c4 unwind stack context 0x0001f904 0x0001fedc 0x0003a7c8 symbols/debug 0x000281c0 ... 0xfeee241c 0xfeee1178 raise exception 0x000bee2c 0xfee350e4 0xfee49b08 signal 0xff1d0e84 modify context 0xff1cdd10 return 0xff1d9bf0 error Stack unwinding • Read process virtual memory map from /proc • Unwind entire call stack of the process using PC and SP from process context • Use memory map to validate (in case of corrupted stack frames) • Get list of PC values and raw stack frames
WAD Implementation - Symbols signal signal _start unwind stack handler main Py_Main ... symbols/debug PyEval_CallObjectW call_builtin _wrap_spam spam raise exception __eprintf abort raise signal _thrp_kill modify context _sigon return __sigprocmask Symbols and debugging information • Collect debugging information from ELF symbol table and STABS • Get list of function names, source files, line numbers, arguments, etc. • Final result is a linked-list of annotated stack frames.
WAD Implementation - Exceptions signal signal _start unwind stack handler main Py_Main ... Interpreter symbols/debug PyEval_CallObjectW call_builtin _wrap_spam spam raise exception __eprintf abort raise Extension Code signal _thrp_kill search modify context _sigon return __sigprocmask Raising an exception • Search the call stack to find the first call to extension code (e.g., call_builtin) • Note: this depends on the scripting language (more later) • If match found, stack trace is passed to a language-specific handler function • Handler raises an appropriate scripting language exception with stack trace info
WAD Implementation - Context signal signal _start unwind stack handler main Py_Main ... symbols/debug PyEval_CallObjectW call_builtin _wrap_spam spam raise exception __eprintf abort raise signal _thrp_kill modify context _sigon return __sigprocmask Context modification • Bottom of call stack is chopped off
WAD Implementation - Context signal signal _start unwind stack handler main Py_Main ... symbols/debug PyEval_CallObjectW call_builtin raise exception error signal modify context return Context modification • Bottom of call stack is chopped off • PC, SP and registers modified to emulate a function return with an error status , • Signal handler returns • Details a little hairy (more shortly)
Different Scripting Languages Most of WAD is scripting language neutral • Signal handling, unwinding, generation of stack trace, context modification Each scripting language handled as an extension Python module register syms = { symbol {"call_builtin", NULL}, table {"PyObject_Print", -1}, WAD {"PyObject_Repr",NULL}, Core ... } symbol name return value match void handle_error(stack *s) { error handler found ... } • Module registers names of interpreter functions that might call extension code • Return value is the value an extension function must return to signal an error • Handler only invoked if WAD matches symbol name on call stack
Context Modification Return mechanism is similar to • setjmp/longjmp in C • C++ exception handling. However • The interpreter is not instrumented or modified in any way. • There is no corresponding setjmp() call. • There is no matching try { ... } clause in C++. Problem : corrupted CPU registers • Interpreter not designed with aborted procedure return in mind • By chopping off call stack, we return to the interpreter with inconsistent state • Callee-save registers not restored properly However, we can mostly fix this...
Register Recovery Solution: SPARC • Each procedure gets a fresh set of CPU registers (i.e., a “window”) • To restore state: simply roll back the register windows Solution: i386 • Inspect machine instructions of function prologues • Figure out where callee-save registers were saved on call-stack • Restore values while walking up the stack. blah: 55 pushl %ebp 89 e5 movl %esp,%ebp 83 ec 2c subl $0x2c,%esp Size of locals 57 pushl %edi 56 pushl %esi Saved registers 53 pushl %ebx • Only a heuristic. Might get it wrong, but it usually works.
Implementation Details WAD implementation • Mostly ANSI C, some assembly code, some C++ • ~1500 semicolons • Most code related to introspection (debugging, symbol tables, etc...) • Core is scripting language independent • Execution is isolated (own stack and memory management). • Does not rely upon third party libraries (e.g., libbfd). Compatibility • Sun Sparc Solaris • i386 GNU/Linux (recent kernels). • Python and Tcl • Perl (was working, currently broken)
Limitations Aborted execution • May leak memory • No destruction of objects in C++ • May interact poorly with C++ exceptions • May result in unreleased system resources (files, sockets, etc.) • May result in deadlock (if holding locks when error occurs) Unrecoverable errors • Errors that destroy or corrupt the interpreter • Stack overflow (results in corrupted call-stack) Compiler optimization • False reporting of debugging data, source files, and lines • Incorrect register recovery (-fomit-frame-pointer)
Limitations Threads • Mixing threads and signals is an extremely delicate topic • WAD requires a fully functional signal implementation • Some versions older versions of GNU/Linux do not seem to work correctly Threads present other issues • Signals of interest are handled by the faulting thread • However, some interpreters place a mutex lock around the interpreter Python Extension acquire_lock() ... func() { call_builtin() store_thread_state() release_lock() abort ... run ... ??? acquire_lock() restore_thread_state() success ... return release_lock() } • If signal occurs while lock released, we violate mutual exclusion on interpreter • May deliver exception to wrong thread (probably bad)
Limitations Interference with normal signal handling • Scripting languages also provide support for signals • Signals caught by WAD are unlikely to be used in a script • One exception : Perl sigtrap module • User can disable WAD by installing a new signal handler Stripped symbol tables • May make it difficult to find return location • May provide only limited debugging information • Can find some symbols in dynamic loading tables, but may not be enough Things that just won’t work • Breakpoints • Single-step execution • Restart • WAD lives in the same process
Discussion What happens if WAD breaks? • Worst case: program core dumps • Can still load core in debugger • Will get a few extra stack frames (for WAD) Does WAD prevent the use of a debugger? • Traditional debugger can still be attached to process • Debugger overrides signal handling (which disables WAD) Repeated errors • If interpreter is corrupted, may crash immediately after return • This reinvokes WAD with new context information • May get several different tracebacks, followed by a core dump. Continued execution • After an error occurs, is it safe to resume execution? • Depends on the error • Depends on the application • Primary goal is to get error information, not to run forever
Discussion Wouldn’t it be easier to modify the interpreter? • Use of setjmp()/longjmp() would simplify implementation, avoid register problems • However, number of external exits may be substantial (extension types) • Many interpreters are re-entrant (extensions can call eval) • Performance impact? • Difficult to maintain unless patches made part of interpreter distribution • Not nearly as cool How hard is it to port WAD to other platforms? • WAD language modules are platform-neutral • Operate on data that is presented in a generic format • WAD core would need to be modified • /proc, process context, register recovery, ELF, STABS, etc. • Note: Solaris and GNU/Linux versions are nearly identical however.
Related Work Bits and pieces in many different areas • Programming environments (Common Lisp, Java?) • Modifications to gdb (common lisp, Python, others?) • Perl sigtrap (prints a perl stack trace on fatal signal) • Exception handling • Asynchronous exception handling in programming languages (ML, Haskell, etc.) • PyDebug (persistent breakpoints, interface to gdb from Python) More references? • Debugging and exception handling are huge topics • Not aware of anything quite like WAD in scripting language world • Would be interested in getting feedback
Future Directions A lot of people are building scripting language extensions • Debugging support can be greatly improved • Hope to see more work in this area WAD • Improve the core • Better interaction with threads and C++ • Support for more languages • Support for other platforms • Interaction with other debuggers and IDEs Bottom line • Still a lot of hard problems to work on
More Information http://systems.cs.uchicago.edu/wad • Freely available under LGPL • This is work in progress • Volunteers welcome • I’m also looking for students

An Embedded Error Recovery and Debugging Mechanism for Scripting Language Extensions

  • 1.
    PY-MA An Embedded ErrorRecovery and Debugging Mechanism for Scripting Language Extensions David M. Beazley Department of Computer Science The University of Chicago beazley@cs.uchicago.edu June 29, 2001
  • 2.
    Outline Scripting language extensionsand debugging Wrapped Application Debugger (WAD) Discussion
  • 3.
    Scripting Language Extensions Scriptinglanguages • Tcl, Perl, Python, Ruby, Guile, PHP etc. , Scripting extensions • Extend scripting interpreters with compiled functions and types (in C/C++) • Provide access to popular libraries and existing software • Performance optimization • A variety of automated tools can help (e.g., SWIG) Benefits • Scripting languages make a lot of things easier • User interface development • Web programming • Rapid prototyping, debugging, etc. • Plus, scripting languages are fun to use
  • 4.
    Example Scripting Interpreter Dynamic Loading foo.so bar.so ... blah.so Extension building • Extensions packaged as shared libraries • Dynamic loading used to import modules • Contents of extension modules add new commands or types • Everything lives in the same process • Interpreter responsible for high-level control Examples • GUI built with Tcl/Tk • Scientific/engineering software (e.g., MATLAB)
  • 5.
    A More ComplexExample Apache libphp3.so mod_perl.so ... mod_python.so Embedded Python Interpreter Compiled Extensions MySQLdb.so blah.so .py .py .py .py .py scripts • May have multiple interpreters and a mix of C code and scripts • Also may include IPC, sockets, subprocesses, threads, etc. • Bottom line: a much more complicated runtime environment
  • 6.
    Scripting Language Errors Ifyou make a mistake in a script, you get an error % python spam.py Traceback (most recent call last): File "spam.py", line 15, in ? blah() File "spam.py", line 12, in blah bar() File "spam.py", line 9, in bar foo() File "spam.py", line 6, in foo spam(3) File "spam.py", line 3, in spam doh(n) NameError: There is no variable named 'doh' To fix: • Just look at the error message and make a change • Run a script-level debugger
  • 7.
    Errors in ExtensionCode Uh oh... % python spam.py Segmentation Fault (core dumped) or % python spam.py Bus Error (core dumped) or % python spam.py Assertion failed: n > 0, file debug.c, line 54 Abort (core dumped) Obviously something “bad” has happened • Even worse: May get no error message at all (mod_python) • May not get a core file either
  • 8.
    Common Failure Modes UninitializedData • Improper initialization of libraries (forgot to call an initialization function?) • Calling functions in the wrong order? Improper argument checking • Passing of NULL pointers • Improper conversion of scripting objects to C Failed assertions • Extensive use of assert(). Weird stuff • Illegal instructions • Bus error. Memory alignment problems Math errors • Floating point exception (divide by zero?). To fix: • Trial and error with print statements? • Cursing?
  • 9.
    GDB Traceback (gdb) where #31 0xff165544 in Tcl_UplevelObjCmd (dummy=0x1, interp=0x20e658, objc=1, #0 0xff1d9bf0 in __sigprocmask () from /usr/lib/libthread.so.1 objv=0x24ec80) #1 0xff1ce628 in _resetsig () from /usr/lib/libthread.so.1 at ./../generic/tclProc.c:614 #2 0xff1cdd18 in _sigon () from /usr/lib/libthread.so.1 #32 0xff13e98c in TclExecuteByteCode (interp=0x20e658, codePtr=0x2a0b70) #3 0xff1d0e8c in _thrp_kill () from /usr/lib/libthread.so.1 at ./../generic/tclExecute.c:845 #4 0xfee49b10 in raise () from /usr/lib/libc.so.1 #33 0xff122bf8 in Tcl_EvalObjEx (interp=0x20e658, objPtr=0x274d50, flags=0) #5 0xfee3512c in abort () from /usr/lib/libc.so.1 at ./../generic/tclBasic.c:2723 #6 0xfee353d0 in _assert () from /usr/lib/libc.so.1 #34 0xff165afc in TclObjInterpProc (clientData=0x1, interp=0x20e658, objc=0, #7 0xfeee13ec in abort_crash () from /u0/beazley/Projects/WAD/WAD/Test/./ objv=0xffbeebd8) at ./../generic/tclProc.c:1001 debugmodule.so #35 0xff15e18c in EvalObjv (interp=0x20e658, objc=2, objv=0xffbeebd8, #8 0xfeee28ec in _wrap_abort_crash () command=0xffbef024 "n tkButtonUp .1907556n", length=25, flags=0) from /u0/beazley/Projects/WAD/WAD/Test/./debugmodule.so at ./../generic/tclParse.c:932 #9 0x281c8 in call_builtin (func=0x1cc4f0, arg=0x1f9424, kw=0x0) at ceval.c:2650 #36 0xff15e7d0 in Tcl_EvalEx (interp=0x20e658, #10 0x28094 in PyEval_CallObjectWithKeywords (func=0x1cc4f0, arg=0x1f9424, kw=0x0) script=0xffbef024 "n tkButtonUp .1907556n", numBytes=25, flags=-4264800) at ceval.c:2618 at ./../generic/tclParse.c:1393 #11 0x26764 in eval_code2 (co=0x1d37e0, globals=0x0, locals=0x1d37cf, args=0x1cc4f0, #37 0xff15e9c0 in Tcl_Eval (interp=0x20e658, argcount=1762552, kws=0x0, kwcount=0, defs=0x0, defcount=0, owner=0x0) at string=0xffbef024 "n tkButtonUp .1907556n") at ./../generic/tclParse.c:1512 ceval.c:1951 #38 0xff1243d0 in Tcl_GlobalEval (interp=0x20e658, #12 0x263a0 in eval_code2 (co=0x1d3858, globals=0x0, locals=0x1cc4f0, args=0x19b1a4, command=0xffbef024 "n tkButtonUp .1907556n") at ./../generic/tclBasic.c:4139 argcount=1883008, kws=0x1d7318, kwcount=0, defs=0x0, defcount=0, owner=0x0) #39 0xff221a40 in Tk_BindEvent (bindingTable=0xffbef024, eventPtr=0x29ffa0, at ceval.c:1850 tkwin=0x2790a8, numObjects=2045728, objectPtr=0xffbef170) at ./../generic/ #13 0x263a0 in eval_code2 (co=0x1d3e50, globals=0x0, locals=0x19b1a4, args=0x1a7374, tkBind.c:1784 argcount=1883128, kws=0x0, kwcount=0, defs=0x0, defcount=0, owner=0x0) at #40 0xff226450 in TkBindEventProc (winPtr=0x2790a8, eventPtr=0x29ffa0) ceval.c:1850 at ./../generic/tkCmds.c:244 #14 0x285e0 in call_function (func=0x1a73a4, arg=0x18f114, kw=0x0) at ceval.c:2772 #41 0xff22c218 in Tk_HandleEvent (eventPtr=0x29ffa0) at ./../generic/tkEvent.c:737 #15 0x28080 in PyEval_CallObjectWithKeywords (func=0x1a73a4, arg=0x18f114, kw=0x0) #42 0xff22c61c in WindowEventProc (evPtr=0x29ff98, flags=-1) at ./../generic/ at ceval.c:2616 tkEvent.c:1072 #16 0x680b0 in builtin_apply (self=0x0, args=0x0) at bltinmodule.c:88 #43 0xff15bb54 in Tcl_ServiceEvent (flags=-1) at ./../generic/tclNotify.c:607 #17 0x281c8 in call_builtin (func=0x1910c8, arg=0x1f9b54, kw=0x0) at ceval.c:2650 #44 0xff15beec in Tcl_DoOneEvent (flags=-1) at ./../generic/tclNotify.c:846 #18 0x28094 in PyEval_CallObjectWithKeywords (func=0x1910c8, arg=0x1f9b54, kw=0x0) #45 0x99314 in EventHook () at ./_tkinter.c:2020 at ceval.c:2618 #46 0xbaf30 in rl_read_key () at input.c:374 #19 0x26764 in eval_code2 (co=0x1f3948, globals=0x0, locals=0x1f38f0, args=0x1910c8, #47 0xac920 in readline_internal_char () at readline.c:454 argcount=1733540, kws=0x0, kwcount=0, defs=0x0, defcount=0, owner=0x2436e4) #48 0xaca64 in readline_internal_charloop () at readline.c:507 at ceval.c:1951 #49 0xaca94 in readline_internal () at readline.c:521 #20 0x285e0 in call_function (func=0x24374c, arg=0x1a606c, kw=0x0) at ceval.c:2772 #50 0xac704 in readline (prompt=0x1cbd9c ">>> ") at readline.c:349 #21 0x28080 in PyEval_CallObjectWithKeywords (func=0x261414, arg=0x18f114, kw=0x0) #51 0x8249c in call_readline (prompt=0x1cbd9c ">>> ") at ./readline.c:462 at ceval.c:2616 #52 0x21ae0 in PyOS_Readline (prompt=0x1cbd9c ">>> ") at myreadline.c:118 #22 0x98064 in PythonCmd (clientData=0x1cc8e0, interp=0x20e658, argc=0, #53 0x205a0 in tok_nextc (tok=0x27abd0) at tokenizer.c:192 argv=0xffbee060) #54 0x20fb4 in PyTokenizer_Get (tok=0x27abd0, p_start=0xffbef8c4, p_end=0xffbef8c0) at ./_tkinter.c:1274 at tokenizer.c:516 #23 0xff122064 in TclInvokeStringCommand (clientData=0x278538, interp=0x20e658, #55 0x20274 in parsetok (tok=0x27abd0, g=0x17026c, start=256, err_ret=0xffbef9b0) objc=1, at parsetok.c:128 objv=0x24ec84) at ./../generic/tclBasic.c:1752 #56 0x20158 in PyParser_ParseFile (fp=0x18ebe8, filename=0xbf628 "<stdin>", #24 0xff13e98c in TclExecuteByteCode (interp=0x20e658, codePtr=0x2a0cd0) g=0x17026c, at ./../generic/tclExecute.c:845 start=256, ps1=0x1cbd9c ">>> ", ps2=0x25a7e4 "... ", err_ret=0xffbef9b0) #25 0xff122bf8 in Tcl_EvalObjEx (interp=0x20e658, objPtr=0x2370c8, flags=0) at parsetok.c:75 at ./../generic/tclBasic.c:2723 #57 0x3a9c0 in PyRun_InteractiveOne (fp=0x18ebe8, filename=0xbf628 "<stdin>") #26 0xff258220 in TkInvokeButton (butPtr=0x279188) at ./../generic/tkButton.c:1457 at pythonrun.c:514 #27 0xff257698 in ButtonWidgetObjCmd (clientData=0x279188, interp=0x20e658, objc=2, #58 0x3a8bc in PyRun_InteractiveLoop (fp=0x18ebe8, filename=0xbf628 "<stdin>") objv=0x295e00) at ./../generic/tkButton.c:835 at pythonrun.c:478 #28 0xff15e18c in EvalObjv (interp=0x20e658, objc=2, objv=0x295e00, #59 0x3a7ac in PyRun_AnyFileEx (fp=0x18ebe8, filename=0xbf628 "<stdin>", closeit=0) command=0xff182128 "", at pythonrun.c:453 length=0, flags=262144) at ./../generic/tclParse.c:932 #60 0x3a76c in PyRun_AnyFile (fp=0x18ebe8, filename=0xbf628 "<stdin>") at #29 0xff15e2b8 in Tcl_EvalObjv (interp=0x20e658, objc=2, objv=0x295e00, pythonrun.c:444 flags=262144) #61 0x1ff20 in Py_Main (argc=3, argv=0xffbefc74) at main.c:297 at ./../generic/tclParse.c:1019 #62 0x1f90c in main (argc=3, argv=0xffbefc74) at python.c:10 #30 0xff122928 in Tcl_EvalObjEx (interp=0x20e658, objPtr=0x2370e0, flags=262144) (gdb) at ./../generic/tclBasic.c:2565
  • 10.
    GDB Traceback (gdb) where #31 0xff165544 in Tcl_UplevelObjCmd (dummy=0x1, interp=0x20e658, objc=1, #0 0xff1d9bf0 in __sigprocmask () from /usr/lib/libthread.so.1 objv=0x24ec80) #1 0xff1ce628 in _resetsig () from /usr/lib/libthread.so.1 at ./../generic/tclProc.c:614 #2 0xff1cdd18 in _sigon () from /usr/lib/libthread.so.1 #32 0xff13e98c in TclExecuteByteCode (interp=0x20e658, codePtr=0x2a0b70) #3 0xff1d0e8c in _thrp_kill () from /usr/lib/libthread.so.1 at ./../generic/tclExecute.c:845 #4 0xfee49b10 in raise () from /usr/lib/libc.so.1 #33 0xff122bf8 in Tcl_EvalObjEx (interp=0x20e658, objPtr=0x274d50, flags=0) #5 0xfee3512c in abort () from /usr/lib/libc.so.1 at ./../generic/tclBasic.c:2723 #6 0xfee353d0 in _assert () from /usr/lib/libc.so.1 #34 0xff165afc in TclObjInterpProc (clientData=0x1, interp=0x20e658, objc=0, #7 0xfeee13ec in abort_crash () from /u0/beazley/Projects/WAD/WAD/Test/./ objv=0xffbeebd8) at ./../generic/tclProc.c:1001 debugmodule.so #35 0xff15e18c in EvalObjv (interp=0x20e658, objc=2, objv=0xffbeebd8, #8 0xfeee28ec in _wrap_abort_crash () command=0xffbef024 "n tkButtonUp .1907556n", length=25, flags=0) from /u0/beazley/Projects/WAD/WAD/Test/./debugmodule.so at ./../generic/tclParse.c:932 #9 0x281c8 in call_builtin (func=0x1cc4f0, arg=0x1f9424, kw=0x0) at ceval.c:2650 #36 0xff15e7d0 in Tcl_EvalEx (interp=0x20e658, #10 0x28094 in PyEval_CallObjectWithKeywords (func=0x1cc4f0, arg=0x1f9424, kw=0x0) script=0xffbef024 "n tkButtonUp .1907556n", numBytes=25, flags=-4264800) at ceval.c:2618 at ./../generic/tclParse.c:1393 #11 0x26764 in eval_code2 (co=0x1d37e0, globals=0x0, locals=0x1d37cf, args=0x1cc4f0, #37 0xff15e9c0 in Tcl_Eval (interp=0x20e658, argcount=1762552, kws=0x0, kwcount=0, defs=0x0, defcount=0, owner=0x0) at string=0xffbef024 "n tkButtonUp .1907556n") at ./../generic/tclParse.c:1512 ceval.c:1951 #38 0xff1243d0 in Tcl_GlobalEval (interp=0x20e658, (gdb) where #12 0x263a0 in eval_code2 (co=0x1d3858, globals=0x0, locals=0x1cc4f0, args=0x19b1a4, command=0xffbef024 "n tkButtonUp .1907556n") at ./../generic/tclBasic.c:4139 #0 0xff1d9bf0 in __sigprocmask () from /usr/lib/libthread.so.1 argcount=1883008, kws=0x1d7318, kwcount=0, defs=0x0, defcount=0, owner=0x0) at ceval.c:1850 #39 0xff221a40 in Tk_BindEvent (bindingTable=0xffbef024, eventPtr=0x29ffa0, tkwin=0x2790a8, numObjects=2045728, objectPtr=0xffbef170) at ./../generic/ #1 0xff1ce628 in _resetsig () from /usr/lib/libthread.so.1 #13 0x263a0 in eval_code2 (co=0x1d3e50, globals=0x0, locals=0x19b1a4, args=0x1a7374, argcount=1883128, kws=0x0, kwcount=0, defs=0x0, defcount=0, owner=0x0) at tkBind.c:1784 #40 0xff226450 in TkBindEventProc (winPtr=0x2790a8, eventPtr=0x29ffa0) ceval.c:1850 #2 0xff1cdd18 in _sigon () from /usr/lib/libthread.so.1 at ./../generic/tkCmds.c:244 #14 0x285e0 in call_function (func=0x1a73a4, arg=0x18f114, kw=0x0) at ceval.c:2772 #41 0xff22c218 in Tk_HandleEvent (eventPtr=0x29ffa0) at ./../generic/tkEvent.c:737 #3 0xff1d0e8c in _thrp_kill () from /usr/lib/libthread.so.1 #15 0x28080 in PyEval_CallObjectWithKeywords (func=0x1a73a4, arg=0x18f114, kw=0x0) #42 0xff22c61c in WindowEventProc (evPtr=0x29ff98, flags=-1) at ./../generic/ at ceval.c:2616 #4 0xfee49b10 in raise () from /usr/lib/libc.so.1 #16 0x680b0 in builtin_apply (self=0x0, args=0x0) at bltinmodule.c:88 tkEvent.c:1072 #43 0xff15bb54 in Tcl_ServiceEvent (flags=-1) at ./../generic/tclNotify.c:607 #5 0xfee3512c in abort () from /usr/lib/libc.so.1 #17 0x281c8 in call_builtin (func=0x1910c8, arg=0x1f9b54, kw=0x0) at ceval.c:2650 #18 0x28094 in PyEval_CallObjectWithKeywords (func=0x1910c8, arg=0x1f9b54, kw=0x0) #44 0xff15beec in Tcl_DoOneEvent (flags=-1) at ./../generic/tclNotify.c:846 #45 0x99314 in EventHook () at ./_tkinter.c:2020 at ceval.c:2618#6 0xfee353d0 in _assert () from /usr/lib/libc.so.1 #46 0xbaf30 in rl_read_key () at input.c:374 #19 0x26764 in eval_code2 (co=0x1f3948, globals=0x0, locals=0x1f38f0, args=0x1910c8, #47 0xac920 in readline_internal_char () at readline.c:454 #7 0xfeee13ec in abort_crash () from /u0/beazley/Projects/WAD/WAD/Test/ argcount=1733540, kws=0x0, kwcount=0, defs=0x0, defcount=0, owner=0x2436e4) #48 0xaca64 in readline_internal_charloop () at readline.c:507 at ceval.c:1951 ./debugmodule.so #20 0x285e0 in call_function (func=0x24374c, arg=0x1a606c, kw=0x0) at ceval.c:2772 #49 0xaca94 in readline_internal () at readline.c:521 #50 0xac704 in readline (prompt=0x1cbd9c ">>> ") at readline.c:349 #8 0xfeee28ec in _wrap_abort_crash () #21 0x28080 in PyEval_CallObjectWithKeywords (func=0x261414, arg=0x18f114, kw=0x0) at ceval.c:2616 #51 0x8249c in call_readline (prompt=0x1cbd9c ">>> ") at ./readline.c:462 #52 0x21ae0 in PyOS_Readline (prompt=0x1cbd9c ">>> ") at myreadline.c:118 from /u0/beazley/Projects/WAD/WAD/Test/./debugmodule.so #22 0x98064 in PythonCmd (clientData=0x1cc8e0, interp=0x20e658, argc=0, #53 0x205a0 in tok_nextc (tok=0x27abd0) at tokenizer.c:192 argv=0xffbee060) #54 0x20fb4 in PyTokenizer_Get (tok=0x27abd0, p_start=0xffbef8c4, p_end=0xffbef8c0) #9 0x281c8 in call_builtin (func=0x1cc4f0, arg=0x1f9424, kw=0x0) at at ./_tkinter.c:1274 at tokenizer.c:516 ceval.c:2650 #23 0xff122064 in TclInvokeStringCommand (clientData=0x278538, interp=0x20e658, objc=1, #55 0x20274 in parsetok (tok=0x27abd0, g=0x17026c, start=256, err_ret=0xffbef9b0) at parsetok.c:128 #10 0x28094 in PyEval_CallObjectWithKeywords (func=0x1cc4f0, objv=0x24ec84) at ./../generic/tclBasic.c:1752 #24 0xff13e98c in TclExecuteByteCode (interp=0x20e658, codePtr=0x2a0cd0) #56 0x20158 in PyParser_ParseFile (fp=0x18ebe8, filename=0xbf628 "<stdin>", g=0x17026c, arg=0x1f9424, kw=0x0) at ./../generic/tclExecute.c:845 start=256, ps1=0x1cbd9c ">>> ", ps2=0x25a7e4 "... ", err_ret=0xffbef9b0) #25 0xff122bf8 in Tcl_EvalObjEx (interp=0x20e658, objPtr=0x2370c8, flags=0) at parsetok.c:75 at ceval.c:2618 at ./../generic/tclBasic.c:2723 #57 0x3a9c0 in PyRun_InteractiveOne (fp=0x18ebe8, filename=0xbf628 "<stdin>") #26 0xff258220 in TkInvokeButton (butPtr=0x279188) at ./../generic/tkButton.c:1457 at pythonrun.c:514 #27 0xff257698 in ButtonWidgetObjCmd (clientData=0x279188, interp=0x20e658, objc=2, #58 0x3a8bc in PyRun_InteractiveLoop (fp=0x18ebe8, filename=0xbf628 "<stdin>") objv=0x295e00) at ./../generic/tkButton.c:835 at pythonrun.c:478 #28 0xff15e18c in EvalObjv (interp=0x20e658, objc=2, objv=0x295e00, #59 0x3a7ac in PyRun_AnyFileEx (fp=0x18ebe8, filename=0xbf628 "<stdin>", closeit=0) command=0xff182128 "", at pythonrun.c:453 length=0, flags=262144) at ./../generic/tclParse.c:932 #60 0x3a76c in PyRun_AnyFile (fp=0x18ebe8, filename=0xbf628 "<stdin>") at #29 0xff15e2b8 in Tcl_EvalObjv (interp=0x20e658, objc=2, objv=0x295e00, pythonrun.c:444 flags=262144) #61 0x1ff20 in Py_Main (argc=3, argv=0xffbefc74) at main.c:297 at ./../generic/tclParse.c:1019 #62 0x1f90c in main (argc=3, argv=0xffbefc74) at python.c:10 #30 0xff122928 in Tcl_EvalObjEx (interp=0x20e658, objPtr=0x2370e0, flags=262144) (gdb) at ./../generic/tclBasic.c:2565
  • 11.
    The Debugging Problem Problem • Debugger doesn’t know anything about script code • Mostly provides information about the implementation of the interpreter • Can’t fully answer question of “how did I get here?” Sometimes it is hard to reproduce a problem • Run-time environment may be complex • Problems may be due to timing or unusual event sequences • Problem may only occur after a long period of time • May not get any debugging information after a crash (no core file, no messages) Other issues • Requires users to run a separate debugger • Requires users to have a C development environment • Assumes users know how to use the debugger • Makes it difficult for an extension developer to get useful feedback Question: Can you do any better?
  • 12.
    Embedded Error Recovery Idea: • Most interpreters have exception/error handling. • Maybe fatal extension errors could be handled as exceptions Interpreter Interpreter DeathException Extension Extension Code Code Death (core dumped) Why this approach? • Allows interpreter to unwind its internal call stack and provide details • Might work better in a very complex execution environment • Self contained
  • 13.
    WAD Wrapped Application Debugger • A proof of concept implementation of embedded error recovery • A self-contained scripting language extension module • Requires no modifications or recompilation of any code Demo • Tcl/Tk • Mixed language debugging in Python • Apache + mod_python
  • 14.
    WAD Demo % python >>>import debug >>> debug.seg_crash() Segmentation fault (core dumped) % % python >>> import debug >>> import libwadpy WAD Enabled >>> debug.seg_crash() Traceback (most recent call last): File "<stdin>", line 1, in ? SegFault: [ C stack trace ] #2 0x000281c0 in call_builtin(func=0x1cbaf0,arg=0x18f114,kw=0x0) in 'ceval.c', line 2650 #1 0xfeee26b8 in _wrap_seg_crash(self=0x0,args=0x18f114) in 'pydebug.c', line 510 #0 0xfeee1258 in seg_crash(0x1,0xfeef2d48,0x19a9f8,0x0,0x7365675f,0x5f5f6469) in 'debug.c', line 18 /u0/beazley/Projects/WAD/WAD/Test/debug.c, line 18 int seg_crash() { int *a = 0; => *a = 3; return 1; }
  • 15.
    Big Picture WAD • WAD is a dynamically loadable extension module • Converts catastrophic extension errors to scripting exceptions Key features • No modifications to interpreter or extension code • No recompilation • No relinking • No separate debugger required (gdb, dbx, etc.) • No C, C++, Fortran development environment needed • No added performance penalty (no instrumentation, tracing, etc.) The rest of this talk • How it actually works • Limitations • Discussion
  • 16.
    Calling Extension Code Scriptinglanguages provide a foreign function interface Scripting Interpreter call_builtin() NULL, args (result,status) TCL_OK, TCL_ERROR, ... Wrappers args result C/C++ Extension • A small number of exit points from the interpreter (e.g., call_builtin above) • External functions are often wrappers that call the real function • Success indicated by returning a special status code (e.g., NULL, TCL_OK, ...) • Returning an error status generates an interpreter exception/error
  • 17.
    Extension Errors Fatal errorsresult in signals Scripting Interpreter call_func() args Wrappers args C/C++ Extension signal core dumped • SIGSEGV, SIGBUS, SIGABRT, SIGILL, SIGFPE, etc.
  • 18.
    Catching Extension Errors Installa special signal handler Scripting Interpreter call_func() args error return from signal Wrappers WAD 1. Inspect stack args 2. Collect info C/C++ Extension signal 3. Raise exception 4. Return • On signal, inspect process stack and collect debugging information • Force a return to the interpreter with a suitable error status • Interpreter doesn’t know anything "bad" happened (looks like a normal error)
  • 19.
    WAD Implementation -Signals signal get unwind stack context symbols/debug raise exception signal modify context return Signal handling • Use sigaction() with special options for receiving process context • Handler gets PC, SP and other CPU registers , • Modifications to context take effect on return from handler • Commonly used to implement user threads • We’re going to use it to back out of extension code
  • 20.
    WAD Implementation -Unwinding signal get 0x0001f7c4 unwind stack context 0x0001f904 0x0001fedc 0x0003a7c8 symbols/debug 0x000281c0 ... 0xfeee241c 0xfeee1178 raise exception 0x000bee2c 0xfee350e4 0xfee49b08 signal 0xff1d0e84 modify context 0xff1cdd10 return 0xff1d9bf0 error Stack unwinding • Read process virtual memory map from /proc • Unwind entire call stack of the process using PC and SP from process context • Use memory map to validate (in case of corrupted stack frames) • Get list of PC values and raw stack frames
  • 21.
    WAD Implementation -Symbols signal signal _start unwind stack handler main Py_Main ... symbols/debug PyEval_CallObjectW call_builtin _wrap_spam spam raise exception __eprintf abort raise signal _thrp_kill modify context _sigon return __sigprocmask Symbols and debugging information • Collect debugging information from ELF symbol table and STABS • Get list of function names, source files, line numbers, arguments, etc. • Final result is a linked-list of annotated stack frames.
  • 22.
    WAD Implementation -Exceptions signal signal _start unwind stack handler main Py_Main ... Interpreter symbols/debug PyEval_CallObjectW call_builtin _wrap_spam spam raise exception __eprintf abort raise Extension Code signal _thrp_kill search modify context _sigon return __sigprocmask Raising an exception • Search the call stack to find the first call to extension code (e.g., call_builtin) • Note: this depends on the scripting language (more later) • If match found, stack trace is passed to a language-specific handler function • Handler raises an appropriate scripting language exception with stack trace info
  • 23.
    WAD Implementation -Context signal signal _start unwind stack handler main Py_Main ... symbols/debug PyEval_CallObjectW call_builtin _wrap_spam spam raise exception __eprintf abort raise signal _thrp_kill modify context _sigon return __sigprocmask Context modification • Bottom of call stack is chopped off
  • 24.
    WAD Implementation -Context signal signal _start unwind stack handler main Py_Main ... symbols/debug PyEval_CallObjectW call_builtin raise exception error signal modify context return Context modification • Bottom of call stack is chopped off • PC, SP and registers modified to emulate a function return with an error status , • Signal handler returns • Details a little hairy (more shortly)
  • 25.
    Different Scripting Languages Mostof WAD is scripting language neutral • Signal handling, unwinding, generation of stack trace, context modification Each scripting language handled as an extension Python module register syms = { symbol {"call_builtin", NULL}, table {"PyObject_Print", -1}, WAD {"PyObject_Repr",NULL}, Core ... } symbol name return value match void handle_error(stack *s) { error handler found ... } • Module registers names of interpreter functions that might call extension code • Return value is the value an extension function must return to signal an error • Handler only invoked if WAD matches symbol name on call stack
  • 26.
    Context Modification Return mechanismis similar to • setjmp/longjmp in C • C++ exception handling. However • The interpreter is not instrumented or modified in any way. • There is no corresponding setjmp() call. • There is no matching try { ... } clause in C++. Problem : corrupted CPU registers • Interpreter not designed with aborted procedure return in mind • By chopping off call stack, we return to the interpreter with inconsistent state • Callee-save registers not restored properly However, we can mostly fix this...
  • 27.
    Register Recovery Solution: SPARC • Each procedure gets a fresh set of CPU registers (i.e., a “window”) • To restore state: simply roll back the register windows Solution: i386 • Inspect machine instructions of function prologues • Figure out where callee-save registers were saved on call-stack • Restore values while walking up the stack. blah: 55 pushl %ebp 89 e5 movl %esp,%ebp 83 ec 2c subl $0x2c,%esp Size of locals 57 pushl %edi 56 pushl %esi Saved registers 53 pushl %ebx • Only a heuristic. Might get it wrong, but it usually works.
  • 28.
    Implementation Details WAD implementation • Mostly ANSI C, some assembly code, some C++ • ~1500 semicolons • Most code related to introspection (debugging, symbol tables, etc...) • Core is scripting language independent • Execution is isolated (own stack and memory management). • Does not rely upon third party libraries (e.g., libbfd). Compatibility • Sun Sparc Solaris • i386 GNU/Linux (recent kernels). • Python and Tcl • Perl (was working, currently broken)
  • 29.
    Limitations Aborted execution • May leak memory • No destruction of objects in C++ • May interact poorly with C++ exceptions • May result in unreleased system resources (files, sockets, etc.) • May result in deadlock (if holding locks when error occurs) Unrecoverable errors • Errors that destroy or corrupt the interpreter • Stack overflow (results in corrupted call-stack) Compiler optimization • False reporting of debugging data, source files, and lines • Incorrect register recovery (-fomit-frame-pointer)
  • 30.
    Limitations Threads • Mixing threads and signals is an extremely delicate topic • WAD requires a fully functional signal implementation • Some versions older versions of GNU/Linux do not seem to work correctly Threads present other issues • Signals of interest are handled by the faulting thread • However, some interpreters place a mutex lock around the interpreter Python Extension acquire_lock() ... func() { call_builtin() store_thread_state() release_lock() abort ... run ... ??? acquire_lock() restore_thread_state() success ... return release_lock() } • If signal occurs while lock released, we violate mutual exclusion on interpreter • May deliver exception to wrong thread (probably bad)
  • 31.
    Limitations Interference with normalsignal handling • Scripting languages also provide support for signals • Signals caught by WAD are unlikely to be used in a script • One exception : Perl sigtrap module • User can disable WAD by installing a new signal handler Stripped symbol tables • May make it difficult to find return location • May provide only limited debugging information • Can find some symbols in dynamic loading tables, but may not be enough Things that just won’t work • Breakpoints • Single-step execution • Restart • WAD lives in the same process
  • 32.
    Discussion What happens ifWAD breaks? • Worst case: program core dumps • Can still load core in debugger • Will get a few extra stack frames (for WAD) Does WAD prevent the use of a debugger? • Traditional debugger can still be attached to process • Debugger overrides signal handling (which disables WAD) Repeated errors • If interpreter is corrupted, may crash immediately after return • This reinvokes WAD with new context information • May get several different tracebacks, followed by a core dump. Continued execution • After an error occurs, is it safe to resume execution? • Depends on the error • Depends on the application • Primary goal is to get error information, not to run forever
  • 33.
    Discussion Wouldn’t it beeasier to modify the interpreter? • Use of setjmp()/longjmp() would simplify implementation, avoid register problems • However, number of external exits may be substantial (extension types) • Many interpreters are re-entrant (extensions can call eval) • Performance impact? • Difficult to maintain unless patches made part of interpreter distribution • Not nearly as cool How hard is it to port WAD to other platforms? • WAD language modules are platform-neutral • Operate on data that is presented in a generic format • WAD core would need to be modified • /proc, process context, register recovery, ELF, STABS, etc. • Note: Solaris and GNU/Linux versions are nearly identical however.
  • 34.
    Related Work Bits andpieces in many different areas • Programming environments (Common Lisp, Java?) • Modifications to gdb (common lisp, Python, others?) • Perl sigtrap (prints a perl stack trace on fatal signal) • Exception handling • Asynchronous exception handling in programming languages (ML, Haskell, etc.) • PyDebug (persistent breakpoints, interface to gdb from Python) More references? • Debugging and exception handling are huge topics • Not aware of anything quite like WAD in scripting language world • Would be interested in getting feedback
  • 35.
    Future Directions A lotof people are building scripting language extensions • Debugging support can be greatly improved • Hope to see more work in this area WAD • Improve the core • Better interaction with threads and C++ • Support for more languages • Support for other platforms • Interaction with other debuggers and IDEs Bottom line • Still a lot of hard problems to work on
  • 36.
    More Information http://systems.cs.uchicago.edu/wad • Freely available under LGPL • This is work in progress • Volunteers welcome • I’m also looking for students