Put error global variables into thread-local storage #112

shym · 2023-02-20T13:56:31Z

The global variable error_buffer is used to store a string that is returned to callers, so there is a race condition if dynamic linking is invoked from two OCaml domains in parallel.
Since the error message must be returned to the caller, a mutex cannot prevent the race: the caller reads the returned string after the lock would already have been released.
GNU libc uses the same solution: it keeps the last error in thread-local storage. As a consequence, calling dlerror from a different thread than the one that called dlopen is not expected to report that error.
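A minimal sketch of the idea, assuming POSIX threads: the PR itself targets Windows and ends up using the Tls* functions, so pthread_key_create, pthread_getspecific and pthread_setspecific stand in here as their POSIX counterparts. The names set_error, last_error, hammer and run_two_threads are hypothetical. Each thread lazily gets its own buffer, so a string returned to one thread cannot be overwritten by another.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ERR_LEN 256

static pthread_key_t err_key;
static pthread_once_t err_once = PTHREAD_ONCE_INIT;
static int err_key_ok = 0;

static void make_err_key(void) {
  /* free() is called on a thread's buffer when that thread exits */
  err_key_ok = (pthread_key_create(&err_key, free) == 0);
}

/* This thread's error buffer, allocated on first use; NULL on failure. */
static char *thread_error_buffer(void) {
  char *buf;
  pthread_once(&err_once, make_err_key);
  if (!err_key_ok) return NULL;
  buf = pthread_getspecific(err_key);
  if (buf == NULL) {
    buf = calloc(1, ERR_LEN);
    if (buf != NULL && pthread_setspecific(err_key, buf) != 0) {
      free(buf);
      buf = NULL;
    }
  }
  return buf;
}

static void set_error(const char *msg) {
  char *buf = thread_error_buffer();
  if (buf != NULL) snprintf(buf, ERR_LEN, "%s", msg);
}

/* dlerror-like: the returned string is owned by, and only written by,
   the calling thread. */
static const char *last_error(void) {
  char *buf = thread_error_buffer();
  return buf != NULL ? buf : "error accessing thread-local storage";
}

/* Repeatedly set and read back an error; return non-NULL if another
   thread's message ever showed up. */
static void *hammer(void *arg) {
  const char *msg = arg;
  for (int i = 0; i < 100000; i++) {
    set_error(msg);
    if (strcmp(last_error(), msg) != 0) return (void *)1;
  }
  return NULL;
}

/* Returns 1 if two threads never saw each other's messages. */
static int run_two_threads(void) {
  pthread_t a, b;
  void *ra = NULL, *rb = NULL;
  if (pthread_create(&a, NULL, hammer, (void *)"cannot load a.dll") != 0) return 0;
  if (pthread_create(&b, NULL, hammer, (void *)"cannot load b.dll") != 0) {
    pthread_join(a, NULL);
    return 0;
  }
  pthread_join(a, &ra);
  pthread_join(b, &rb);
  return ra == NULL && rb == NULL;
}
```

With a single static buffer instead of the per-thread one, the check in hammer can fail intermittently, which is the kind of interference described above.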

The problem with parallel accesses was discovered investigating segfaults in ocaml-multicore/multicoretests#290.
With this patch, these tests do not segfault.
The following 4 dynlink tests from the OCaml test suite fail, but they seem to fail whether or not this patch is applied, and the error message seems unrelated:

> run_win32.c:365: CreateProcess failed: The system cannot find the file specified.
> […]
> List of failed tests:
> tests/lib-dynlink-csharp /'main.ml' with 1.1.4.1.1.1.1 (script)
> tests/lib-dynlink-csharp /'main.ml' with 1.1.3.1.1.1 (script)
> tests/lib-dynlink-csharp /'main.ml' with 1.1.2.1.1.1.1 (script)
> tests/lib-dynlink-csharp /'main.ml' with 1.1.1.1.1.1 (script)

dra27 · 2023-03-02T17:08:29Z

(closed and re-opened to recompute the merge commit now that #115 is merged, so CI should have something useful to say!)

flexdll.c

dra27 · 2023-03-03T08:53:06Z

Hmm, this is looking tedious... From the error, it looks like we're pulling in a runtime library somewhere. I'm guessing that /usr/i686-w64-mingw32/sys-root/mingw/bin (and the x64 equivalent) need adding to PATH for this to work, although we should nail down exactly which DLL it's trying to pull in. That's a bit too heavy for this, though; we might instead need to hand-roll the native Windows version using TlsAlloc et al.

shym · 2023-03-21T12:22:21Z

I updated the PR with an implementation using the Tls* functions. This turned out to be a bit more involved than I first thought, since TlsGetValue modifies the value returned by GetLastError, which we want to preserve. I hope the comments are enough to clarify the implementation choices.

When reproducing the failing test locally, I get a pop-up saying that libwinpthread-1.dll is missing, which may be too big a dependency for that single feature.

nojb · 2023-04-15T16:22:22Z

Hello @shym: just a heads-up that I am planning to read this PR soon. Thank you for your patience!

nojb

Thank you for this patch! The code is extremely clear and a pleasure to review :)

LGTM (modulo a small question)

flexdll.c

shym · 2023-04-19T10:08:54Z

Thank you for your kind and thorough review!
I've updated the branch to remove the goto.
After looking at another PR, I also added an entry to CHANGES.

nojb · 2023-04-19T12:36:09Z

flexdll.c

 int flexdll_relocate(void *tbl) {
+ err_t * err;
+ err = get_tls_error(TLS_ERROR_RESET_LAST);
+ if(err == NULL) return 0;


Thinking more about this, shouldn't we reset err->code = 0 here? Otherwise, the check below in line 460 will fail if this function is called after another function that has set err->code.

Going one step further, perhaps when in TLS_ERROR_RESET_LAST mode, we should always set err->code = 0. Or is there a case where we want to reset one of the error codes, but not the other?

Very good point, thank you very much!

I think I ended up with that code because code was not explicitly reset in the original version. That was arguably correct, because flexdll_relocate is called from two places (unless I missed one):

from flexdll_dlopen where code has already been reset,

from flexdll_init where code has been set to its initial value 0.
This made me realize that I had forgotten to initialize the values when they are malloc-ed!

So I’ve rewritten the code so that:

on TLS_ERROR_RESET, both code and last_error are reset (hence no more _LAST variant),

the explicit resets of code next to the calls to get_tls_error(TLS_ERROR_RESET) are removed,

the structure is initialized right after malloc, just to be safe: it should be malloc-ed by a call to one of the initialisation entry points, in which case it is initialised again a few lines later, but this ensures that a buggy program calling dlerror without a previous call to dlopen gets reliable, reasonable behaviour.
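The resulting accessor can be sketched as follows. The names err_t, get_tls_error, TLS_ERROR_NOP, TLS_ERROR_RESET and the code field come from this discussion; the message field, the err_mode_t type and my_dlerror are hypothetical, and POSIX thread keys again stand in for TlsAlloc/TlsGetValue/TlsSetValue. This is an illustration of the described behaviour, not the PR's code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: an error code plus a message buffer. */
typedef struct {
  int code;            /* 0 means "no error" */
  char message[256];
} err_t;

typedef enum { TLS_ERROR_NOP, TLS_ERROR_RESET } err_mode_t;

static pthread_key_t err_key;
static pthread_once_t err_once = PTHREAD_ONCE_INIT;
static int err_key_ok = 0;

static void create_err_key(void) {
  err_key_ok = (pthread_key_create(&err_key, free) == 0);
}

/* Return this thread's error structure, allocating it on first use.
   TLS_ERROR_RESET clears both fields, as every initialisation entry point
   should request. Returns NULL if thread-local storage is unavailable. */
static err_t *get_tls_error(err_mode_t mode) {
  err_t *err;
  pthread_once(&err_once, create_err_key);
  if (!err_key_ok) return NULL;
  err = pthread_getspecific(err_key);
  if (err == NULL) {
    err = malloc(sizeof *err);
    if (err == NULL) return NULL;
    /* Initialise right after malloc, so a dlerror-like call made before any
       dlopen-like call still sees a well-defined "no error" state. */
    memset(err, 0, sizeof *err);
    if (pthread_setspecific(err_key, err) != 0) {
      free(err);
      return NULL;
    }
  }
  if (mode == TLS_ERROR_RESET) memset(err, 0, sizeof *err);
  return err;
}

/* Hypothetical dlerror-like accessor: NULL when there is no pending error. */
static const char *my_dlerror(void) {
  err_t *err = get_tls_error(TLS_ERROR_NOP);
  if (err == NULL) return "error accessing thread-local storage";
  if (err->code == 0) return NULL;
  return err->message;
}
```

Calling my_dlerror before anything else returns NULL thanks to the initialisation after malloc, and a TLS_ERROR_RESET lookup at an entry point clears any stale code left by an earlier call.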

nojb

Sorry for the back-and-forth, but it turns out that there is a function SetLastError https://learn.microsoft.com/en-us/windows/win32/api/errhandlingapi/nf-errhandlingapi-setlasterror
Couldn't we use it instead to restore the value returned by GetLastError after calling the Tls* functions? It should simplify the code (no need for the last_error field).
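The SetLastError suggestion boils down to saving the thread's last-error value before touching TLS and restoring it afterwards (Microsoft documents that TlsGetValue clears the last error when it succeeds). Since the Windows API is not available everywhere, this sketch shows the same save/restore pattern with errno and POSIX thread keys, with the Windows calls noted in comments. The helper name get_slot_preserving_errno is hypothetical and this is not the PR's code.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t slot_key;
static pthread_once_t slot_once = PTHREAD_ONCE_INIT;
static int slot_key_ok = 0;

static void create_slot_key(void) {
  slot_key_ok = (pthread_key_create(&slot_key, free) == 0);
}

/* Fetch (or lazily allocate) this thread's slot, leaving errno exactly as
   the caller left it. Returns NULL on failure. */
static int *get_slot_preserving_errno(void) {
  int saved = errno;                  /* Windows: DWORD saved = GetLastError(); */
  int *slot = NULL;
  pthread_once(&slot_once, create_slot_key);
  if (slot_key_ok) {
    slot = pthread_getspecific(slot_key);
    if (slot == NULL) {
      slot = calloc(1, sizeof *slot); /* may change errno, even on success */
      if (slot != NULL && pthread_setspecific(slot_key, slot) != 0) {
        free(slot);
        slot = NULL;
      }
    }
  }
  errno = saved;                      /* Windows: SetLastError(saved); */
  return slot;
}
```

Because the caller's error indicator is restored in place, nothing needs to be stashed in the per-thread structure, which is where the simplification comes from.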

Still polishing the patch

shym · 2023-04-21T09:36:51Z

Very good idea indeed!
I noticed that the documentation for SetLastError explicitly states that the last error is stored in TLS, and that bit 29 is reserved for application-defined error codes. However, I didn't find a way to reuse that to avoid managing TLS explicitly ourselves, especially since POSIX's dlerror must report the last error raised by a dl* function, so other functions must not interfere with the result it reports.
So I just updated the patch, removing the last_error field and merging all the non-resetting behaviours. It's a lot simpler to read.

nojb

Looks good to me, thanks!

nojb · 2023-04-21T09:44:48Z

flexdll.c

- switch (error) {
+ err_t * err;
+ err = get_tls_error(TLS_ERROR_NOP);
+ if(err == NULL) return "error in accessing thread-local storage";


Suggested change:

- if(err == NULL) return "error in accessing thread-local storage";
+ if(err == NULL) return "error accessing thread-local storage";

nojb · 2023-04-21T09:45:10Z

flexdll.c

+ DWORD msglen;
+ err_t * err;
+ err = get_tls_error(TLS_ERROR_NOP);
+ if(err == NULL) return "error in accessing thread-local storage";


Suggested change:

- if(err == NULL) return "error in accessing thread-local storage";
+ if(err == NULL) return "error accessing thread-local storage";

Move the last error into thread-local storage to avoid data races (and thus possible segmentation faults) when the code is used in a multithreaded setting.
Add a get_tls_error function to access the thread-local error explicitly, bypassing limited compiler support for thread-local variables (`__thread`, etc.).
Pass the current error variable explicitly to internal functions to avoid calling get_tls_error when possible.
Document the mechanism used for TLS errors, to explain its unexpected complexity.
As a side effect of this reorganisation, the last error code is now explicitly reset on all initialisation entry points (flexdll_dlopen, flexdll_wdlopen, flexdll_relocate), including those where the reset was missing before.

Co-authored-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
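A single-threaded sketch of the "pass the error explicitly" design from the commit message: the entry point looks up (and resets) the error once, and internal helpers receive the err_t pointer as a parameter. lookup_error, relocate_section and open_module are hypothetical stand-ins, and lookup_error uses a plain static instead of real TLS so that it can count how often it is called.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
  int code;
  char message[128];
} err_t;

/* Stand-in for get_tls_error(): a plain static, so this demo is
   single-threaded. It counts lookups to show how often it is hit. */
static err_t demo_error;
static int lookups = 0;

static err_t *lookup_error(int reset) {
  lookups++;
  if (reset) memset(&demo_error, 0, sizeof demo_error);
  return &demo_error;
}

/* Internal helper: records failures through the err it was given,
   instead of looking the error up again. */
static int relocate_section(err_t *err, int index, int bad_index) {
  if (index == bad_index) {
    err->code = 1;
    snprintf(err->message, sizeof err->message,
             "cannot relocate section %d", index);
    return -1;
  }
  return 0;
}

/* Entry point: one lookup (with reset), then err is threaded through. */
static int open_module(int nsections, int bad_index) {
  err_t *err = lookup_error(1);
  if (err == NULL) return -1;
  for (int i = 0; i < nsections; i++)
    if (relocate_section(err, i, bad_index) != 0) return -1;
  return 0;
}
```

However many helpers run, each call to open_module performs exactly one lookup, which is the point of threading the pointer through instead of fetching it in every helper.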

shym · 2023-04-21T10:11:48Z

Just changed the error message, and added due credit! 😄

nojb · 2023-04-21T11:20:35Z

Thanks, merged! (I took the liberty of squashing the commits into a single commit; this makes it easier to revert, cherry-pick, etc.)

dra27 closed this Mar 2, 2023

dra27 reopened this Mar 2, 2023

dra27 reviewed Mar 2, 2023

flexdll.c (outdated, resolved)

shym force-pushed the error-in-tls branch from 611d5c5 to 4e88227 Compare March 3, 2023 07:49

Group error code and message into a structure

9b046a5

Pack together the current error code and message into a single structure.
Ease the transition to putting the error into thread-local storage.

shym force-pushed the error-in-tls branch from 4e88227 to 6a5438a Compare March 21, 2023 11:13

nojb previously approved these changes Apr 18, 2023

flexdll.c (outdated, resolved)

shym force-pushed the error-in-tls branch from 6a5438a to 95473e8 Compare April 19, 2023 09:57

nojb reviewed Apr 19, 2023

shym force-pushed the error-in-tls branch from 0574f2a to 41e65c6 Compare April 19, 2023 16:57

nojb reviewed Apr 20, 2023

shym force-pushed the error-in-tls branch from 41e65c6 to e555a84 Compare April 21, 2023 09:35

nojb approved these changes Apr 21, 2023

shym and others added 2 commits April 21, 2023 12:08

Add a CHANGES entry

fe87994

shym force-pushed the error-in-tls branch from e555a84 to fe87994 Compare April 21, 2023 10:11

nojb merged commit bae7593 into ocaml:master Apr 21, 2023

shym deleted the error-in-tls branch April 21, 2023 12:48

jmid mentioned this pull request Mar 15, 2024

[ocaml5-issue] Deadlock in Dynlink test on Cygwin+MinGW+MSVC ocaml-multicore/multicoretests#307

Closed

jmid mentioned this pull request Mar 22, 2024

Parallel Dynlink usage under Cygwin+MinGW is unsafe ocaml/ocaml#13046

Closed

jmid mentioned this pull request Apr 16, 2024

Fix parallel access crashes and misbehavior #136

Merged
