Skip to content

Conversation

@shym
Copy link
Contributor

@shym shym commented Feb 20, 2023

Global variable error_buffer is used to store a string that is returned to callers, so there is a race condition if dynamic linking is invoked from 2 OCaml domains in parallel
Since the error message must be returned, a mutex cannot be used to prevent the race condition
GNU libc uses that same solution: keep the last error in thread-local storage; so support for calling dlerror from a different thread than the one calling dlopen is not to be expected

The problem with parallel accesses was discovered investigating segfaults in ocaml-multicore/multicoretests#290.
With this patch, these tests do not segfault.
I see the following 4 dynlink tests from the ocaml test suite failing, but they seem to fail whether or not this patch is applied and the error message seems not related.

> run_win32.c:365: CreateProcess failed: The system cannot find the file specified. […] List of failed tests: tests/lib-dynlink-csharp /'main.ml' with 1.1.4.1.1.1.1 (script) tests/lib-dynlink-csharp /'main.ml' with 1.1.3.1.1.1 (script) tests/lib-dynlink-csharp /'main.ml' with 1.1.2.1.1.1.1 (script) tests/lib-dynlink-csharp /'main.ml' with 1.1.1.1.1.1 (script) 
@dra27 dra27 closed this Mar 2, 2023
@dra27 dra27 reopened this Mar 2, 2023
@dra27
Copy link
Member

dra27 commented Mar 2, 2023

(closed and re-opened to recompute the merge commit now that #115 is merged, so CI should have something useful to say!)

@dra27
Copy link
Member

dra27 commented Mar 3, 2023

Hmm, this is looking tedious... it looks like we're pulling in a runtime library somewhere from the error. I'm guessing that /usr/i686-w64-mingw32/sys-root/mingw/bin (and the x64 equivalent) need adding to PATH for this to work, although we should nail down exactly which DLL it's trying to pull in. That's a bit too heavy for this - we might instead need to hand-roll the native Windows version using TlsAlloc et al.

Pack together the current error code and message into a single structure Ease the transition to putting the error into thread-local storage
@shym
Copy link
Contributor Author

shym commented Mar 21, 2023

I updated the PR with an implementation using Tls* functions. This turned out a bit more involved than what I first thought, since TlsGetValue modifies the result of GetLastError which we want to preserve. I hope the comments are enough to clarify the implementation choices.

Reproducing locally the test that failed, I get a pop-up saying libwinpthreads-1.dll is missing, maybe too big a dependency for that single feature.

@nojb
Copy link
Collaborator

nojb commented Apr 15, 2023

Hello @shym: just a heads-up that I am planning to read this PR soon. Thank you for your patience!

nojb
nojb previously approved these changes Apr 18, 2023
Copy link
Collaborator

@nojb nojb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this patch! The code is extremely clear and a pleasure to review :)

LGTM (modulo a small question)

@shym
Copy link
Contributor Author

shym commented Apr 19, 2023

Thanks you for your kind and thorough review!
I've updated the branch removing the goto.
After looking at another PR, I added an entry to CHANGES.

int flexdll_relocate(void *tbl) {
err_t * err;
err = get_tls_error(TLS_ERROR_RESET_LAST);
if(err == NULL) return 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about this, shouldn't we reset err->code = 0 here? Otherwise, the check below in line 460 will fail if this function is called after another function that has set err->code.

Going one step further, perhaps when in TLS_ERROR_RESET_LAST mode, we should always set err->code = 0. Or is there a case where we want to reset one of the error codes, but not the other?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point, thank you very much!

I think I ended up with that code because it was not explicitly reset in the original code. That was arguably correct because flexdll_relocate is called from two places (if I didn't miss any other):

  • from flexdll_dlopen where code has already been reset,
  • from flexdll_init where code has been set to its initial value 0.
    This made me realize that I had forgotten to initialize the values when they are malloc-ed!

So I’ve rewritten the code so that:

  • on TLS_ERROR_RESET, both code and last_error are reset (so no _LAST),
  • the explicit reset of code near the call to get_tls_error(TLS_ERROR_RESET) are removed,
  • the structure is initialized right after malloc, just to make sure; the structure should be malloc-ed on a call to one of the initialisation entrypoints, in which case it will be initialised again just a few lines later, but that will ensure that a buggy program calling dlerror without a previous call to dlopen will get a reliable reasonable behaviour.
Copy link
Collaborator

@nojb nojb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the back-and-forth, but it turns out that there is a function SetLastError https://learn.microsoft.com/en-us/windows/win32/api/errhandlingapi/nf-errhandlingapi-setlasterror
Couldn't we use that instead to restore the result of the call to GetLastError after calling the Tls* functions? It should simplify the code (no need for the last_error field).

@nojb nojb dismissed their stale review April 20, 2023 11:54

Still polishing the patch

@shym
Copy link
Contributor Author

shym commented Apr 21, 2023

Very good idea indeed!
I noticed that the documentation for SetLastError explicit states that the last error is stored in TLS and that values with bit 29 set are reserved for user errors. But I didn’t find a trick to reuse that to fully skip using TLS explicitly ourselves, especially since POSIX’s dlerror must report the last error of a dl function, so other functions must not interfere with the result it will report.
So I just updated the patch removing the last_error field and merging all the non-resetting behaviours. It’s a lot simpler to read.

Copy link
Collaborator

@nojb nojb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks!

flexdll.c Outdated
switch (error) {
err_t * err;
err = get_tls_error(TLS_ERROR_NOP);
if(err == NULL) return "error in accessing thread-local storage";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if(err == NULL) return "error in accessing thread-local storage";
if(err == NULL) return "error accessing thread-local storage";
flexdll.c Outdated
DWORD msglen;
err_t * err;
err = get_tls_error(TLS_ERROR_NOP);
if(err == NULL) return "error in accessing thread-local storage";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if(err == NULL) return "error in accessing thread-local storage";
if(err == NULL) return "error accessing thread-local storage";
shym and others added 2 commits April 21, 2023 12:08
Move the last error into thread-local storage to avoid data races (and thus possible segmentation faults) when the code is used in a multithreaded setting Add a get_tls_error function to access explicitly the thread-local error to bypass limited compiler support for it (`__thread`, etc.) Pass explicitly the current error variable to internal functions to avoid calling get_tls_error when possible Document the mechanism used for TLS errors, to explain its unexpected complexity As a side-effect of that reorganisation of the code, the code of the last error is explicitly reset on all initialisation entry points (flexdll_dlopen, flexdll_wdlopen, flexdll_relocate), even when it was missing before Co-authored-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
@shym
Copy link
Contributor Author

shym commented Apr 21, 2023

Just changed the error message, and added due credit! 😄

@nojb nojb merged commit bae7593 into ocaml:master Apr 21, 2023
@nojb
Copy link
Collaborator

nojb commented Apr 21, 2023

Thanks, merged! (I took the liberty of squashing the commits into a single commit; this makes it easier to revert, cherry-pick, etc.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

3 participants