Skip to content

Conversation

@pyrmont
Copy link
Contributor

@pyrmont pyrmont commented Sep 16, 2025

This PR fixes a problem I introduced via #1575, namely that when built on Windows platforms Janet can hang when ev/deadline is called with a function that infinitely loops. The fix is to use SetEvent rather than QueueUserAPC with an increasing back-off if a timer completes in less time than the user had specified.

Background

The Windows implementation of ev/deadline uses a combination of QueueUserAPC and SleepEx to allow a worker thread to interrupt the main thread running the Janet VM if a timeout occurs. While the implementations used on POSIX systems seem to work reliably, CI runs of the Janet project on Windows have, over the past six months, repeatedly hung during the testing of test/suite-ev.janet.

I am now relatively confident that the problem is a consequence of the resolution of timers on Windows. SleepEx (and WaitForSingleObject which this PR uses instead) claim to support millisecond precision. However, this precision is dependent on the length of the 'tick' of the system clock. If the amount of time is less than the tick, the function may return before the amount of time has elapsed. This creates a problem when testing test/suite-ev.janet because the second test of ev/deadline uses a timeout of 0.01 seconds (i.e. 10 milliseconds). What happens on a non-deterministic basis is that SleepEx (or WaitForSingleObject) will return after less than 10 milliseconds. This will cause the Janet VM to be interrupted and the event loop will iterate. However, on the next iteration, it is possible that the system clock is not at 10 milliseconds ahead and so the timeout in the Janet VM's queue will not trigger. The Janet VM will restart the fiber running the infinite loop and the system will hang.

An earlier version of this PR explored whether the problem related to the use of APC and even though that appears to not be the case, this PR uses event objects as a preferable signalling approach.

Implementation

This PR replaces the APC-based mechanism with an event object-based mechanism. It also introduces an increasing back-off mechanism in the timer. The key changes are in src/core/ev.c in janet_timeout_body:

- SleepEx((DWORD)(tto.sec * 1000), TRUE); - janet_interpreter_interrupt(tto.vm); - JanetEVGenericMessage msg = {0}; - janet_ev_post_event(tto.vm, janet_timeout_cb, msg); + JanetTimestamp wait_begin = ts_now(); + DWORD duration = (DWORD)round(tto.sec * 1000); + DWORD res = WAIT_TIMEOUT; + JanetTimestamp wait_end = ts_now(); + for (size_t i = 1; res == WAIT_TIMEOUT && (wait_end - wait_begin) < duration; i++) { + res = WaitForSingleObject(tto.cancel_event, (duration + i)); + wait_end = ts_now(); + } + /* only send interrupt message if result is WAIT_TIMEOUT */ + if (res == WAIT_TIMEOUT) { + janet_interpreter_interrupt(tto.vm); + JanetEVGenericMessage msg = {0}; + janet_ev_post_event(tto.vm, janet_timeout_cb, msg); + }

and in cfun_ev_deadline:

- HANDLE worker = CreateThread(NULL, 0, janet_timeout_body, tto, 0, NULL); + HANDLE cancel_event = CreateEvent(NULL, TRUE, FALSE, NULL); + if (NULL == cancel_event) { + janet_free(tto); + janet_panic("failed to create cancel event"); + } + tto->cancel_event = cancel_event; + HANDLE worker = CreateThread(NULL, 0, janet_timeout_body, tto, CREATE_SUSPENDED, NULL);

With these changes, I was able to get 10 runs on my fork of Janet to work without incident. I was not previously able to get more than 5 runs.

Limitations

The memory needs of the JanetThreadedTimeout struct and JanetTimeout struct are increased on Windows because an additional pointer needs to be allocated for the semaphore event.

Additionally, the implementation does not send a message for other return values that WaitForSingleObject can return. I think that's all right for WAIT_OBJECT_O but perhaps some kind of error should be raised if WAIT_FAILED is returned.

@pyrmont
Copy link
Contributor Author

pyrmont commented Sep 16, 2025

This comment related to a version of the PR before it was rewritten.

Judging from the fact it's been six minutes since I posted and three of the Windows CI runs are still going (and all are in test/suite-ev.janet) this obviously didn't work. I'll put it into draft and explore things a bit more.

@pyrmont pyrmont marked this pull request as draft September 16, 2025 02:37
@bakpakin
Copy link
Member

Thanks for looking into this. A month or two back I spent a Saturday looking into this and narrowed it down to the ev/deadline code and how thread cancellation was working, but was unable to fully understand the issue and come up with something that always worked. That said, I was able to easily repro by running just the ev/deadline test in a loop on my windows machine.

@pyrmont pyrmont marked this pull request as ready for review September 17, 2025 07:15
@pyrmont
Copy link
Contributor Author

pyrmont commented Sep 17, 2025

Huzzah! I'm pretty confident I've found the problem!

I have extensively rewritten the OP with more details. In short: it's a timer resolution issue on Windows (the same problem is not present on POSIX because it uses nanosleep and that has nanosecond precision).

@bakpakin
Copy link
Member

This looks good to me, but the fact that a timer issue can cause this is a bit disconcerting. Let's merge this but wait to close #1598

@bakpakin bakpakin merged commit 015e49c into janet-lang:master Sep 19, 2025
13 checks passed
@pyrmont pyrmont deleted the bugfix.avoid-apc-use branch September 19, 2025 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants