
Commit 302dc52

Author: Ole John Aske (committed)

Patch for bug#21109605
EXCESSIVE NDB KERNEL THREAD IS STUCK IN: PERFORMING SEND; WATCHDOG TERMINATE

It was possible to end up in a livelock on the send_buffer mutex if send buffers became a limiting resource. This could be due to too few send buffers being configured, a slow or failing communication network filling all send buffers, slow receivers not consuming what is sent, and likely other causes.

In this situation (all?) worker threads will fail to allocate send buffer memory for their signals and will attempt a ::forceSend() to free up space. At the same time the send thread will be very busy trying to send to the same node(s). All these threads compete for the send_buffer mutex, which results in a livelock on it. Send threads stalled in this livelock are reported by the watchdog as 'Stuck in Send'.

As 'stuck' send threads also held the global send_thread mutex, they could even block worker threads trying to grab the same mutex while alerting send threads as part of a do_send request. Thus even the worker threads got 'Stuck in Send'.

This patch does two things:

1) Code analysis revealed that the send thread does not need to hold the global send_thread mutex while grabbing the send_buffer mutex. By releasing the global mutex prior to locking the SB-mutex, the *worker threads* will no longer get 'Stuck in Send'.

2) Changed the send threads' locking of the send_buffer mutex to use a trylock. If the try-lock fails, the node to be sent to is re-inserted last into the list of send-nodes so it is retried later. This removes the 'Stuck in Send' condition for the send threads as well.
1 parent 39896fe commit 302dc52

File tree: 1 file changed (+30 −22 lines)
  • storage/ndb/src/kernel/vm/mt.cpp

storage/ndb/src/kernel/vm/mt.cpp

Lines changed: 30 additions & 22 deletions
@@ -1314,8 +1314,8 @@ class thr_send_threads
   /* Get a send thread which isn't awake currently */
   struct thr_send_thread_instance* get_not_awake_send_thread(NodeId node);
 
-  /* Lock send_buffer for this node. */
-  void lock_send_node(NodeId node);
+  /* Try to lock send_buffer for this node. */
+  int trylock_send_node(NodeId node);
 
   /* Perform the actual send to the node, release send_buffer lock.
    * Return 'true' if there are still more to be sent to this node.
@@ -1589,11 +1589,11 @@ check_available_send_data(struct thr_data *not_used)
   return !g_send_threads->data_available();
 }
 
-void
-thr_send_threads::lock_send_node(NodeId node)
+int
+thr_send_threads::trylock_send_node(NodeId node)
 {
   thr_repository::send_buffer *sb = g_thr_repository->m_send_buffers+node;
-  lock(&sb->m_send_lock);
+  return trylock(&sb->m_send_lock);
 }
 
 bool
@@ -1795,30 +1795,38 @@ thr_send_threads::run_send_thread(Uint32 instance_no)
     this_send_thread->m_awake = TRUE;
 
     NodeId node;
-    while ((node = get_node(instance_no)) != 0 && // PENDING -> ACTIVE
-           globalData.theRestartFlag != perform_stop)
+    while (globalData.theRestartFlag != perform_stop &&
+           (node = get_node(instance_no)) != 0)   // PENDING -> ACTIVE
     {
-      this_send_thread->m_watchdog_counter = 6;
-
      /**
       * Multiple send threads can not 'get' the same
-      * node simultaneously. Thus, there will be no lock
-      * contentions between different send threads.
-      * However, lock is required by 'trp_callback protocol',
-      * and to protect reset_send_buffers(), and
-      * lock_/unlock_transporter()
+      * node simultaneously. Thus, we does not need
+      * to keep the global send thread mutex any longer.
+      * Also avoids worker threads blocking on us in
+      * ::alert_send_thread
       */
-      lock_send_node(node);
      NdbMutex_Unlock(send_thread_mutex);
+      this_send_thread->m_watchdog_counter = 6;
 
-      const bool more = perform_send(node, instance_no);
-      /* We return with no locks or mutexes held */
-
-      /* Release chunk-wise to decrease pressure on lock */
-      this_send_thread->m_watchdog_counter = 3;
-      this_send_thread->m_send_buffer_pool.
-        release_chunk(g_thr_repository->m_mm, RG_TRANSPORTER_BUFFERS);
+      /**
+       * Need a lock on the send buffers to protect against
+       * worker thread doing ::forceSend, possible
+       * reset_send_buffers() and/or lock_/unlock_transporter().
+       * To avoid a livelock with ::forceSend() on an overloaded
+       * systems, we 'try-lock', and reinsert the node for
+       * later retry if failed.
+       */
+      bool more = true;
+      if (likely(trylock_send_node(node) == 0))
+      {
+        more = perform_send(node, instance_no);
+        /* We return with no locks or mutexes held */
 
+        /* Release chunk-wise to decrease pressure on lock */
+        this_send_thread->m_watchdog_counter = 3;
+        this_send_thread->m_send_buffer_pool.
+          release_chunk(g_thr_repository->m_mm, RG_TRANSPORTER_BUFFERS);
+      }
      /**
       * Either own perform_send() processing, or external 'alert'
       * could have signaled that there are more sends pending.
