- During normal operation, running “nvpmodel -q --v” caused the system to freeze.
- There is no display output when a monitor is connected over HDMI.
- The network is still up and SSH login is possible; commands such as “ls” and “cat” execute normally.
- Executing the “lspci” command hangs and produces no output.
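A quick way to confirm the lspci hang from the SSH session without tying up the shell (a minimal sketch, assuming the coreutils timeout utility is available):
# Run lspci with a time limit so a hung PCIe config-space read does not block the shell
sudo timeout 10 lspci -vvv; echo "lspci exit status: $?"
# An exit status of 124 means lspci was killed by the timeout, i.e. it never returned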
Time of occurrence and the abnormal log entries:
2025-09-08T15:23:42.442905+08:00 localhost kernel: r8152 2-3.1.4:1.0: Direct firmware load for rtl_nic/rtl8153b-2.fw failed with error -2
2025-09-08T15:59:15.901965+08:00 localhost kernel: r8152 2-3.1.4:1.0: Direct firmware load for rtl_nic/rtl8153b-2.fw failed with error -2
kern.log (5.5 MB)
Hi,
This is a known issue in JP7.0 (r38.2).
A fix will be included in an upcoming release in the next few weeks.
We appreciate your patience in the meantime.
Thanks
Thank you for your reply.
Could you tell us the specific root cause of this issue? Is it related to PCIe?
How can we follow up on and fix this issue?
Hi, David
I see that R38.2.1 has already been released, but it will take us some time to merge the code. Could you give us a separate patch that fixes the desktop freeze issue?
Hi liutee,
Sorry, but we don’t have a separate patch for this issue.
Please use the R38.2.1 release instead.
Thank you for your understanding.
Thanks,
David
Reproduced the desktop freeze issue on the DevKit with the R38.2.1 release.
The test process is as follows:
10-10 20:50:00 Start stress test
10-11 10:20:00 End stress test; system and desktop display verified normal
10-11 10:30:00 - 20:45:00 No operation performed
10-11 20:45:00 Desktop freeze observed
Checking the logs shows the following abnormal entries:
2025-10-11T17:25:55.661249+08:00 localhost kernel: INFO: task nvidia-modeset/:1701 blocked for more than 120 seconds.
2025-10-11T17:25:55.688912+08:00 localhost kernel: INFO: task kworker/u28:0:363362 blocked for more than 120 seconds.
2025-10-11T17:25:55.709145+08:00 localhost kernel: INFO: task kworker/2:0:414595 blocked for more than 120 seconds.
2025-10-11T17:25:55.723242+08:00 localhost kernel: INFO: task kworker/u28:1:414787 blocked for more than 120 seconds.
2025-10-11T17:25:55.751098+08:00 localhost kernel: INFO: task nvpmodel:416304 blocked for more than 120 seconds.
Detailed logs are attached: kern.log
kern.log (1.8 MB)
Without any load, leaving the system idle overnight also reproduces the issue:
2025-10-15T01:47:34.574085+08:00 localhost kernel: INFO: task nvidia-modeset/:1718 blocked for more than 120 seconds.
2025-10-15T01:47:34.580808+08:00 localhost kernel: INFO: task nv_queue:1735 blocked for more than 120 seconds.
2025-10-15T01:47:34.601736+08:00 localhost kernel: INFO: task kworker/u28:1:116948 blocked for more than 120 seconds.
2025-10-15T01:47:34.622316+08:00 localhost kernel: INFO: task kworker/u28:2:123804 blocked for more than 120 seconds.
2025-10-15T01:47:34.643317+08:00 localhost kernel: INFO: task kworker/11:0:128572 blocked for more than 120 seconds.
2025-10-15T01:47:34.663548+08:00 localhost kernel: INFO: task nvpmodel:132778 blocked for more than 120 seconds.
lspci is stuck on reading /sys/bus/pci/devices/0000:01:00.0/config
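One way to see where lspci blocks, for anyone trying to confirm the same symptom (a sketch assuming strace and procps are installed; not necessarily how it was diagnosed above):
# Trace lspci's file accesses to see which read blocks
sudo strace -f -e trace=openat,read lspci
# If the trace stops right after opening /sys/bus/pci/devices/0000:01:00.0/config,
# the hang is in the PCIe config-space read for that device.
# From another shell, the kernel stack of the stuck process can also be inspected
sudo cat /proc/$(pgrep -n lspci)/stack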
Do you have any peripherals connected? We have been running our Thor devkit for many weeks but have not reproduced this behavior.
1. A USB hub for the mouse and keyboard
2. HDMI to a monitor
Could you help test whether the HDMI connection is the key to this issue? Remove it and see if you can still reproduce it.
OK. We will retest without HDMI.
Hi @wangming9
Once you have confirmed whether the issue still reproduces with HDMI disconnected, please try the commands below on the reproducible setup and see if you can still reproduce the issue. Thanks.
echo performance | sudo tee /sys/class/devfreq/gpu*/governor
# confirm new governor is applied
grep "" /sys/class/devfreq/gpu*/governor
Hi, Wayne
The HDMI connection is not the key to this issue. On our own Thor device (not the devkit), we can reproduce the issue without HDMI connected. Later we will also try to reproduce it on the devkit without HDMI connected.
hi,
2025-10-21T00:39:18.775061+08:00 localhost kernel: INFO: task kworker/2:1:128 blocked for more than 120 seconds.
2025-10-21T00:39:18.775126+08:00 localhost kernel: Tainted: G W O 6.8.12-tegra #1
2025-10-21T00:39:18.775134+08:00 localhost kernel: “echo 0 > /proc/sys/kernel/hung_task_timeout_secs” disables this message.
2025-10-21T00:39:18.775137+08:00 localhost kernel: task:kworker/2:1 state:D stack:0 pid:128 tgid:128 ppid:2 flags:0x00000008
2025-10-21T00:39:18.775139+08:00 localhost kernel: Workqueue: pm pm_runtime_work
2025-10-21T00:39:18.775143+08:00 localhost kernel: Call trace:
2025-10-21T00:39:18.775145+08:00 localhost kernel: __switch_to+0xe0/0x110
2025-10-21T00:39:18.775148+08:00 localhost kernel: __schedule+0x368/0xc14
2025-10-21T00:39:18.775151+08:00 localhost kernel: schedule+0x34/0xd8
2025-10-21T00:39:18.775153+08:00 localhost kernel: schedule_preempt_disabled+0x24/0x48
2025-10-21T00:39:18.775156+08:00 localhost kernel: __mutex_lock.constprop.0+0x2dc/0x580
2025-10-21T00:39:18.775158+08:00 localhost kernel: __mutex_lock_slowpath+0x14/0x28
2025-10-21T00:39:18.775161+08:00 localhost kernel: mutex_lock+0x50/0x64
2025-10-21T00:39:18.775164+08:00 localhost kernel: devfreq_monitor_suspend+0x20/0xa8
2025-10-21T00:39:18.775167+08:00 localhost kernel: 0xffffc456ce5e0a9c
2025-10-21T00:39:18.775170+08:00 localhost kernel: devfreq_suspend_device+0x50/0x104
2025-10-21T00:39:18.775172+08:00 localhost kernel: nv_set_gpu_pg_mask+0x42c/0x2ca8 [nvidia]
2025-10-21T00:39:18.775175+08:00 localhost kernel: nvidia_isr_kthread_bh+0x754/0x808 [nvidia]
2025-10-21T00:39:18.775178+08:00 localhost kernel: pci_pm_runtime_suspend+0x54/0x1c0
2025-10-21T00:39:18.775180+08:00 localhost kernel: genpd_runtime_suspend+0xa8/0x25c
2025-10-21T00:39:18.775183+08:00 localhost kernel: __rpm_callback+0x48/0x1d8
2025-10-21T00:39:18.775186+08:00 localhost kernel: rpm_callback+0x74/0x80
2025-10-21T00:39:18.775189+08:00 localhost kernel: rpm_suspend+0x114/0x66c
2025-10-21T00:39:18.775191+08:00 localhost kernel: pm_runtime_work+0xdc/0xe0
2025-10-21T00:39:18.775194+08:00 localhost kernel: process_one_work+0x170/0x424
2025-10-21T00:39:18.775197+08:00 localhost kernel: worker_thread+0x328/0x440
2025-10-21T00:39:18.775199+08:00 localhost kernel: kthread+0x110/0x124
2025-10-21T00:39:18.775202+08:00 localhost kernel: ret_from_fork+0x10/0x20
cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
suspending
The GPU is now stuck in the “suspending” state, which appears to be what causes the desktop freeze, and the devfreq framework is what drives the GPU into that suspend path.
We will try to disable devfreq scaling via “echo performance | sudo tee /sys/class/devfreq/gpu*/governor”.
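For reference, a minimal sketch of that workaround plus one extra diagnostic step, assuming the GPU is the PCI device 0000:01:00.0 shown above; the power/control step is only an assumption for narrowing down the problem, not a confirmed fix:
# Force the performance governor on every GPU devfreq node (disables DVFS scaling)
echo performance | sudo tee /sys/class/devfreq/gpu*/governor
# Verify that the governor change took effect
grep "" /sys/class/devfreq/gpu*/governor
# Diagnostic only: keep the GPU out of runtime suspend entirely, so the
# pci_pm_runtime_suspend path seen in the call trace is never entered
echo on | sudo tee /sys/bus/pci/devices/0000:01:00.0/power/control
# The runtime PM state should then read "active" instead of "suspending"
cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status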
Hi,
If you post status updates, please also include your environment, e.g. whether it is the NV devkit or your own board.
Basically, we would prefer that all issues be reproduced on the NV devkit for now.
Hi, Wayne
All of the logs above were reproduced on the devkit running R38.2.1. However, since we only have a limited number of devkits, we are also trying to reproduce the issue on our own board, and we have noticed something there: our board has two Thor chips running two systems. The base kernels of both systems are aligned with the devkit, yet one system has the kernel thread devfreq_wq while the other does not. What determines whether the devfreq_wq kernel thread exists?
The GPU governors on the two systems are exactly the same, so why does one system have this kernel thread while the other does not? What we do see is that the system which has this kernel thread, and which has not disabled DVFS, reproduces the desktop freeze with noticeably higher probability.
On the devkit, we are currently running the stress test with DVFS disabled and no monitor connected, and so far the issue has not reproduced.
Please share the dmesg from the machine that does not have that thread.
We just double-checked, and the behavior on our own two systems is actually the same: the kernel thread is not always present. It keeps starting and stopping during the first 5 minutes after boot, and after 5 minutes it no longer starts. Our systems run R38.2.0.
On the R38.2.1 devkit, however, this kernel thread keeps starting and stopping continuously, and it still does so after 5 minutes. So the devkit currently reproduces the desktop freeze more easily than our own board does (a simple way to observe this is sketched below).
Please help confirm: when does the devfreq_wq kernel thread stop being started, and why do R38.2.0 and R38.2.1 behave differently?
Below is the dmesg from our board (i.e. R38.2.0).
r38.2.0_dmesg.txt (131.7 KB)
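A minimal way to watch whether the devfreq_wq kthread appears and disappears over time (a sketch using standard procps tools; it is not necessarily the method actually used in this thread):
# Log, once per second, whether a devfreq_wq kernel thread currently exists
while true; do
    printf '%s ' "$(date +%T)"
    ps -eo pid,comm | grep -w devfreq_wq || echo "no devfreq_wq thread"
    sleep 1
done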
What exactly do you mean by it “keeps starting”? How are you observing this?
For now, please only report what you see on the latest release.