
I run my own small server here. It runs Ubuntu 18.04 on a single HDD, with EXT4 on an LVM volume on one partition; LVM is used for taking snapshots. I also use Webmin with Virtualmin for administration.

Over the past weeks I have been faced with some strange problems. I have run this server for many years and never had any serious data loss, except for a few rare cases that were my own mistake.

A few weeks ago I tried to browse to one of my pages and got an error message like "the file system needs cleaning".

OK, I googled it and ran e2fsck on my LVM volume. It found and fixed several errors. Unfortunately, after the repair one of the server's web directories was gone. Thanks to my backup concept I was able to restore all the data.
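For reference, something like the following wrapper is how I would run the check now (the helper name and the optional mounts-file argument are purely illustrative, added so the safety check can be exercised without a real device). It refuses to touch a filesystem that is still mounted, because running e2fsck on a mounted EXT4 volume can itself cause exactly this kind of corruption:

```shell
# Illustrative helper, not my exact commands. The second argument (a
# mounts file) defaults to /proc/mounts and exists only for testing.
safe_fsck() {
  dev=$1
  mounts=${2:-/proc/mounts}
  if grep -q "^$dev " "$mounts"; then
    echo "refusing: $dev is mounted"
    return 1
  fi
  # -f forces a full check even if the fs looks clean; -D rebuilds
  # indexed directory trees (what the kernel warnings ask for).
  e2fsck -f -D "$dev"
}
```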

The server was up and running again... A few weeks later my WordPress instance was breached through a bad plugin; I got the wp-tmp.php malware: https://stackoverflow.com/questions/52897669/what-can-do-virus-wp-tmp-php-on-wordpress

After detecting the breach, I changed all relevant passwords and moved the whole folder out of reach of the web... Since every web project is assigned its own account on the server, I hope that this script (which showed some JavaScript to users) was not able to do much damage...

One week later I noticed that another directory (belonging to another user) was completely missing. Running e2fsck again, there were also errors about missing or corrupted inodes that needed to be fixed.

Now I am asking myself the following questions:

  1. What can cause such significant EXT4 data loss?
  2. Could it be related to the fact that I take an LVM snapshot every midnight and back it up to an external drive? (I have read about problems using LVM snapshots while an HDD write cache is enabled.)
  3. Are there monitoring tools for this kind of behavior? I would like to be able to trace everything that happened before the files were lost or the EXT4 filesystem became corrupted... Is there anything like that?
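Regarding question 3, the direction I have been considering is the kernel audit subsystem: a recursive watch on the web directories would log every write, deletion, and attribute change under a searchable key. A sketch of such a rule (the path and key name are placeholders for my setup):

```
# /etc/audit/rules.d/web-watch.rules -- path and key are placeholders.
# Watches the tree for writes and attribute changes (which includes
# deletions); query later with: ausearch -k web-files
-w /home -p wa -k web-files
```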

Thank you!

---- Update: 05.10.2020 -------

Here is a syslog excerpt:

Oct 1 00:00:09 dtbsrv1 kernel: [565918.456000] EXT4-fs (dm-3): 9 orphan inodes deleted
Oct 1 00:00:09 dtbsrv1 kernel: [565918.456001] EXT4-fs (dm-3): recovery complete
Oct 1 00:00:09 dtbsrv1 kernel: [565918.743753] EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: (null)
Oct 1 21:11:54 dtbsrv1 kernel: [642222.440081] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.
Oct 1 21:11:54 dtbsrv1 kernel: [642222.440085] EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: block 25: comm php-fpm7.2: Directory block failed checksum
Oct 1 21:11:54 dtbsrv1 kernel: [642222.686629] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.
Oct 1 21:11:54 dtbsrv1 kernel: [642222.686631] EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: block 25: comm php-fpm7.2: Directory block failed checksum
Oct 1 21:37:01 dtbsrv1 kernel: [643730.020412] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.
Oct 1 21:37:01 dtbsrv1 kernel: [643730.020416] EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: block 24: comm php-fpm7.2: Directory block failed checksum
Oct 1 21:37:02 dtbsrv1 kernel: [643730.244533] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.
Oct 1 21:37:02 dtbsrv1 kernel: [643730.244537] EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: block 24: comm php-fpm7.2: Directory block failed checksum
Oct 1 22:57:24 dtbsrv1 kernel: [648552.977881] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.
Oct 1 22:57:24 dtbsrv1 kernel: [648552.977885] EXT4-fs error (device dm-0): ext4_dx_find_entry:1525: inode #19925296: block 1297: comm php-fpm7.2: Directory block failed checksum
Oct 1 22:57:25 dtbsrv1 kernel: [648553.463297] EXT4-fs warning (device dm-0): ext4_dirent_csum_verify:367: inode #19925296: comm php-fpm7.2: No space for directory leaf checksum. Please run e2fsck -D.

These messages occurred without any particular preceding event.

Here are the SMART results:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   071   061   006    Pre-fail  Always       -       72097400
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       51
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   088   060   045    Pre-fail  Always       -       571428862
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       15825 (102 53 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       51
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   059   049   040    Old_age   Always       -       41 (Min/Max 39/41)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       33
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       95
194 Temperature_Celsius     0x0022   041   051   000    Old_age   Always       -       41 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       15824 (80 28 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       16720897520
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       17531397406
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

It's a Seagate HDD, so Raw_Read_Error_Rate and Seek_Error_Rate use a special raw value format. See here: https://forums.unraid.net/topic/31038-solved-seagate-with-huge-seek-error-rate-rma/
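If my reading of that thread is right, the 48-bit raw value packs an error count into the upper bits and an operation count into the lower 32 bits, so both of my scary-looking numbers actually decode to zero errors. A quick sanity check (this encoding is assumed from the forum thread, not from vendor documentation):

```shell
# Assumed Seagate encoding for attributes 1 and 7 (per the linked forum
# thread): upper bits = error count, lower 32 bits = operations sampled.
decode_seagate_raw() {
  raw=$1
  errors=$(( raw >> 32 ))
  ops=$(( raw & 0xFFFFFFFF ))
  echo "$errors errors in $ops operations"
}

decode_seagate_raw 72097400    # Raw_Read_Error_Rate -> 0 errors in 72097400 operations
decode_seagate_raw 571428862   # Seek_Error_Rate     -> 0 errors in 571428862 operations
```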

Here is the df -h output:

[srvadmin@dtbsrv1 ~]# df -h
df: /mnt/restic: Transport endpoint is not connected
Filesystem                         Size  Used Avail Use% Mounted on
udev                               3.9G     0  3.9G   0% /dev
tmpfs                              786M  3.7M  782M   1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv  590G  264G  297G  48% /
tmpfs                              3.9G  4.0K  3.9G   1% /dev/shm
tmpfs                              5.0M     0  5.0M   0% /run/lock
tmpfs                              3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/loop0                          97M   97M     0 100% /snap/core/9804
/dev/loop1                          98M   98M     0 100% /snap/core/9993
/dev/sda2                          976M  212M  698M  24% /boot
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/a404062f5a9eef43425c25238a9f4f82a144d94046ac9addace7e3c70c4934e4/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/f1a6f65efc0ef172471ff367da1a35a9d7debbcd75229653730a51b7fa30d38e/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/7c47723744bbdcfe8a2c809cdea9bb52f5fcb17ed22c81e37f37d205776c6237/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/24fa352fb9497e097754e41dfa22fce703a2067e5668bb692310e3485fa7e106/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/2c2355d6e3d1eabfb5b0db7a2a85c34b2a5a3056bcc8b574ec3eda2f55549c0b/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/53335cdd36121c2cb6a28a5bf6e287d6a24501b13eb901f361eeb49f5ff229cd/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/6f97bdc49212ebb61ea263c27fff544dfb9345bdc12e5f212796d5183e250368/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/07bb3f1692502da5d202c70abb48c9fbdc8388804170403302d607eacf44c8e1/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/14ee0758a510385a6ed6dd97c538fea275ff68d5c20c15cb1a4638c2e3b3b243/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/cd8f7f02d0aad720d5171cd013471d69fabda2b60bdbd0563c5db9f71f2e90cb/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/59aafbca3b59180e04d992eabfee852c6cbf6d68f9508418985a890a8af3ee62/merged
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/1c95f755df040e8be68bf556ae2edd06dbc88d4474b1e924adbe9c4572e49679/merged
shm                                 64M     0   64M   0% /var/lib/docker/containers/ba4fcf7feee3a2abccf70fb28cf938800df307e9a607574568a67f61bb0e29f8/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/12987bacd86368ec8b55cad121609e0e5495b2de98a189446ea835327708265b/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/32784c91bf4662a6b395faab020590a401e38dc7c02271fcfa983a0bcad3c9b5/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/8ff1625130226700ef069e228041e1322169fc2146d33b27e1593d81c3c08e6b/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/5ba9140321d72192a4ed4b888cee2859261c3dfb49c3e57f60c163f030425f5e/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/086463232e2bfe3af088bf123a8f6a9768558bbd9ae2498fbdfbd6f6a3e03894/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/5dbce8c1c3456e214d626b0c21be241d936f79153ddc00874350ec3446d904a5/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/abc27059eea71599c6a4068c6365c7aae156c69aaba7ed1fc8f44ec405715f60/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/edf45d83317d69ffc5909b0e3222d82f25bba6917d5851d573e47517975f3efc/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/e802192919f47fb359a726aa1590faf7cb7fbe6340111899ea623f00fdf05e62/mounts/shm
shm                                 64M     0   64M   0% /var/lib/docker/containers/eb4f624f4a0d3ab646d6c03cf5eafb9e2833554ac5ed6319c370babd7ea96957/mounts/shm
shm                                 64M  8.0K   64M   1% /var/lib/docker/containers/2fa8ffce0fe4580d6733e939beec08118a9e4bdbfd695485d7631c9e006b3ddc/mounts/shm
overlay                            590G  264G  297G  48% /var/lib/docker/overlay2/835ed57e1c092dd1b8431618b4ae472f67cf2b867dd3bed987864cff4ddc87e7/merged
shm                                 64M     0   64M   0% /var/lib/docker/containers/b42dfff9ab9b971f175a3d0f8878731be1bea838f580f273aadef7c534f82b73/mounts/shm

Any ideas from your side?

Thanks!

---- Update 2: 05.10.2020 ----

It looks like someone had a similar issue: https://discourse.osmc.tv/t/ext4-dirent-csum-verify-no-space-for-directory-leaf-checksum/75772/15

The solution there was a defective SATA-to-USB adapter. In my case that would mean defective onboard SATA hardware. Could it be "that simple"?

  • Do what the error message advised you to do. Commented Oct 5, 2020 at 14:22
  • I have done that several times... and a few hours or a day later I get the same message again Commented Oct 5, 2020 at 16:58
  • Are you quite sure you used -D? You didn't mention doing this anywhere in your post. Commented Oct 5, 2020 at 17:11

2 Answers


Possibly you have multiple independent problems at once: a compromise of your applications, failing storage, or something else.

Carefully read How do I deal with a compromised server? Hope is not a plan. In general, the only way to be sure malware is gone is to completely wipe the system and reinstall the OS and applications from known-good copies. For example, re-download good copies of all your WordPress plugins. And change all passwords and other credentials. Be very sure you know the root cause and the extent of the infection before settling for less.

Regarding storage, check the disk's health attributes, such as with smartctl. At any indication of serious wear, replace the disk. Even on a single disk (no array), LVM allows you to migrate to a new disk with pvmove. A widely deployed file system like ext4 is well tested, but it relies on the storage hardware underneath, which will eventually fail. Alternatively, it is possible that malware altered or deleted data; the extent of that infection does not sound like it was conclusively established.
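As a rough sketch, the handful of attributes most worth alerting on can be filtered straight out of the smartctl output. Here the filter is fed sample lines taken from your report; in real use you would pipe `smartctl -A /dev/sda` into the same awk:

```shell
# Filter the attributes that most directly indicate failing media.
# Field $2 is the attribute name, field $10 the raw value.
awk '$2 ~ /Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count/ {
       print $2 "=" $10
     }' <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
EOF
```

All four raw values being zero is a good sign for the media itself, which is consistent with your report.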

Review your backup copies and check whether you can see the state before and after these events. Read the logs to see whether the kernel printed anything interesting about storage to syslog or the journal. This may not prove anything conclusively; a lot happens to files that is never logged or included in backups.

Should you want better security and integrity tooling, you will have to evaluate the options yourself. There are whole categories of file integrity monitoring software, which either audit changes or verify the integrity of files. WordPress has its own specialized security software and professional consulting, if that is something you wish to purchase.
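The lowest-tech form of file integrity monitoring is a checksum baseline plus a later diff; dedicated tools (AIDE, Tripwire, and friends) do essentially the same with a database and a policy. A sketch, using a throwaway directory in place of a real web root:

```shell
# Baseline-and-diff integrity check. A temp dir stands in for a real
# web root such as /var/www; swap in your actual path.
dir=$(mktemp -d)
echo 'hello' > "$dir/index.html"
find "$dir" -type f -exec sha256sum {} + | sort -k2 > "$dir.baseline"

# ... later, after (here: simulated) tampering:
echo 'tampered' > "$dir/index.html"
if find "$dir" -type f -exec sha256sum {} + | sort -k2 \
     | diff -q "$dir.baseline" - >/dev/null; then
  echo "no changes"
else
  echo "changes detected"
fi
```

This tells you *that* something changed, not *who* changed it; for the latter you need auditing at the kernel level.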

  • Thanks for your answer. I have updated my original post in the meantime. It looks like there is some serious ext4 problem... Commented Oct 5, 2020 at 9:40

I have found the solution:

I figured out that a specific user on the LVM logical volume had reached their quota limit. This was the root cause of all these problems... The fsck runs were unsuccessful, or only helped for a few hours, because the quota filled up again...

It looks like reaching the quota limit can break the EXT4 structure...
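For anyone hitting the same thing: a user pinned at their limit shows up directly in the quota report. A small sketch of the comparison (usernames and numbers are invented); in practice you would feed it lines derived from repquota output for the logical volume:

```shell
# Flag users whose used blocks have reached their hard limit. Input
# lines are "user used soft hard" (in blocks); the sample users below
# are made up. In practice derive the lines from repquota -u output.
flag_over_quota() {
  awk 'NF == 4 && $4 > 0 && $2 >= $4 {
         print $1 " is at hard limit (" $2 "/" $4 " blocks)"
       }'
}

printf '%s\n' \
  'webuser1 1048576 900000 1048576' \
  'webuser2   52000 900000 1048576' | flag_over_quota
```

Only webuser1 is reported, since its usage equals the hard limit.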

Good to know for future reference...
