Problem:

I should start off by saying that this is for a software house, and it's internal. None of the people involved are "users"; they are all staff.

  • We test on servers, including upgrading existing installations to prove that the upgrade process works etc.
  • People sometimes log into these servers to do testing of changes.
  • They don't put it back to how it is expected to be in production, meaning the environment is now considered "dirty".
    • Checking this is considerably more than "This app is installed correctly." This is much more about ensuring the following, as a subset, match production:
      • network interfaces and routing
      • configuration files
      • packages deployed to the server
      • the scripts on the server
      • VM config
      • disk usage, permissions, locations etc.
      • stuff I've not thought of
  • This costs time and money trying to find out why the action didn't work as expected.
  • For those who've helpfully suggested CI/CD to solve this (which I'm a big fan of and agree with for every other use case) a "wipe and redeploy" takes around 4 hours.

Question:

  • Is there a method of easily verifying that an installation is "as it should be"?
  • Note that things like md5summing the disk aren't going to help, as there are time-dependent files on there.
  • Note that if someone monkeys around with the server, I don't care provided they put it back. Meaning that file timestamps are not going to help with this.

Before I get my hands dirty scripting an endless list of hash checks of all "essential" files (of which I am going to miss at least one or two, and someone will naturally change those and only those ones; all of the angry emojis), is there a better way of doing this? Ideally it would be something I can append to a build or upgrade script that will let me know whether the installation is reliable or not.

Comments:

  • What is your understanding of "as it should be"? The system is configured for a specific purpose, the packages are not changed, the system is up to date...? Commented May 11, 2023 at 14:38
  • Can you give a concrete example? If you configured an app on a server, that app config should not be changed by the users; they should only be able to use it "as is"... Commented May 11, 2023 at 15:07
  • Although it comes with a big learning curve when you start from scratch: you want both your test and production systems to become more like cattle and less like pets. That means introducing automated deployments and centralised configuration management. That makes wiping a test server to remove any and all test artefacts, and re-installing to a particular baseline and consistent desired state, quick and easy. It also lets you prepare for automated upgrades and for achieving a new desired state in your production environments. Commented May 11, 2023 at 15:44
  • Fully agree, you are absolutely right. However, in this case a fresh deploy takes around 4-6 hours, whereas upgrades take around a quarter of that. Commented May 11, 2023 at 15:51
  • Are btrfs/ZFS snapshots something that may be considered? You would merely have to revert to a previous snapshot, which would be a matter of minutes. Commented May 11, 2023 at 22:15

2 Answers

To verify an installation is "as it should be," you could try AIDE. https://aide.github.io/

In my experience it takes a fair amount of configuration to identify files/paths that should be ignored because there are often changes, and you get a lot of false positives before you get everything set up right. But it's definitely better than trying to roll your own system integrity checker, which would run into the same problems.
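
For a sense of what this kind of check boils down to, here is a minimal Python sketch. It is not AIDE and does not use AIDE's configuration syntax; it only illustrates the idea of a baseline that records permissions, ownership and a content hash per file, deliberately ignores timestamps, and skips paths that are expected to change. The `EXCLUDE` patterns and all function names are hypothetical and would need tuning for a real server.

```python
import hashlib
import os
import stat
from fnmatch import fnmatch

# Hypothetical patterns (relative to the scanned root) for paths that
# legitimately change between runs and should not be compared.
EXCLUDE = ["var/log/*", "tmp/*", "*.pid"]


def fingerprint(path):
    """Return [mode, uid, gid, digest] for one path; timestamps are ignored."""
    st = os.lstat(path)
    digest = ""
    if stat.S_ISREG(st.st_mode):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        digest = h.hexdigest()
    elif stat.S_ISLNK(st.st_mode):
        # For symlinks, record the target instead of hashing content.
        digest = "link:" + os.readlink(path)
    return [stat.S_IMODE(st.st_mode), st.st_uid, st.st_gid, digest]


def build_manifest(root, exclude=EXCLUDE):
    """Map each non-excluded relative file path under root to its fingerprint."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if any(fnmatch(rel, pattern) for pattern in exclude):
                continue
            manifest[rel] = fingerprint(full)
    return manifest


def diff_manifests(baseline, current):
    """Return (added, removed, changed) lists of relative paths."""
    added = sorted(set(current) - set(baseline))
    removed = sorted(set(baseline) - set(current))
    changed = sorted(
        p for p in set(baseline) & set(current) if baseline[p] != current[p]
    )
    return added, removed, changed
```

The manifest would be built once on a known-good install, stored (for example as JSON), and diffed at the end of a build or upgrade script. Because timestamps are not recorded, a file that was changed and then put back shows no difference. The hard part is the same as with AIDE: getting the exclusion list right.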

I think that you can rely on the sosreport tool to solve this problem. sosreport is an open-source tool, available in almost any Linux distro, that collects vital diagnostic information such as logs, configuration settings, hardware details and command output from across the system. If you generate a sosreport after a fresh install (or once the server is in its correct state), you can generate a new sosreport after the mess has happened and compare the two reports with a tool like sos-vault (sos-vault.com) or something similar. It will highlight where the problem is so it can be fixed, or even restored from the first sosreport. My two cents :-).
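
If a dedicated comparison tool isn't to hand, the comparison step can be roughed out in Python. This is only a sketch, and it assumes both reports have already been extracted into two directories; it makes no assumptions about what sosreport puts inside them, and the function names are made up for illustration.

```python
import difflib
import filecmp
import os


def list_files(root):
    """Return the set of relative paths of regular files under root."""
    files = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # isfile() is False for broken symlinks, sockets, etc., so skip them.
            if os.path.isfile(full):
                files.add(os.path.relpath(full, root))
    return files


def compare_reports(baseline_dir, current_dir):
    """Compare two extracted report trees by file content, not timestamps."""
    base = list_files(baseline_dir)
    cur = list_files(current_dir)
    differing = sorted(
        rel
        for rel in base & cur
        if not filecmp.cmp(
            os.path.join(baseline_dir, rel),
            os.path.join(current_dir, rel),
            shallow=False,  # force a byte-by-byte comparison
        )
    )
    return {
        "only_in_baseline": sorted(base - cur),
        "only_in_current": sorted(cur - base),
        "differing": differing,
    }


def show_diff(baseline_dir, current_dir, rel):
    """Return a unified diff (as one string) for a file present in both trees."""
    def read_lines(path):
        with open(path, encoding="utf-8", errors="replace") as f:
            return f.readlines()

    return "".join(
        difflib.unified_diff(
            read_lines(os.path.join(baseline_dir, rel)),
            read_lines(os.path.join(current_dir, rel)),
            fromfile="baseline/" + rel,
            tofile="current/" + rel,
        )
    )
```

Expect noise from logs and other time-dependent output in the reports; those paths would need filtering out before the result is useful as a pass/fail check.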
