5

Our company has dozens of AWS accounts spread across several regions. The dev accounts are particularly bad: over the years, a lot of EC2 instances were spun up for one reason or another and are now either underutilized or not used at all. There are several hundred EC2 instances to clean up. Worse still, we're unable to determine who owns many of them. So I'm after tips, tricks, and tools to help build a picture of what was deployed over the years and to develop a strategy to bring down cloud spend, improve security, etc.

I could employ a whole host of techniques to monitor instances externally (Cost Explorer, CloudTrail/CloudWatch, Route 53, network access) or in-instance with tools, logs, process monitoring, etc. This is likely what we'll have to do to build up a composite picture rather than rely on one particular data point. But what I'm after are practical tips, tricks, tools, and methods you have used in a similar situation. What have you used and would recommend?

4
  • 3
    I think this is mostly a policy issue: Have at least some validation and approval for new instances. It doesn't have to be complex or onerous, just a name, purpose, and expected lifespan. Choose some interval to check these at and require the creators to affirm they are still required, and if they don't affirm the instances get killed. Commented Oct 13 at 16:56
  • 1
    OK, I should have mentioned: I am coming in after the fact and have to do this cleanup retrospectively. Policies and guardrails are great going forward but don't help me with the past, right? Commented Oct 13 at 17:05
  • 2
    Can they be paused as a "soft scream test"? Since the cat is so long out of the bag, the barn doors have been opened so long, etc, you're going to have to do the hardmode work. Start with a full catalog, note the ones you know, then start expanding and improving your knowledge using the tools you've already mentioned. There isn't some magic bullet here. Either the data is in AWS or in your internal records or it isn't. Commented Oct 13 at 17:29
  • First you need to define your criteria for "inactive". Most organizations have multiple sources of data that can provide activity information; it isn't unusual for an organization to have 20 or 30 of them. These could be authentications, backups, endpoint protection, or configuration management. A host or device that shows little or no activity across multiple sources is usually a candidate for being inactive. Also, it's almost standard procedure to fully shut down cloud pre-production environments on weekends, precisely to flush out the unused hosts. Commented 2 days ago
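
To make the "multiple sources" idea from the last comment concrete, here is a minimal sketch of combining per-source last-seen timestamps into a rough label. Everything in it is an assumption for illustration: the source names, the 90-day window, and the 75% threshold are hypothetical and would need tuning to your own data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class InstanceActivity:
    """Last-seen timestamp per data source for one instance (None = never seen)."""
    instance_id: str
    last_seen: dict = field(default_factory=dict)  # source name -> datetime or None


def quiet_sources(activity, now, max_age):
    """Return the sorted names of sources with no activity within max_age of now."""
    quiet = []
    for source, seen in activity.last_seen.items():
        if seen is None or now - seen > max_age:
            quiet.append(source)
    return sorted(quiet)


def classify(activity, now, max_age=timedelta(days=90), min_quiet_fraction=0.75):
    """Label an instance 'likely-inactive', 'active', or 'unknown' (no data at all).

    The window and threshold defaults are illustrative assumptions, not a standard.
    """
    total = len(activity.last_seen)
    if total == 0:
        return "unknown"
    quiet = len(quiet_sources(activity, now, max_age))
    return "likely-inactive" if quiet / total >= min_quiet_fraction else "active"
```

An instance with no data in any source is reported as "unknown" rather than inactive, since missing data usually means a gap in collection rather than proof that nothing is running.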

1 Answer

9

You can start this long and costly investigation with logs, network monitoring, and whatever other tools. But I would do it a different way:

  1. Make an official announcement that you are going to power off those EC2 instances in a month and delete them completely in 3-6 months, and that anyone who needs them should contact you (or a named person).

  2. Inform the support teams so they can get ready for the upcoming incidents.

  3. Prepare a list of actions that an owner needs to take to be compliant with your processes (e.g. register the instance in the CMDB).

  4. Send reminders 1 week and 1 day in advance.

  5. Shut down the instances and wait for the screaming.
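
The timeline in the steps above can be turned into concrete dates with a few lines of Python. This is only a sketch: the function names are hypothetical, "a month" is taken as 30 days, and the deletion date is assumed to be counted from the stop date with a 90-day default, which is one reading of the "3-6 months" above.

```python
from datetime import date, timedelta


def cleanup_schedule(announce_on, notice_days=30, retention_days=90):
    """Return milestone dates for an announce -> remind -> stop -> delete cleanup.

    Assumption: retention_days is counted from the stop date, not the announcement.
    """
    stop_on = announce_on + timedelta(days=notice_days)
    return {
        "announce": announce_on,
        "remind_1_week": stop_on - timedelta(days=7),
        "remind_1_day": stop_on - timedelta(days=1),
        "stop": stop_on,
        "delete": stop_on + timedelta(days=retention_days),
    }


def due_today(schedule, today):
    """Return the names of the milestones that fall on the given day."""
    return [name for name, when in schedule.items() if when == today]
```

Running `due_today` once a day from a scheduled job is enough to drive the reminders; the actual announcement and shutdown steps stay manual or go through whatever change process you already have.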

5
  • 9
    Agreed. "Let's turn it off and see who screams" is often the best approach in such a scenario. Commented Oct 13 at 18:51
  • I’ve done several migrations and similar cleanups and can only agree. It’s easy to identify and get a response from the people and teams who are still actively using resources and feel responsible. But for the remaining stuff: sometimes the only owners you can identify have moved on to different roles and simply don’t respond to whatever urgent emails you send, assuming their (too frequently nonexistent) replacement will respond. A more direct approach (a chat or a call) will often get a much better and more useful response rate from such contacts. Labor intensive, but worth it. Commented 2 days ago
  • 1
    For the remainder: shutting down the resources and waiting for the fallout (if any) is an effective last resort, but it requires a certain amount of management buy-in and preparation, following the change management and stakeholder management procedures that are already in place. The odds are about even between nobody reacting to a shutdown and one of the neglected VMs turning out to run a forgotten but business-critical ”Excel spreadsheet application.” Commented 2 days ago
  • 1
    If you want to be gentle (and safer) it may be better to initially block network rather than a full shutdown. Commented 2 days ago
  • Make sure you talk to your finance department. You don't want the instance that backs up some important data once per fiscal year to be shut down, with no one understanding an error message they get once per year, only to notice in 5 years that the IRS absolutely needs that data. Commented yesterday
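
For the "block network first" suggestion in the comments, one possible sketch is to swap an instance's security groups for an isolation group and record the originals in a tag so the change is easy to undo. The assumptions here: `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`), the instance is in a VPC with a single network interface, and both the isolation group ID and the tag key are hypothetical names you would create yourself.

```python
TAG_KEY = "quarantine-original-sgs"  # hypothetical tag key, chosen for this sketch


def _describe(ec2, instance_id):
    """Fetch the description of a single instance."""
    resp = ec2.describe_instances(InstanceIds=[instance_id])
    return resp["Reservations"][0]["Instances"][0]


def quarantine(ec2, instance_id, quarantine_sg_id, tag_key=TAG_KEY):
    """Replace the instance's security groups with an isolation group.

    The original group IDs are saved in a tag first so release() can restore them.
    """
    instance = _describe(ec2, instance_id)
    original = [g["GroupId"] for g in instance.get("SecurityGroups", [])]
    if original == [quarantine_sg_id]:
        return original  # already quarantined; keep the saved tag untouched
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": tag_key, "Value": ",".join(original)}],
    )
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[quarantine_sg_id])
    return original


def release(ec2, instance_id, tag_key=TAG_KEY):
    """Restore the security groups recorded by quarantine()."""
    instance = _describe(ec2, instance_id)
    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
    original = [sg for sg in tags.get(tag_key, "").split(",") if sg]
    if not original:
        raise ValueError(f"no saved security groups found for {instance_id}")
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=original)
    return original
```

Compared with stopping the instance, this keeps disks, memory state, and any instance-store data intact, so a forgotten owner can be brought back online by calling `release` instead of waiting for a reboot.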
