We have a number of robots installed at various locations, servicing customers. All robots get their instructions from a central cloud database with customer data. Each robot has an SQS queue that delivers the commands it has to execute, and the robots broadcast any events using SNS. Some Lambdas are triggered by these SNS messages and handle them.
Now we want better handling and a better overview of errors occurring on the robots, and better statistics in general.
What we need is:
- Get an alarm when an error happens that requires manual action to recover.
- An overview of which types of errors happen most often.
- Which errors happen before others (i.e. which error has led us to a recovery_error, which needs manual maintenance).
- Overall stats of the performance for a given period:
  - Number of successful sessions
  - Failed sessions caused by user error
  - Failed sessions caused by technical errors
  - Errors where the robot cannot automatically recover and go back to its initial position
All messages have a type attribute, which can be status, warning, error or recovery_error, and a value attribute, which describes the specific status, error, etc.
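To make the requirements concrete, here is a rough Python sketch of the kind of statistics we are after, computed over an in-memory list of events. Only the `type` and `value` attributes come from our real messages; the `robot_id` field, the sample values and the function names are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical sample events, ordered by time. Only "type" and "value" exist
# in our real messages; "robot_id" and the concrete values are invented.
EVENTS = [
    {"robot_id": "r1", "type": "status", "value": "session_started"},
    {"robot_id": "r1", "type": "error", "value": "gripper_jam"},
    {"robot_id": "r2", "type": "warning", "value": "low_battery"},
    {"robot_id": "r1", "type": "recovery_error", "value": "cannot_home"},
    {"robot_id": "r2", "type": "error", "value": "gripper_jam"},
]


def needs_alarm(event):
    """A recovery_error is the case that requires manual action."""
    return event["type"] == "recovery_error"


def most_common_errors(events):
    """Count how often each error value occurs (errors and recovery errors)."""
    return Counter(
        e["value"] for e in events if e["type"] in ("error", "recovery_error")
    )


def errors_before_recovery_error(events):
    """For each recovery_error, record the last plain error on the same robot."""
    last_error = {}  # robot_id -> value of the most recent "error"
    preceding = Counter()
    for e in events:
        if e["type"] == "error":
            last_error[e["robot_id"]] = e["value"]
        elif e["type"] == "recovery_error":
            cause = last_error.get(e["robot_id"], "unknown")
            preceding[(cause, e["value"])] += 1
    return preceding
```

Ideally we would get this kind of output from an existing product rather than maintain code like this ourselves.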
My thought is to have a Lambda that is subscribed to all SNS messages and uploads them to another system, which collects everything and lets us extract the data mentioned above.
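A minimal sketch of what such a Lambda could look like in Python. It assumes that `type` and `value` are fields of a JSON message body (rather than SNS message attributes), and `forward` is a hypothetical placeholder for the upload to whichever system we end up choosing.

```python
import json


def extract_events(sns_event):
    """Pull the robot messages out of an SNS-triggered Lambda event."""
    events = []
    for record in sns_event.get("Records", []):
        # SNS delivers the published payload as a string in Sns.Message.
        # Assumption: our robots publish a JSON object with "type" and "value".
        body = json.loads(record["Sns"]["Message"])
        events.append({"type": body["type"], "value": body["value"]})
    return events


def forward(events):
    """Hypothetical placeholder: here it only writes each event to the log."""
    for event in events:
        print(json.dumps(event))


def handler(event, context):
    """Lambda entry point: extract the robot events and pass them on."""
    events = extract_events(event)
    forward(events)
    return {"forwarded": len(events)}
```

The open question is what `forward` should talk to.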
Which AWS products would you recommend for this? I already looked a little at CloudWatch, but I'm not sure if it can cover our needs.
We have also considered just dumping all SNS messages into a database and running custom queries on the tables. But that sounds like a solution that could quickly require a lot of work on our side as our needs grow.
We'd prefer an off-the-shelf solution and adjust our workflow to it.
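For reference, the do-it-yourself variant would look roughly like the sketch below. It uses an in-memory SQLite database purely as a stand-in for whatever database we would pick; the table layout and the `robot_id` column are assumptions.

```python
import sqlite3


def build_db(rows):
    """Create a throwaway in-memory table and load (robot_id, type, value) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (robot_id TEXT, type TEXT, value TEXT)")
    conn.executemany("INSERT INTO messages VALUES (?, ?, ?)", rows)
    return conn


def error_counts(conn):
    """Return (value, count) pairs for errors, most frequent first."""
    cur = conn.execute(
        "SELECT value, COUNT(*) AS n FROM messages "
        "WHERE type IN ('error', 'recovery_error') "
        "GROUP BY value ORDER BY n DESC, value"
    )
    return cur.fetchall()
```

Every new statistic would mean another hand-written query like `error_counts`, which is the maintenance burden we would like to avoid.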
Thanks in advance for any tips.