Understanding the Run State of the Software
I created this page to document my thinking while working on https://voluntarily.atlassian.net/browse/VP-1466.
Crash Reporting
Crash reporting is the first step in understanding what your software is doing.
The Voluntarily platform currently has basic crash reporting in place, but more work is needed to improve it. Our crash reporting provider is Raygun. We've asked Raygun whether they would be willing to help with this, and they are keen to do so.
The Raygun dashboard for the Voluntarily Express server is here: https://app.raygun.com/crashreporting/215fk5g?#active
If you don’t have access, feel free to ask @Walter Lim or @Andrew Watkins.
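The sketch below shows where API-level crash reporting hooks into an Express server. It assumes the server registers an error-handling middleware after its routes. `CrashReporter` is a hypothetical interface standing in for the Raygun client or any other provider. The request and response types are trimmed-down stand-ins for Express's own types, so the example runs without the `express` package installed.

```typescript
// Hypothetical interface standing in for the Raygun client (or any crash reporter).
interface CrashReporter {
  send(error: Error, context: Record<string, unknown>): void;
}

// Minimal shapes for the parts of Express's req/res/next used here.
interface RequestLike {
  method: string;
  url: string;
}
interface ResponseLike {
  headersSent: boolean;
  status(code: number): ResponseLike;
  json(body: unknown): void;
}
type NextLike = (err?: unknown) => void;

// Express treats middleware with four parameters as an error handler.
function crashReportingMiddleware(reporter: CrashReporter) {
  return (err: unknown, req: RequestLike, res: ResponseLike, next: NextLike): void => {
    // Non-Error values can be thrown in JavaScript; normalise them first.
    const error = err instanceof Error ? err : new Error(String(err));
    // Report before responding, so the crash is recorded even if the response fails.
    reporter.send(error, { method: req.method, url: req.url });
    if (res.headersSent) {
      // A response is already streaming: defer to Express's default handler.
      next(err);
      return;
    }
    res.status(500).json({ error: "Internal Server Error" });
  };
}
```

In the real server, this would be registered with `app.use(crashReportingMiddleware(reporter))` after all routes. Express 4 only routes errors that are thrown synchronously in a handler or passed to `next(err)`. Rejected promises from `async` handlers must therefore be forwarded explicitly, or they will never reach the reporter.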
Infrastructure Monitoring
The platform runs its Docker containers on Amazon ECS.
We would like ECS events to flow through CloudWatch into Slack or Jira Service Desk. The relevant AWS documentation is here:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_monitoring.html
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch-metrics.html
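For the Slack leg, an EventBridge rule can match ECS task state change events and invoke a small function that formats them for Slack. EventBridge is the successor to CloudWatch Events. The sketch below shows only the formatting step and assumes events with the detail type `ECS Task State Change`. The interfaces declare only the fields the function uses. `toSlackMessage` is a hypothetical name, and alerting only on stopped tasks is an assumption about what is worth posting. The returned object can be POSTed as JSON to a Slack incoming-webhook URL.

```typescript
// Subset of an "ECS Task State Change" event as delivered by EventBridge.
interface EcsContainer {
  name: string;
  exitCode?: number;
}

interface EcsTaskStateChangeEvent {
  "detail-type": string;
  source: string;
  region: string;
  detail: {
    clusterArn: string;
    taskArn: string;
    group?: string;
    lastStatus: string;
    stoppedReason?: string;
    containers: EcsContainer[];
  };
}

// Payload shape accepted by a Slack incoming webhook.
interface SlackMessage {
  text: string;
}

// Format an ECS task event for Slack, or return null if it isn't worth posting.
// Assumption: only tasks that have reached STOPPED are reported.
function toSlackMessage(event: EcsTaskStateChangeEvent): SlackMessage | null {
  if (event.source !== "aws.ecs" || event["detail-type"] !== "ECS Task State Change") {
    return null;
  }
  const d = event.detail;
  if (d.lastStatus !== "STOPPED") {
    return null;
  }
  // The last path segment of the ARN is the short cluster name / task id.
  const cluster = d.clusterArn.split("/").pop() ?? d.clusterArn;
  const taskId = d.taskArn.split("/").pop() ?? d.taskArn;
  // Only list containers that exited with a non-zero code.
  const failed = d.containers.filter((c) => c.exitCode !== undefined && c.exitCode !== 0);
  const lines = [
    `:warning: ECS task ${taskId} stopped in cluster *${cluster}* (${event.region})`,
    `Group: ${d.group ?? "unknown"}`,
    `Reason: ${d.stoppedReason ?? "not given"}`,
    ...failed.map((c) => `Container ${c.name} exited with code ${c.exitCode}`),
  ];
  return { text: lines.join("\n") };
}
```

Filtering in the function rather than in the rule keeps the rule simple. The trade-off is that the function is invoked for every task transition, including routine PENDING and RUNNING changes.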
Notifications
When an error is received, notifications flow to the following places:
Source System | Error Type | Slack Channel |
---|---|---|
Express Server | Thrown Exceptions / API Level | |
React App | Not Implemented | |
AWS Infrastructure | Not Implemented | |
We also want to define a flow that feeds important notifications into Jira Service Desk, so that agents on the desk can act on them as they arrive.
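The routing in the notifications table, plus escalation to Jira Service Desk, could be expressed as a small pure function. This is a minimal sketch that assumes a hypothetical severity scale in which "important" means `critical`. The source identifiers, `route`, and `toServiceDeskRequest` are illustrative names. Slack channel names are left as configuration because the table doesn't assign them yet. The request body follows the shape of the Jira Service Management Cloud `POST /rest/servicedeskapi/request` endpoint. The `serviceDeskId` and `requestTypeId` values would need to be looked up for the actual desk.

```typescript
// Source systems from the notifications table (identifiers are illustrative).
type SourceSystem = "express-server" | "react-app" | "aws-infrastructure";

// Hypothetical severity scale; "important" is taken to mean "critical".
type Severity = "info" | "warning" | "critical";

interface AppNotification {
  source: SourceSystem;
  severity: Severity;
  summary: string;
  details: string;
}

interface RoutingConfig {
  // Slack channel per source system; entries stay unset until a channel is chosen.
  slackChannels: Partial<Record<SourceSystem, string>>;
  // Severities that should also raise a Jira Service Desk request.
  serviceDeskSeverities: Severity[];
}

type Destination = { kind: "slack"; channel: string } | { kind: "service-desk" };

// Decide every destination a notification should be sent to.
function route(n: AppNotification, config: RoutingConfig): Destination[] {
  const destinations: Destination[] = [];
  const channel = config.slackChannels[n.source];
  if (channel !== undefined) {
    destinations.push({ kind: "slack", channel });
  }
  if (config.serviceDeskSeverities.some((s) => s === n.severity)) {
    destinations.push({ kind: "service-desk" });
  }
  return destinations;
}

// Body for Jira Service Management's create-request endpoint.
interface ServiceDeskRequestBody {
  serviceDeskId: string;
  requestTypeId: string;
  requestFieldValues: { summary: string; description: string };
}

function toServiceDeskRequest(
  n: AppNotification,
  serviceDeskId: string,
  requestTypeId: string,
): ServiceDeskRequestBody {
  return {
    serviceDeskId,
    requestTypeId,
    requestFieldValues: {
      // Prefix with the source so agents can see where the error came from.
      summary: `[${n.source}] ${n.summary}`,
      description: n.details,
    },
  };
}
```

Because the routing lives in data, filling in the Slack Channel column later becomes a configuration change rather than a code change.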
Statuspage
I've set up Statuspage, since we're already using the Atlassian suite, and invited @Walter Lim and @Andrew Watkins.
Statuspage can be integrated with Jira Service Desk using this Marketplace app: https://marketplace.atlassian.com/apps/1216079/statuspage-for-jira-service-desk?hosting=cloud&tab=overview
For now, I've connected Statuspage to Slack, posting to the #v-dev channel. More channels can be added as needed from the Statuspage console: https://manage.statuspage.io/pages/zsfp5d6ygv87/slack