Understanding Run State of the Software

I created this page to document the thinking I’ve been through as part of https://voluntarily.atlassian.net/browse/VP-1466 .

Crash Reporting

Crash reporting is the first step in understanding what your software is doing.

Currently, the Voluntarily platform has basic crash reporting in place, but more work is needed to improve it. The crash reporting provider is Raygun. We’ve asked Raygun whether they are willing to help us with this, and they are keen to do so.

You can find the dashboard for the Voluntarily Express server here: https://app.raygun.com/crashreporting/215fk5g?#active

If you don’t have access, feel free to ask @Walter Lim or @Andrew Watkins.

Infrastructure Monitoring

The solution uses ECS to run docker containers.

It would be great to flow events from ECS via CloudWatch and into Slack or Service Desk. Docs on that here:
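
Whatever wiring ends up carrying the events, the Slack side of this flow could be sketched roughly as follows. This is a minimal sketch, not something that exists in the platform today. It assumes the events arrive in the EventBridge (CloudWatch Events) "ECS Task State Change" shape and that the result is posted to a Slack incoming webhook, which accepts a JSON body with a `text` field. The function name `ecsEventToSlackMessage` and the choice to alert only on stopped tasks are hypothetical.

```typescript
// Subset of the "ECS Task State Change" event fields used by this sketch.
interface EcsTaskStateChangeEvent {
  "detail-type": string;
  detail: {
    clusterArn: string;
    taskArn: string;
    lastStatus: string;
    stoppedReason?: string;
  };
}

// Body shape accepted by a Slack incoming webhook.
interface SlackMessage {
  text: string;
}

// Hypothetical helper: builds a Slack message for stopped tasks and returns
// null for any event we do not want to alert on.
function ecsEventToSlackMessage(
  event: EcsTaskStateChangeEvent
): SlackMessage | null {
  if (event["detail-type"] !== "ECS Task State Change") return null;
  if (event.detail.lastStatus !== "STOPPED") return null;

  const reason = event.detail.stoppedReason ?? "no reason given";
  return {
    text:
      `ECS task stopped: ${event.detail.taskArn} ` +
      `(cluster ${event.detail.clusterArn}). Reason: ${reason}`,
  };
}
```

A small handler (for example a Lambda subscribed to the event rule) could call this function and POST any non-null result to the webhook for the infrastructure alerts channel.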

Notifications

When an error is received, notifications will flow to the following places:

Source System        Error Type                      Slack Channel
Express Server       Thrown Exceptions / API Level   #v-server-alerts
React App            Not Implemented                 #v-capp-alerts
AWS Infrastructure   Not Implemented                 #v-infra-alerts
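
The routing above could be captured as a small lookup so that each source system resolves to its Slack channel in one place. This is only a sketch. The names `ROUTES` and `channelFor` are hypothetical, and the `implemented` flag is my reading of the "Not Implemented" entries, namely that error reporting from that source is not wired up yet.

```typescript
// Source systems and channels, taken from the notification routing above.
type SourceSystem = "Express Server" | "React App" | "AWS Infrastructure";

interface Route {
  channel: string;
  // Assumption: false where the error type is listed as "Not Implemented".
  implemented: boolean;
}

const ROUTES: Record<SourceSystem, Route> = {
  "Express Server": { channel: "#v-server-alerts", implemented: true },
  "React App": { channel: "#v-capp-alerts", implemented: false },
  "AWS Infrastructure": { channel: "#v-infra-alerts", implemented: false },
};

// Returns the Slack channel for a source system, or null while reporting
// for that source has not been implemented.
function channelFor(source: SourceSystem): string | null {
  const route = ROUTES[source];
  return route.implemented ? route.channel : null;
}
```

Flipping `implemented` to true for a source is then the only change needed once its reporting is in place.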

We also want to define a flow where important notifications get fed into Jira Service Desk so that agents on that desk can act on notifications as they come through.

Statuspage

I’ve created a statuspage since we’re already using the Atlassian suite. I’ve invited @Walter Lim and @Andrew Watkins.

We can integrate with Jira Service Desk here: https://marketplace.atlassian.com/apps/1216079/statuspage-for-jira-service-desk?hosting=cloud&tab=overview

I’ve integrated Statuspage with Slack, posting to the #v-dev channel for now. We can easily add channels as needed via the Statuspage console here: https://manage.statuspage.io/pages/zsfp5d6ygv87/slack