This project provides guidance for Infrastructure Reliability Engineers and Managers who are starting an on-call shift or responding to an incident. If you haven't yet, review the Incident Management page in the handbook before reading on.
GitLab Reliability Engineers and Managers provide 24x7 on-call coverage to ensure incidents are responded to promptly and resolved as quickly as possible.
We use PagerDuty to manage our on-call schedule and incident alerting. We currently have two escalation policies: one for Production Incidents and the other for Production Database Assistance. They are staffed by SREs and DBREs, respectively, along with Reliability Engineering Managers.
Currently, rotations are weekly and the day's schedule is split into two 12-hour shifts, with engineers on call as close to daytime hours as their geographical region allows. We hope to hire enough people across timezones that shifts become an 8/8/8-hour split, but we are not staffed sufficiently yet.
When a new engineer joins the team and is ready to start shadowing for an on-call rotation, overrides should be enabled for the relevant on-call hours during that rotation. Once they have completed shadowing and are comfortable/ready to be inserted into the primary rotations, update the membership list for the appropriate schedule to add the new team member.
This PagerDuty forum post was referenced when setting up the blank shadow schedule and initial overrides for onboarding new team members.
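For reference, here is a minimal sketch of creating a shadow override through the PagerDuty REST API. The schedule ID, user ID, API token, and times are placeholders, and the exact payload shape can vary between API versions, so check the PagerDuty API documentation before relying on it:

```python
# Minimal sketch: create an override on a shadow schedule via the PagerDuty REST API.
# SCHEDULE_ID, USER_ID, the token, and the timestamps are placeholders; adjust to your setup.
import requests

PAGERDUTY_API = "https://api.pagerduty.com"
API_TOKEN = "REPLACE_ME"          # a PagerDuty REST API token
SCHEDULE_ID = "PXXXXXX"           # the shadow schedule's ID
USER_ID = "PYYYYYY"               # the new team member's user ID

headers = {
    "Authorization": f"Token token={API_TOKEN}",
    "Accept": "application/vnd.pagerduty+json;version=2",
    "Content-Type": "application/json",
}

# One override covering the shadowing engineer's on-call hours (ISO 8601 timestamps).
payload = {
    "overrides": [
        {
            "start": "2024-01-08T08:00:00Z",
            "end": "2024-01-08T20:00:00Z",
            "user": {"id": USER_ID, "type": "user_reference"},
        }
    ]
}

resp = requests.post(
    f"{PAGERDUTY_API}/schedules/{SCHEDULE_ID}/overrides",
    headers=headers,
    json=payload,
)
resp.raise_for_status()
print(resp.json())
```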
To start off on the right foot, here is a set of tasks that are worth doing before you go any further into your week.
By performing these tasks we keep the broken windows effect under control, preventing future pain and mess.
First, check the on-call issues to familiarize yourself with what has been happening lately. Also, keep an eye on the #production and #incident-management channels for discussion around any ongoing issues.
Start by checking how many alerts are in flight right now (a scripted version of this check is sketched after the list below):
- go to the fleet overview dashboard and check the number of Active Alerts; it should be 0. If it is not 0:
- go to the alerts dashboard and check what is being triggered
- watch the #alerts, #alerts-general, and #alerts-gstg channels for alert notifications; each alert here should point you to the right runbook to fix it.
- if they don't, you have more work to do.
- be sure to create an issue, particularly to declare toil so we can work on it and suppress it.
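If you prefer a terminal check over the dashboard, the sketch below asks Prometheus for currently firing alerts over its HTTP API. The Prometheus URL is a placeholder; point it at whichever instance backs the fleet overview dashboard:

```python
# Minimal sketch: list currently firing alerts via the Prometheus HTTP API.
# PROMETHEUS_URL is a placeholder; point it at the instance backing the dashboards.
import requests

PROMETHEUS_URL = "https://prometheus.example.gitlab.net"

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/alerts", timeout=10)
resp.raise_for_status()

alerts = resp.json()["data"]["alerts"]
firing = [a for a in alerts if a["state"] == "firing"]

print(f"{len(firing)} firing alert(s)")
for alert in firing:
    labels = alert["labels"]
    print(f"- {labels.get('alertname')} severity={labels.get('severity', 'n/a')}")
```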
Next, check how many targets are not being scraped at the moment (again, a scripted check is sketched after this list). To do this:
- go to the fleet overview dashboard and check the number of Targets down. It should be 0. If it is not 0:
- go to the [targets down list] and check which targets are down.
- try to figure out why there are scraping problems and try to fix them. Note that scraping problems are sometimes temporary and caused by exporter errors.
- be sure to create an issue, particularly to declare toil so we can work on it and suppress it.
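The same check can be scripted against the Prometheus targets API; the sketch below lists any scrape targets whose health is not "up" (the Prometheus URL is again a placeholder):

```python
# Minimal sketch: list unhealthy scrape targets via the Prometheus HTTP API.
# PROMETHEUS_URL is a placeholder; point it at the instance backing the dashboards.
import requests

PROMETHEUS_URL = "https://prometheus.example.gitlab.net"

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/targets", timeout=10)
resp.raise_for_status()

targets = resp.json()["data"]["activeTargets"]
down = [t for t in targets if t["health"] != "up"]

print(f"{len(down)} target(s) down")
for target in down:
    print(f"- {target['scrapeUrl']}: {target.get('lastError', 'no error recorded')}")
```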
First: don't panic.
If you are feeling overwhelmed, escalate to the IMOC or CMOC.
Whoever is in that role can help you bring in other people for whatever is needed. Our goal is to resolve the incident in a timely manner, but sometimes that means slowing down and making sure we get the right people involved. Accuracy is as important as speed, if not more so.
Roles for an incident can be found in the incident management section of the handbook.
If you need to start an incident, you can post in the #incident channel (https://gitlab.slack.com/messages/CB7P5CJS1). If you use /start-incident, a bot will create an issue, a Google Doc, and a Zoom link for you.
If you do end up needing to post an update about an incident, we use Status.io.
On Status.io, you can create an incident and tweet, post to Slack, IRC, webhooks, and email via checkboxes when creating or updating the incident.
The incident will also have an affected-infrastructure section where you can pick components of the GitLab.com application and the underlying services/containers, should the incident be caused by a provider.
You can update incidents with the Update Status button on an existing incident; again, you can tweet, etc. from that update.
Remember to close out the incident when the issue is resolved. Also, when possible, add the issue and/or Google Doc to the post-mortem link.
During an incident there are at least two roles, plus one more that is optional:
- Production engineers will:
- Open a war room on Zoom immediately to have a high-bandwidth communication channel.
- Create a Google Doc to gather the timeline of events.
- Publish this document using the File, Publish to web... function.
- Make this document GitLab-editable by clicking on the Share icon and selecting Advanced, Change, then On - GitLab.
- Tweet "GitLab.com is having a major outage, we're working on resolving it in a Google Doc LINK" with a link to this document to make the community aware.
- Redact the names to remove blame. Only use team-member-1, -2, -3, etc.
- Document partial findings and working guesses as we learn.
- Write a post-mortem issue when the incident is solved, and label it with the outage label.
- The point person will:
- Handle updating the @gitlabstatus account explaining what is going on in a simple yet reassuring way.
- Synchronize efforts across the production engineering team.
- Pull other people in when consultation is needed.
- Declare a major outage when we meet the definition.
- Post "@channel, we have a major outage and need help creating a live streaming war room, refer to [runbooks-production-incident]" in the #general Slack channel.
- Post "@channel, we have a major outage and need help reviewing public documents" in the #marketing Slack channel.
- Post "@channel, we have a major outage and are working to solve it, you can find the public doc <here>" in the #devrel Slack channel.
- Move the war room to a paid account so the meeting is not time-limited.
- Coordinate with the security team and the communications manager and use the breach notification policy to determine if a breach of user data has occurred and notify any affected users.
- The communications manager will:
- Set up a Zoom war room that is not time-limited and provide it to the point person so that all the production engineers can move there.
- Set up YouTube Live Streaming in the war room following this Zoom guide (for this you will need access to the GitLab YouTube account; ask someone from People Ops to grant it to you).
- The Marketing representative will:
- Review the Google Doc to provide proper context when needed.
- Include a note in the document about how this outage is impacting customers.
- Decide how to handle further communications once the outage has been handled.
- Is this an emergency incident?
- Are we losing data?
- Is GitLab.com not working or offline?
- Has the incident affected users for greater than 1 hour?
- Tweet in a reassuring but informative way to let people know what's going on.
- Join the #production channel.
- Define a point person or incident owner; this is the person who will gather all the data and coordinate the efforts.
- For emergency incidents, define roles:
- Point person
- In the #production channel: "@here I'm taking point" and pin the message for the duration of the emergency.
- Communications manager
- Marketing representative.
- Start a war room using Zoom
- Share the link in the #production channel
- Stream the Zoom call live: Streaming a Webinar on YouTube Live – Zoom Help Center
- For non-emergency incidents:
- Establish who is the point person on the incident.
- In the #production channel: "@here I'm taking point" and pin the message for the duration of the incident.
- Start a war room using Zoom if it will save time
- Share the link in the #production channel
- Organize:
- If intervention is required (i.e. a non-self-healing service):
- Create a Google Doc to gather the timeline of events.
- Publish this document using the File, Publish to web... function.
- Make this document GitLab editable by clicking on the Share icon and selecting Advanced, Change, then On - GitLab.
- If the point person needs someone to do something, give a direct command: "@someone: please run this command".
- Be sure to stay in sync - if you are going to reboot a service, say so: "I'm bouncing server X".
- If you have conflicting information, stop and think, bounce ideas, escalate
- Gather information when the incident is done - logs, samples of graphs, whatever could help figure out what happened.
- Update the Production Oncall Log
- If we lack monitoring or alerting, open an issue and label it as monitoring, even if you close the issue immediately. See the handbook.
- Keep in mind GitLab's data breach notification policy and work with the security team to determine if a user data breach has occurred and if notification needs to be provided.
- Once the incident is resolved, Tweet an update and let users know the issue is resolved.
- When the lead is away
- Tweeting Guidelines
- Production Incident Communication Strategy
- Database Incidents
- Spend one minute and create an issue for the outage; don't forget the outage label, as specified in the handbook.
- Postgresql
- more postgresql
- PgBouncer
- PostgreSQL High Availability & Failovers
- PostgreSQL switchover
- Read-only Load Balancing
- Add a new secondary replica
- Database backups
- Database backups restore testing
- GitLab Pages returns 404
- HAProxy is missing workers
- Worker's root filesystem is running out of space
- Azure Load Balancers Misbehave
- GitLab registry is down
- Sidekiq stats no longer showing
- Gemnasium is down
- Blocking a project causing high load
- Gitaly error rate is too high
- Gitaly latency is too high
- Sidekiq Queues are out of control
- Workers have huge load because of cat-files
- Test pushing through all the git nodes
- How to gracefully restart gitaly-ruby
- Debugging gitaly with gitaly-debug
- Large number of CI pending builds
- The CI runner manager report a high DO Token Rate Limit usage
- The CI runner manager report a high number of errors
- Runners cache is down
- Runners registry is down
- Runners cache free disk space is less than 20%
- Too many connections on Runner's cache server
- GitLab monitoring overview
- How to add alerts: Alerts manual
- How to add/update deadman switches
- How to silence alerts
- Alert for SSL certificate expiration
- Working with Grafana
- Working with Prometheus
- Upgrade Prometheus and exporters
- Use mtail to capture metrics from logs
- Get the diff between dev versions
- Deploy GitLab.com
- Rollback GitLab.com
- Deploy staging.GitLab.com
- Refresh data on staging.gitlab.com
- Reload unicorn with zero downtime
- How to perform zero downtime frontend host reboot
- Gracefully restart sidekiq jobs
- Start a rails console in the staging environment
- Start a redis console in the staging environment
- Start a psql console in the staging environment
- Force a failover with postgres or redis
- Use aptly
- Disable PackageCloud
- Re-index a package in PackageCloud
- Access hosts in GCP
- Community Project Restoration
- Database Backups and Replication with Encrypted WAL-E
- Work with Azure Snapshots
- Work with GCP Snapshots
- PackageCloud Infrastructure And Recovery
- Isolate a worker by disabling the service in the LBs
- Deny a path in the load balancers
- Purchasing/Renewing SSL Certificates
- Create users, rotate or remove keys from chef
- Update packages manually for a given role
- Rename a node already in Chef
- Reprovisioning nodes
- Speed up chefspec tests
- Manage Chef Cookbooks
- Chef Guidelines
- Chef Vault
- Update GitLab Runner on runners managers
- Investigate Abuse Reports
- Create runners manager for GitLab.com
- Update docker-machine
- CI project namespace check
- Getting Support w/ RackSpace for GCP/GKE
- Create a DO VM for a Service Engineer
- Create VMs in Azure, add disks, etc
- Bootstrap a new VM
- Remove existing node checklist
- How to work with ES
- Elastic Cloud
- ES integration in gitlab
- mapper_parsing_exception errors
- elastic-watcher
- Setup oauth2-proxy protection for web based application
- Register new domain(s)
- Setup and Use my Yubikey
- Purge Git data
- Getting Started with Kubernetes and GitLab.com
- Make it quick - add links for checks
- Don't make me think - write clear guidelines, write expectations
- Recommended structure
- Symptoms - how can I quickly tell that this is what is going on
- Pre-checks - how can I be 100% sure
- Resolution - what do I have to do to fix it
- Post-checks - how can I be 100% sure that it is solved
- Rollback - optional, how can I undo my fix
Please see the contribution guidelines.