Monitoring revamp
Context
The Tiki community manages several Tiki-powered sites (and a few non-Tiki sites, but this is not that important for this page). Ref:
All these sites are on various servers, and managed by various people. When one site goes down or performs poorly, it is usually not easy to find the root cause and solve it. The main problem is that we are somewhat blind.
We do have a Zabbix server, but as of 2021-07-19, it reports a lot of noise. We know we need to move to better servers. And for some messages, what are we supposed to do? See below:
Friday, 16 July 2021
(04:21) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTP service is down on server.promo.suite.wiki
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:
Item values:
1. HTTP service is running (server.promo.suite.wiki:net.tcp.servicehttp): Down (0)
Original event ID: 109333
(04:25) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: HTTP service is down on server.promo.suite.wiki
Trigger status: OK
Trigger severity: High
Trigger URL:
Item values:
1. HTTP service is running (server.promo.suite.wiki:net.tcp.servicehttp): Up (1)
Original event ID: 109333
(04:25) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: HTTPS service is down on server.promo.suite.wiki
Trigger status: OK
Trigger severity: High
Trigger URL:
Item values:
1. HTTPS service is running (server.promo.suite.wiki:net.tcp.servicehttps): Up (1)
Original event ID: 109334
(16:27) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: Processor load is too high on server.promo.suite.wiki
Trigger status: PROBLEM
Trigger severity: Warning
Trigger URL:
Item values:
1. Processor load (1 min average per core) (server.promo.suite.wiki:system.cpu.loadpercpu,avg1): 4.48
Original event ID: 109349
(16:31) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: Processor load is too high on server.promo.suite.wiki
Trigger status: OK
Trigger severity: Warning
Trigger URL:
Item values:
1. Processor load (1 min average per core) (server.promo.suite.wiki:system.cpu.loadpercpu,avg1): 4.27
Original event ID: 109349
Saturday, 17 July 2021
(02:08) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: Free disk space is less than 20% on volume /
Trigger status: PROBLEM
Trigger severity: Warning
Trigger URL:
Item values:
1. Free disk space on / (percentage) (server.promo.suite.wiki:vfs.fs.size/,pfree): 20 %
Original event ID: 109355
(02:11) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: Free disk space is less than 20% on volume /
Trigger status: OK
Trigger severity: Warning
Trigger URL:
Item values:
1. Free disk space on / (percentage) (server.promo.suite.wiki:vfs.fs.size/,pfree): 20 %
Original event ID: 109355
(04:22) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTP service is down on server.promo.suite.wiki
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:
Item values:
1. HTTP service is running (server.promo.suite.wiki:net.tcp.servicehttp): Down (0)
Original event ID: 109358
(04:22) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTPS service is down on server.promo.suite.wiki
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:
Item values:
1. HTTPS service is running (server.promo.suite.wiki:net.tcp.servicehttps): Down (0)
Original event ID: 109359
(04:26) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: HTTPS service is down on server.promo.suite.wiki
Trigger status: OK
Trigger severity: High
Trigger URL:
Item values:
1. HTTPS service is running (server.promo.suite.wiki:net.tcp.servicehttps): Up (1)
Original event ID: 109359
(21:44) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: Too many processes on tiki.suite.wiki
Trigger status: PROBLEM
Trigger severity: Warning
Trigger URL:
Item values:
1. Number of processes (tiki.suite.wiki:proc.num[]): 389
Original event ID: 109373
(21:44) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: Too many processes on tiki.suite.wiki
Trigger status: OK
Trigger severity: Warning
Trigger URL:
Item values:
1. Number of processes (tiki.suite.wiki:proc.num[]): 389
Original event ID: 109373
Monday, 19 July 2021
(00:18) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTP service is down on tiki.suite.wiki
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:
Item values:
1. HTTP service is running (tiki.suite.wiki:net.tcp.servicehttp): Down (0)
Original event ID: 109398
(00:18) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTPS service is down on tiki.suite.wiki
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:
Item values:
1. HTTPS service is running (tiki.suite.wiki:net.tcp.servicehttps): Down (0)
Original event ID: 109399
(04:22) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTPS service is down on server.promo.suite.wiki
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:
Item values:
1. HTTPS service is running (server.promo.suite.wiki:net.tcp.servicehttps): Down (0)
Original event ID: 109401
(04:22) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTP service is down on server.promo.suite.wiki
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:
Item values:
1. HTTP service is running (server.promo.suite.wiki:net.tcp.servicehttp): Down (0)
Original event ID: 109400
(04:22) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: HTTPS service is down on server.promo.suite.wiki
Trigger status: OK
Trigger severity: High
Trigger URL:
Item values:
1. HTTPS service is running (server.promo.suite.wiki:net.tcp.servicehttps): Up (1)
Original event ID: 109401
(04:23) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: HTTP service is down on server.promo.suite.wiki
Trigger status: OK
Trigger severity: High
Trigger URL:
Item values:
1. HTTP service is running (server.promo.suite.wiki:net.tcp.servicehttp): Up (1)
Original event ID: 109400
(05:34) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: HTTPS service is down on tiki.suite.wiki
Trigger status: OK
Trigger severity: High
Trigger URL:
Item values:
1. HTTPS service is running (tiki.suite.wiki:net.tcp.servicehttps): Up (1)
Original event ID: 109399
(05:37) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: Too many processes running on tiki.suite.wiki
Trigger status: PROBLEM
Trigger severity: Warning
Trigger URL:
Item values:
1. Number of running processes (tiki.suite.wiki:proc.num,,run): 51
Original event ID: 109406
(05:39) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: Processor load is too high on tiki.suite.wiki
Trigger status: PROBLEM
Trigger severity: Warning
Trigger URL:
Item values:
1. Processor load (1 min average per core) (tiki.suite.wiki:system.cpu.loadpercpu,avg1): 9.3
Original event ID: 109407
(05:45) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: Processor load is too high on tiki.suite.wiki
Trigger status: OK
Trigger severity: Warning
Trigger URL:
Item values:
1. Processor load (1 min average per core) (tiki.suite.wiki:system.cpu.loadpercpu,avg1): 1.506667
Original event ID: 109407
(06:42) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: Host information was changed on tiki.suite.wiki
Trigger status: PROBLEM
Trigger severity: Information
Trigger URL:
Item values:
1. System information (tiki.suite.wiki:system.uname): Linux tiki.suite.wiki 5.12.2-x86_64-linode144 #1 SMP Mon May 10 13:10:23 EDT 2021 x86_64
Original event ID: 109414
(07:41) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: Host information was changed on tiki.suite.wiki
Trigger status: OK
Trigger severity: Information
Trigger URL:
Item values:
1. System information (tiki.suite.wiki:system.uname): Linux tiki.suite.wiki 5.12.2-x86_64-linode144 #1 SMP Mon May 10 13:10:23 EDT 2021 x86_64
Original event ID: 109414
(14:49) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: Free disk space is less than 20% on volume /
Trigger status: PROBLEM
Trigger severity: Warning
Trigger URL:
Item values:
1. Free disk space on / (percentage) (server.promo.suite.wiki:vfs.fs.size/,pfree): 20 %
Original event ID: 109416
Is it:
- A server issue?
- A network issue?
- A Tiki bug?
- A Tiki misconfiguration?
- A spike in traffic?
- A PHP bug? We struggled for over a year for random server crashes, and in the end, it was this: https://bugs.php.net/bug.php?id=71135
Different issues require different skill sets to resolve.
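To get a first feel for how noisy the alerts above are, the PROBLEM and OK messages can be paired by their "Original event ID". The following is only a minimal sketch in Python, assuming the chat log is saved as plain text in exactly the layout shown above (a day header line, then one message per alert). The function name resolved_durations and the shortened SAMPLE are illustrative, not existing Tiki or Zabbix code.

```python
import re
from datetime import datetime

# Line shapes taken from the chat log above.
DAY = re.compile(r"^[A-Z][a-z]+day, \d{1,2} [A-Z][a-z]+ \d{4}$")
HEADER = re.compile(r"^\((\d{2}:\d{2})\) .+: (PROBLEM|OK)$")
EVENT = re.compile(r"^Original event ID: (\d+)$")


def resolved_durations(log_text):
    """Return (event_id, trigger, minutes) for each PROBLEM that later got an OK.

    Assumes a day header line always precedes the first message.
    """
    day = stamp = status = trigger = None
    opened = {}      # event ID -> time of the PROBLEM message
    durations = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if DAY.match(line):
            day = datetime.strptime(line, "%A, %d %B %Y")
        elif m := HEADER.match(line):
            hh, mm = m[1].split(":")
            stamp = day.replace(hour=int(hh), minute=int(mm))
            status = m[2]
        elif line.startswith("Trigger: "):
            trigger = line[len("Trigger: "):]
        elif m := EVENT.match(line):
            event_id = m[1]
            if status == "PROBLEM":
                opened[event_id] = stamp
            elif status == "OK" and event_id in opened:
                delta = stamp - opened.pop(event_id)
                durations.append((event_id, trigger, int(delta.total_seconds() // 60)))
    return durations


# Two events copied from the log above, with the unused fields left out.
SAMPLE = """\
Friday, 16 July 2021
(04:21) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTP service is down on server.promo.suite.wiki
Original event ID: 109333
(04:25) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: HTTP service is down on server.promo.suite.wiki
Original event ID: 109333
Monday, 19 July 2021
(00:18) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTPS service is down on tiki.suite.wiki
Original event ID: 109399
(05:34) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: HTTPS service is down on tiki.suite.wiki
Original event ID: 109399
"""

for event_id, trigger, minutes in resolved_durations(SAMPLE):
    print(f"{event_id}: {trigger} ({minutes} min)")
```

Short-lived events (a few minutes) that recur at the same time every day point to something scheduled, while a multi-hour outage calls for a different kind of investigation. This split does not answer the questions above, but it helps decide where to look first.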
We need better tools and processes. These will help:
- GlitchTip
- Let's add a trigger on slow pages
- The upcoming Virtualmin + Debian 10 infrastructure: https://gitlab.com/wikisuite/virtualmin-installer
- Real User Measurement
- A monitoring solution that informs the right people in real time. We have selected Zabbix and NetData, and we need better integration with Tiki, while keeping it open-ended so that any monitoring system can be integrated.
Steps
1.1.1. Set up Zabbix (Fabio) and NetData (Horia)
- Sending alerts to XMPP room: xmpp:monitoring@conference.wikisuite.chat
- With generic server monitoring (disk space, CPU, etc.)
- Is OPcache covered? This would let us avoid this message in Tiki: "Little memory available. Thrashing likely to occur. The values to increase are apc.shm_size (for APC), xcache.size (for XCache) or opcache.memory_consumption (for OPcache)." If Zabbix doesn't cover this well, we should add a Tiki-specific alert.
- Use NetData in Virtualmin
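Part of the noise in generic server monitoring comes from values hovering around a threshold: in the log above, the "Free disk space is less than 20%" trigger fires and clears while the reported value stays at 20 %. One common way to reduce this kind of flapping is hysteresis, where the alert clears only once the value has recovered past a second, higher threshold. The sketch below illustrates the idea in Python. It is not Zabbix or NetData configuration, and the 25% recovery threshold and the function names are assumptions chosen for illustration.

```python
import shutil


def free_percent(path="/"):
    """Percentage of free space on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def next_state(alerting, pfree, warn_below=20.0, recover_at=25.0):
    """Return True if the alert should be active after seeing `pfree`.

    Once alerting, stay in that state until free space is back at or above
    `recover_at` (assumed value); otherwise alert only below `warn_below`.
    """
    if alerting:
        return pfree < recover_at
    return pfree < warn_below


def count_transitions(readings, warn_below=20.0, recover_at=25.0):
    """Count how many PROBLEM/OK messages a series of readings would produce."""
    alerting = False
    transitions = 0
    for pfree in readings:
        new_state = next_state(alerting, pfree, warn_below, recover_at)
        if new_state != alerting:
            transitions += 1
        alerting = new_state
    return transitions


# Invented readings that hover around 20% before space is really freed.
readings = [21.0, 19.9, 20.1, 19.8, 20.2, 26.0]
print(count_transitions(readings, recover_at=20.0))  # single threshold: 4 messages
print(count_transitions(readings, recover_at=25.0))  # with hysteresis: 2 messages
print(f"free on /: {free_percent('/'):.1f}%")
```

With a single threshold the same underlying situation produces four messages; with a separate recovery threshold it produces one PROBLEM and one OK.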
1.1.2. Improve Tiki code to provide Tiki-specific alerts to Zabbix and NetData (help needed)
- Last successful index rebuild is more than 3 days old
- Site closed because of too much traffic
- Some checks in tiki-check.php are failing
- An account was locked by a brute force attack
- Installer is unlocked
- Over 200 files stored in the database
Here is some code, which needs a review and a revamp:
- https://gitlab.com/tikiwiki/tiki/-/blob/master/tiki-monitor.php
- https://gitlab.com/tikiwiki/tiki/-/blob/master/doc/devtools/check_tiki.php
- https://gitlab.com/tikiwiki/tiki/-/blob/master/doc/devtools/check_tiki-new.php
- https://gitlab.com/tikiwiki/tiki/-/blob/master/tiki-check.php (has some Zabbix code)
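To make the Tiki-specific alerts in this step more concrete, here is a minimal sketch of how three of them could be evaluated from a snapshot of site state. It is written in Python only to show the logic; it is not taken from the files linked above, and the SiteState fields, the alert names and the way the values would be collected from Tiki are all hypothetical. The thresholds (3 days, 200 files) come from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SiteState:
    """Hypothetical snapshot of the values a Tiki site would have to expose."""
    last_index_rebuild: datetime
    installer_locked: bool
    files_in_database: int


def tiki_alerts(state, now):
    """Return the names of the alerts (hypothetical names) that currently apply."""
    alerts = []
    if now - state.last_index_rebuild > timedelta(days=3):
        alerts.append("index_rebuild_stale")
    if not state.installer_locked:
        alerts.append("installer_unlocked")
    if state.files_in_database > 200:
        alerts.append("too_many_files_in_database")
    return alerts


# Invented example values.
now = datetime(2021, 7, 19, 12, 0)
state = SiteState(
    last_index_rebuild=datetime(2021, 7, 15, 3, 0),
    installer_locked=True,
    files_in_database=250,
)
print(tiki_alerts(state, now))
```

Keeping the checks as plain data in, list of alert names out, independent of any particular monitoring system, is one way to stay open-ended: a thin adapter per system (Zabbix, NetData or another) would then only need to translate the list.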
1.1.3. Improve Tiki Manager code to provide specific alerts to Zabbix and NetData (help needed)
- instance:watch via Zabbix/NetData instead of via email
- Tiki is missing an update that includes a security fix. This will require a way to identify releases that have security fixes.
- Warn if an automatic backup or update fails
1.1.4. Real User Measurement
- Real User Measurement
- tiki-performance-stats.php will provide a list of slowest pages, on which we can focus our energy.
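As an illustration of what a "slowest pages" list involves, the sketch below ranks pages by their median load time. It is a Python sketch under stated assumptions, not a description of how tiki-performance-stats.php works: the input format (pairs of page and load time in milliseconds), the page names and the choice of the median are all invented for the example.

```python
from collections import defaultdict
from statistics import median


def slowest_pages(samples, top=3):
    """Rank pages by median load time, slowest first.

    `samples` is an iterable of (page, load_time_ms) pairs, one per page view.
    """
    by_page = defaultdict(list)
    for page, load_ms in samples:
        by_page[page].append(load_ms)
    ranked = sorted(
        ((median(times), page) for page, times in by_page.items()),
        reverse=True,
    )
    return [(page, med) for med, page in ranked[:top]]


# Invented measurements for three hypothetical pages.
samples = [
    ("/home", 120), ("/home", 180),
    ("/search", 900), ("/search", 1100), ("/search", 1000),
    ("/gallery", 400),
]
for page, med in slowest_pages(samples, top=2):
    print(f"{page}: {med} ms")
```

The median is used here so that a single unusually slow view does not push an otherwise fast page to the top of the list; a high percentile would be a reasonable alternative if worst-case experience matters more.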