
Monitoring revamp

Context

The Tiki community manages several Tiki-powered sites (and a few non-Tiki sites, but this is not that important for this page). Ref:

All these sites are on various servers and are managed by various people. When a site goes down or performs poorly, it is usually not easy to find the root cause and solve it. The main problem is that we are somewhat blind.

We do have a Zabbix server, but as of 2021-07-19, it reports a lot of noise. We know we need to move to better servers. And for some messages, what are we supposed to do? See below:

Friday, 16 July 2021
(04:21) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTP service is down on server.promo.suite.wiki
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:
Item values:
1. HTTP service is running (server.promo.suite.wiki:net.tcp.servicehttp): Down (0)
Original event ID: 109333
(04:25) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: HTTP service is down on server.promo.suite.wiki
Trigger status: OK
Trigger severity: High
Trigger URL:
Item values:
1. HTTP service is running (server.promo.suite.wiki:net.tcp.servicehttp): Up (1)
Original event ID: 109333
(04:25) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: HTTPS service is down on server.promo.suite.wiki
Trigger status: OK
Trigger severity: High
Trigger URL:
Item values:
1. HTTPS service is running (server.promo.suite.wiki:net.tcp.servicehttps): Up (1)
Original event ID: 109334
(16:27) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: Processor load is too high on server.promo.suite.wiki
Trigger status: PROBLEM
Trigger severity: Warning
Trigger URL:
Item values:
1. Processor load (1 min average per core) (server.promo.suite.wiki:system.cpu.loadpercpu,avg1): 4.48
Original event ID: 109349
(16:31) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: Processor load is too high on server.promo.suite.wiki
Trigger status: OK
Trigger severity: Warning
Trigger URL:
Item values:
1. Processor load (1 min average per core) (server.promo.suite.wiki:system.cpu.loadpercpu,avg1): 4.27
Original event ID: 109349
Saturday, 17 July 2021
(02:08) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: Free disk space is less than 20% on volume /
Trigger status: PROBLEM
Trigger severity: Warning
Trigger URL:
Item values:
1. Free disk space on / (percentage) (server.promo.suite.wiki:vfs.fs.size/,pfree): 20 %
Original event ID: 109355
(02:11) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: Free disk space is less than 20% on volume /
Trigger status: OK
Trigger severity: Warning
Trigger URL:
Item values:
1. Free disk space on / (percentage) (server.promo.suite.wiki:vfs.fs.size/,pfree): 20 %
Original event ID: 109355
(04:22) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTP service is down on server.promo.suite.wiki
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:
Item values:
1. HTTP service is running (server.promo.suite.wiki:net.tcp.servicehttp): Down (0)
Original event ID: 109358
(04:22) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTPS service is down on server.promo.suite.wiki
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:
Item values:
1. HTTPS service is running (server.promo.suite.wiki:net.tcp.servicehttps): Down (0)
Original event ID: 109359
(04:26) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: HTTPS service is down on server.promo.suite.wiki
Trigger status: OK
Trigger severity: High
Trigger URL:
Item values:
1. HTTPS service is running (server.promo.suite.wiki:net.tcp.servicehttps): Up (1)
Original event ID: 109359
(21:44) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: Too many processes on tiki.suite.wiki
Trigger status: PROBLEM
Trigger severity: Warning
Trigger URL:
Item values:
1. Number of processes (tiki.suite.wiki:proc.num[]): 389
Original event ID: 109373
(21:44) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: Too many processes on tiki.suite.wiki
Trigger status: OK
Trigger severity: Warning
Trigger URL:
Item values:
1. Number of processes (tiki.suite.wiki:proc.num[]): 389
Original event ID: 109373
Monday, 19 July 2021
(00:18) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTP service is down on tiki.suite.wiki
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:
Item values:
1. HTTP service is running (tiki.suite.wiki:net.tcp.servicehttp): Down (0)
Original event ID: 109398
(00:18) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTPS service is down on tiki.suite.wiki
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:
Item values:
1. HTTPS service is running (tiki.suite.wiki:net.tcp.servicehttps): Down (0)
Original event ID: 109399
(04:22) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTPS service is down on server.promo.suite.wiki
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:
Item values:
1. HTTPS service is running (server.promo.suite.wiki:net.tcp.servicehttps): Down (0)
Original event ID: 109401
(04:22) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTP service is down on server.promo.suite.wiki
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:
Item values:
1. HTTP service is running (server.promo.suite.wiki:net.tcp.servicehttp): Down (0)
Original event ID: 109400
(04:22) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: HTTPS service is down on server.promo.suite.wiki
Trigger status: OK
Trigger severity: High
Trigger URL:
Item values:
1. HTTPS service is running (server.promo.suite.wiki:net.tcp.servicehttps): Up (1)
Original event ID: 109401
(04:23) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: HTTP service is down on server.promo.suite.wiki
Trigger status: OK
Trigger severity: High
Trigger URL:
Item values:
1. HTTP service is running (server.promo.suite.wiki:net.tcp.servicehttp): Up (1)
Original event ID: 109400
(05:34) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: HTTPS service is down on tiki.suite.wiki
Trigger status: OK
Trigger severity: High
Trigger URL:
Item values:
1. HTTPS service is running (tiki.suite.wiki:net.tcp.servicehttps): Up (1)
Original event ID: 109399
(05:37) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: Too many processes running on tiki.suite.wiki
Trigger status: PROBLEM
Trigger severity: Warning
Trigger URL:
Item values:
1. Number of running processes (tiki.suite.wiki:proc.num,,run): 51
Original event ID: 109406
(05:39) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: Processor load is too high on tiki.suite.wiki
Trigger status: PROBLEM
Trigger severity: Warning
Trigger URL:
Item values:
1. Processor load (1 min average per core) (tiki.suite.wiki:system.cpu.loadpercpu,avg1): 9.3
Original event ID: 109407
(05:45) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: Processor load is too high on tiki.suite.wiki
Trigger status: OK
Trigger severity: Warning
Trigger URL:
Item values:
1. Processor load (1 min average per core) (tiki.suite.wiki:system.cpu.loadpercpu,avg1): 1.506667
Original event ID: 109407
(06:42) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: Host information was changed on tiki.suite.wiki
Trigger status: PROBLEM
Trigger severity: Information
Trigger URL:
Item values:
1. System information (tiki.suite.wiki:system.uname): Linux tiki.suite.wiki 5.12.2-x86_64-linode144 #1 SMP Mon May 10 13:10:23 EDT 2021 x86_64
Original event ID: 109414
(07:41) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: Host information was changed on tiki.suite.wiki
Trigger status: OK
Trigger severity: Information
Trigger URL:
Item values:
1. System information (tiki.suite.wiki:system.uname): Linux tiki.suite.wiki 5.12.2-x86_64-linode144 #1 SMP Mon May 10 13:10:23 EDT 2021 x86_64
Original event ID: 109414
(14:49) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: Free disk space is less than 20% on volume /
Trigger status: PROBLEM
Trigger severity: Warning
Trigger URL:
Item values:
1. Free disk space on / (percentage) (server.promo.suite.wiki:vfs.fs.size/,pfree): 20 %
Original event ID: 109416



Is it:

  • A server issue?
  • A network issue?
  • A Tiki bug?
  • A Tiki misconfiguration?
  • A spike in traffic?
  • A PHP bug? We struggled for over a year with random server crashes, and in the end, the cause was this: https://bugs.php.net/bug.php?id=71135


Different issues require different skill sets to resolve.
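
To put a number on the noise in the Zabbix excerpt above, one option is to pair each PROBLEM notification with its matching OK by "Original event ID" and measure how long the problem lasted. The following is a minimal sketch, assuming the notifications are available as plain text in exactly the format shown above and that day names are in English; the function name flap_durations and the SAMPLE text are illustrative only.

```python
import re
from datetime import datetime

# Line patterns taken from the notification format shown in the excerpt.
DAY = re.compile(r"^\w+, \d{1,2} \w+ \d{4}$")            # "Friday, 16 July 2021"
HEADER = re.compile(r"^\((\d{2}):(\d{2})\) \S+ at \S+: (PROBLEM|OK)$")
TRIGGER = re.compile(r"^Trigger: (.+)$")
EVENT = re.compile(r"^Original event ID: (\d+)$")


def flap_durations(log_text):
    """Return {event_id: (trigger, minutes)} for each PROBLEM later followed by an OK."""
    opened = {}    # event_id -> (trigger, time of the PROBLEM notification)
    durations = {}
    day = when = status = trigger = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if DAY.match(line):
            # Assumes English day and month names (default C locale).
            day = datetime.strptime(line, "%A, %d %B %Y")
        elif (m := HEADER.match(line)):
            # Assumes a day line always precedes the first notification.
            when = day.replace(hour=int(m[1]), minute=int(m[2]))
            status = m[3]
        elif (m := TRIGGER.match(line)):
            trigger = m[1]
        elif (m := EVENT.match(line)):
            event_id = m[1]
            if status == "PROBLEM":
                opened[event_id] = (trigger, when)
            elif event_id in opened:
                name, start = opened.pop(event_id)
                minutes = int((when - start).total_seconds() // 60)
                durations[event_id] = (name, minutes)
            # An OK without a recorded PROBLEM is ignored.
    return durations


# First PROBLEM/OK pair of the excerpt, shortened to the lines the parser reads.
SAMPLE = """\
Friday, 16 July 2021
(04:21) bot.zabbix at diablo.montefuscolo.com.br: PROBLEM
---
Trigger: HTTP service is down on server.promo.suite.wiki
Trigger status: PROBLEM
Original event ID: 109333
(04:25) bot.zabbix at diablo.montefuscolo.com.br: OK
---
Trigger: HTTP service is down on server.promo.suite.wiki
Trigger status: OK
Original event ID: 109333
"""

if __name__ == "__main__":
    for event_id, (name, minutes) in flap_durations(SAMPLE).items():
        print(event_id, name, f"{minutes} min")
```

Problems that resolve themselves within a few minutes, such as event 109333 (04:21 to 04:25), are candidates for a longer trigger delay or a lower severity, while an OK with no matching PROBLEM in the excerpt, such as event 109334, is skipped.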

We need better tools and processes. These will help:

  1. GlitchTip
    • Let's add a trigger on slow pages
  2. The upcoming Virtualmin + Debian 10 infrastructure: https://gitlab.com/wikisuite/virtualmin-installer
  3. Real User Measurement
  4. A monitoring solution that informs the right people in real time. We have selected Zabbix and NetData, and we need better integration with Tiki, while keeping it open-ended so that it can integrate with any monitoring system.

Steps

1.1.1. Set up Zabbix (Fabio) and NetData (Horia)

  • Sending alerts to XMPP room: xmpp:monitoring@conference.wikisuite.chat
  • With generic server monitoring (disk space, CPU, etc.)
  • Is OPcache covered? The goal is to avoid this message in Tiki: "Little memory available. Thrashing likely to occur. The values to increase are apc.shm_size (for APC), xcache.size (for XCache) or opcache.memory_consumption (for OPcache)." If Zabbix doesn't cover this well, we should add a Tiki-specific alert.
  • Use NetData in Virtualmin
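
The OPcache question above comes down to a threshold check on free shared memory. Here is a minimal sketch of such a check, assuming the monitoring side receives a dictionary shaped like the "memory_usage" part of PHP's opcache_get_status() result (used_memory, free_memory, wasted_memory, in bytes). How that data is exported from PHP, the function names, and the 10% threshold are all assumptions for illustration; this is not the threshold Tiki itself uses.

```python
def opcache_free_ratio(status):
    """Fraction of OPcache shared memory still free (0.0 to 1.0)."""
    mem = status["memory_usage"]
    total = mem["used_memory"] + mem["free_memory"] + mem["wasted_memory"]
    return mem["free_memory"] / total if total else 0.0


def opcache_alert(status, min_free=0.10):
    """Return an alert message when free memory drops below min_free, else None.

    min_free is a hypothetical threshold chosen for this sketch.
    """
    ratio = opcache_free_ratio(status)
    if ratio < min_free:
        return (
            f"OPcache has only {ratio:.1%} memory free; "
            "consider raising opcache.memory_consumption"
        )
    return None


if __name__ == "__main__":
    # Made-up numbers: a 128 MiB cache with 8 MiB free.
    mib = 1024 * 1024
    example = {
        "memory_usage": {
            "used_memory": 118 * mib,
            "free_memory": 8 * mib,
            "wasted_memory": 2 * mib,
        }
    }
    print(opcache_alert(example))
```

Keeping the ratio and the alert decision in separate functions lets the same ratio be sent to Zabbix or NetData as a plain metric, so that thresholds can live in the monitoring system rather than in code.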


1.1.2. Improve Tiki code to provide Tiki-specific alerts to Zabbix and NetData (help needed)
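
One possible way to push a Tiki-specific value to Zabbix is the zabbix_sender command-line tool, which submits a value for an item of type "Zabbix trapper". The sketch below only builds the command line and does not run it. It is an illustration, not the existing code mentioned on this page; the server name, host name, and item key are hypothetical placeholders.

```python
import shlex


def zabbix_sender_command(server, host, key, value):
    """Build the argument list for a zabbix_sender call.

    -z: Zabbix server, -s: host name as registered in Zabbix,
    -k: item key, -o: value to send.
    """
    return ["zabbix_sender", "-z", server, "-s", host, "-k", key, "-o", str(value)]


if __name__ == "__main__":
    # All names below are placeholders, not real infrastructure.
    cmd = zabbix_sender_command(
        server="zabbix.example.org",
        host="tiki.example.org",
        key="tiki.opcache.free_ratio",  # hypothetical trapper item key
        value=0.0625,
    )
    # Printed for review; pass the list to subprocess.run() to actually send.
    print(shlex.join(cmd))
```

Building the command as a list rather than a string avoids shell quoting problems when a value contains spaces. NetData would need a separate mechanism, which is not covered here.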




Here is some code, which needs a review and a revamp:


1.1.3. Improve Tiki Manager code to provide specific alerts to Zabbix and NetData (help needed)


1.1.4. Real User Measurement

  • Real User Measurement
  • tiki-performance-stats.php will provide a list of slowest pages, on which we can focus our energy.
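
The idea of a list of slowest pages can be sketched as a simple aggregation over timing samples. The following is a minimal example, assuming each sample is a (URL, load time in milliseconds) pair; it does not reflect how tiki-performance-stats.php stores or computes its data, and the function name and sample values are made up.

```python
from collections import defaultdict
from statistics import mean


def slowest_pages(samples, top=3):
    """Rank pages by mean load time.

    samples: iterable of (url, load_time_ms) pairs.
    Returns up to `top` tuples (url, mean_ms, sample_count), slowest first.
    """
    by_url = defaultdict(list)
    for url, load_time_ms in samples:
        by_url[url].append(load_time_ms)
    ranked = sorted(
        ((url, mean(times), len(times)) for url, times in by_url.items()),
        key=lambda row: row[1],
        reverse=True,
    )
    return ranked[:top]


if __name__ == "__main__":
    # Made-up measurements for illustration.
    samples = [
        ("tiki-index.php", 200),
        ("tiki-index.php", 400),
        ("tiki-admin.php", 1000),
    ]
    for url, mean_ms, count in slowest_pages(samples):
        print(f"{url}: {mean_ms} ms over {count} samples")
```

Reporting the sample count next to the mean helps avoid chasing a page that was slow only once; a percentile could replace the mean if outliers turn out to dominate.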