[10:46:34] Traffic, Analytics-Kanban, Operations, User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3032576 (elukey)
[10:54:26] Traffic, Analytics-Kanban, Operations, User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3032616 (elukey) ``` elukey@oxygen:/srv/log/webrequest$ grep piwik archive/5xx.json-20170216 | jq -r '[.http_status,.dt]| @csv' | awk -F":" '{print $1}'| sort | u...
[13:25:22] netops, Analytics-Kanban, Operations: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3032910 (elukey) After running `tcpdump ip6` on a couple of hosts I realized that the puppet agent contacts puppetmaster1001 via IPv6. I added a special term called `puppet` to `analyt...
[13:25:49] so we've had the first two criticals for the expiry mailbox check
[13:26:07] 04:06 < icinga-wm> PROBLEM - Check Varnish expiry mailbox lag on cp3040 is CRITICAL: CRITICAL: expiry mailbox lag is 28355
[13:26:15] 14:16 < icinga-wm> PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 11747
[13:26:53] time is CET, so that would be 03:06 and 13:16 UTC respectively
[13:27:52] the first one recovered after one minute, the second one after 7 minutes
[13:28:31] so yeah, the good news is: 1) the check seems to work fine 2) it's not excessively spammy
[13:31:26] cp1074's backend has been running for 4 days, cp3040's was restarted 5 hours ago
[13:35:16] there's now a warning for cp1074, lag is 3362
[13:35:29] keeping an eye on the 503s there
[13:40:18] yeah I think so long as they eventually recover and don't cause 503s, there's no reason to even warn
[13:40:40] I think I've seen values run up into maybe 40-50k and recover without 503s before
[13:40:48] right
[13:40:53] but then there's the other pattern where they just run off and never recover
[13:41:08] cp1074 is interesting at the moment, it managed to recover and go to 0 lag, now it grew back to 22k
[13:41:17] the other thing is, it doesn't build up very fast. if something's watching, there's tons of time to catch it before it's an issue
[13:41:19] no 503s yet
[13:42:01] so, still subject to some thought about tuneables (I'm not really sure where the threshold should be!)...
[13:42:30] but perhaps we set warn at something reasonable but not horrible, say 25k or 40k
[13:42:49] and crit at... whatever that is.... 50-75k-ish, maybe?
[13:43:07] but the key thing I'm thinking is, maybe we make the condition persist longer for crit
[13:43:12] that seems reasonable
[13:43:22] e.g. check once every 10 minutes, and require failing like 10/10 before alerting
[13:43:36] so that self-resolving issues don't alert
[13:43:42] and it should still catch real issues in plenty of time
[13:46:58] actually in a case like this, a warning threshold might not even make sense
[13:47:10] given the purpose of WARNING is to show up when you look at that page in icinga, but not spam IRC
[13:47:32] a crit threshold that has failed e.g. 5/10 will show up there too and then vanish when it resolves itself, or eventually alert IRC when it gets to 10
[13:48:07] I guess it can't hurt either way, but it can be viewed like that. it's not the abs value we care about, it's that it stays "high" for an unreasonably long time (at which point we expect it to not recover)
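(Editor's aside: for reference, a minimal sketch of what an NRPE check along the lines discussed above could look like. The Varnish counter names (MAIN.exp_mailed / MAIN.exp_received) and the default thresholds are assumptions for illustration, not necessarily what the deployed check does.)

```bash
#!/bin/bash
# Sketch of an NRPE-style check for Varnish expiry mailbox lag.
# Assumption: the lag can be derived as MAIN.exp_mailed - MAIN.exp_received
# (Varnish 4 counters); WARN/CRIT defaults are placeholders, not settled values.
WARN=${1:-40000}
CRIT=${2:-75000}

mailed=$(varnishstat -1 -f MAIN.exp_mailed | awk '{print $2}')
received=$(varnishstat -1 -f MAIN.exp_received | awk '{print $2}')
lag=$((mailed - received))

if [ "$lag" -ge "$CRIT" ]; then
    echo "CRITICAL: expiry mailbox lag is $lag"; exit 2
elif [ "$lag" -ge "$WARN" ]; then
    echo "WARNING: expiry mailbox lag is $lag"; exit 1
fi
echo "OK: expiry mailbox lag is $lag"; exit 0
```

(The "fail 10/10 before alerting" persistence discussed above would then live on the Icinga side, via the service's check_interval / retry_interval / max_check_attempts, rather than in the check script itself.)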
[13:51:47] good point, we'll notice the crit staying on the icinga page for a while anyways, so the warn seems superfluous
[13:52:31] interesting how long it takes to recover BTW
[13:52:34] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?var-server=cp1067&var-datasource=eqiad%20prometheus%2Fops&from=1487089375715&to=1487099508490&panelId=21&fullscreen
[13:52:50] this was cp1067's crazy lag leading to errors
[13:53:00] 17:34 is when we depooled it
[13:53:38] yeah I suspect there's some algorithmic deficiency there
[13:53:53] where, when we remove traffic, it doesn't manage to take as good advantage of the idleness to catch up as it could
[13:54:11] it's probably got some fixed values in there so that it only does so much recovery work per time interval, or some such thing
[13:54:50] (probably because the prioritization that keeps recovery from overwhelming ongoing traffic isn't perfect, assuming heavy traffic)
[13:58:47] https://github.com/wikimedia/operations-puppet/blob/production/modules/nrpe/manifests/monitor_service.pp#L21
[13:59:01] that's not true right? It should be in minutes AFAIK
[13:59:50] no, I think that's the timeout for icinga contacting the NRPE plugin
[14:00:06] ah!
[14:00:07] I think what you're looking for is the (undocumented) check_interval + retry_interval
[14:00:16] which are probably in minutes and seem to default to 1
[14:01:27] I don't even remember what retry interval is. it could logically be when "timeout" happens (comms failure / timeout), or it could be a different interval to use if the last check came back as crit?
[14:06:35] FYI, I'm upgrading cp1008 to 1.1.0e in a few minutes
[14:06:51] https://docs.icinga.com/latest/en/checkscheduling.html#serviceoptions
[14:09:58] moritzm: thanks :)
[14:15:55] https://www.ssllabs.com/ssltest/analyze.html?d=pinkunicorn.wikimedia.org&s=208.80.154.42 looks all fine, uploading to install1002 next
[14:27:20] yeah so we can upgrade that seamlessly
[14:27:39] upgrade the package on the hosts, then "service nginx upgrade" on whatever machines/clusters until done, no depools or downtimes
[14:27:57] just uploaded, could you or ema take care of cp*, I'll upgrade the nginx terminators on mw*
[14:28:05] yup
[20:47:32] netops, Operations, ops-ulsfo: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#3034504 (RobH) Ok, I took a redundant supply from cp4007 and installed it into lvs4002 power supply 2 slot. Less than a minute later, the system killed the new power supply. Record: 1022 Da...
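(Editor's aside: going back to the 14:27 nginx exchange, a minimal sketch of the rolling upgrade described there. The host list, package name, and the use of plain ssh are assumptions; the point from the log is that "service nginx upgrade" swaps in the new binary in place, so no depools or downtimes are needed.)

```bash
#!/bin/bash
# Illustrative rolling upgrade loop; in practice this would be driven
# per-cluster by whatever orchestration tooling is at hand.
for host in cp1008 cp3040; do
    # upgrade the package, then do the hot binary swap: a new nginx master
    # takes over the listeners while the old workers drain their connections
    ssh "$host" 'sudo apt-get install -y --only-upgrade nginx-full && sudo service nginx upgrade'
done
```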