[08:24:28] there are a lot of "WARNING: opcache free space is below 100 MB" in Icinga
[08:32:52] dunno if it should still be a warning if 7 hosts are alerting? maybe _joe_ ^ ?
[08:33:14] <_joe_> yes XioNoX I'm writing a runbook for that *right now* :D
[08:33:34] haha perfect!
[08:33:35] <_joe_> those alerts are all from codfw (also warnings) so it's not /that/ urgent
[08:58:03] XioNoX: there was some network instability yesterday evening
[08:58:36] jynus: ah?
[08:58:42] did you see that, and could you help us see if it could happen again? It happened at peak traffic, so maybe there is some saturation?
[09:00:34] approximately between 18 and 23 UTC
[09:00:36] jynus: do you have links or pointers?
[09:00:47] what were the symptoms?
[09:00:50] the initial pointer was user complaints
[09:00:58] which I also experienced myself
[09:01:04] timeouts, high latency
[09:01:30] monitoring corroborated that with several alerts: https://grafana.wikimedia.org/d/000000366/network-performances-global?panelId=21&fullscreen&orgId=1&from=now-24h&to=now
[09:01:33] did anyone run some mtr?
[09:02:18] XioNoX: we didn't know it was the network, we assumed at first it was some kind of traffic or app issue
[09:02:28] it took us too much time to understand what was happening
[09:02:41] ok
[09:02:58] by looking at https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=84&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=cache_text&var-instance=All
[09:03:27] that is of course not an issue in itself, but it showed something strange was going on
[09:04:04] I don't expect you to know what happened, but pointers like running mtr or similar would be welcome
[09:04:31] or if you can see any kind of flapping or saturation on a link at those times
[09:05:04] XioNoX: I think this is not normal: https://grafana.wikimedia.org/d/000000562/network-errors-by-cluster?panelId=2&fullscreen&orgId=1&from=now-7d&to=now
[09:05:12] we got quite a spike of traffic around 16:00 UTC
[09:05:40] yeah, I saw that, but do you think it could be enough to affect the network?
[09:06:37] it depends on many factors, including how it was distributed
[09:06:48] but it doesn't match the 18-22 UTC window
[09:08:06] that spike of retransmits is not normal, but it was on Monday; it seems related to a spike of outbound traffic as well
[09:08:32] so could you spend some time trying to understand what happened yesterday, and/or give any advice on how to proceed?
[09:08:48] or even rule out that it was a network issue
[09:09:15] yeah, I'm digging
[09:09:32] thanks, XioNoX
[09:10:03] regarding traffic, we have seen lower availability since yesterday, but that didn't correlate with the issues
[09:10:11] (on varnish)
[09:10:30] unrelated, but thumbor started sending a lot of ICMP dest unreach https://grafana.wikimedia.org/d/000000366/network-performances-global?panelId=20&fullscreen&orgId=1&from=now-3h&to=now
[09:19:38] localhost > localhost: ICMP localhost udp port syslog unreachable
[09:19:38] localhost.46156 > localhost.syslog
[09:19:38] haproxy is trying to send syslog messages to udp/514 but nothing is listening
[09:20:22] herron/godog ^ ?
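A minimal sketch of how that diagnosis could be double-checked on one of the thumbor hosts; the commands are illustrative (not ones actually run in the channel) and assume iproute2's ss and tcpdump are installed:

    # Nothing should be listening on udp/514 if the ICMP port-unreachables are
    # haproxy's syslog datagrams bouncing off a closed port.
    sudo ss -lunp 'sport = :514'
    # Watch the unreachables themselves on loopback, matching the capture pasted above.
    sudo tcpdump -i lo 'icmp and icmp[icmptype] = icmp-unreach'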
[09:21:40] XioNoX: https://docs.google.com/document/d/1Z7ZVdtGTKGgCsGCr9FXJ19LmeBV3jIDq0-B7wvP0gYk/edit
[09:22:18] jynus: thanks!
[09:23:05] people were complaining with things like: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='no.wikipedia.org', port=443): Read timed out.
[09:23:22] and I myself had issues with CSS not loading properly several times
[09:23:45] XioNoX: indeed, haproxy should be using the local syslog unix socket, not UDP syslog
[09:25:50] godog: should I open a task or do you have it covered?
[09:26:35] XioNoX: please open a task! #thumbor and #observability
[09:31:00] godog: https://phabricator.wikimedia.org/T225284
[09:32:37] thanks XioNoX! appreciate it
[09:33:23] random find of the day :)
[09:33:34] now back to the esams issue
[09:33:55] indeed, monitoring is like that, the closer you look the more broken stuff you find :)
[09:41:23] <_joe_> XioNoX: as promised https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#PHP7_opcache_health
[09:51:28] _joe_: thank you, that is very useful!
[09:59:57] thanks!
[10:08:43] <_joe_> It will now be linked to the alerts
[10:10:14] akosiaris: sent a couple of etcd patches your way for review cc fsero
[10:13:35] <_joe_> arturo: you should mainly refer to me regarding etcd
[10:13:44] ack _joe_
[10:13:54] let me add you as a reviewer then
[10:14:01] <_joe_> it's not just used for kubernetes, it's also our poor man's Chubby for production
[10:14:09] <_joe_> without it, mediawiki doesn't work.
[10:14:25] <_joe_> or rather, an application server can't start without it
[10:44:36] XioNoX: already in Dublin? :-)
[10:51:42] yup :)
[11:01:11] godog: you're right, I'm finding issues unrelated to the one I'm supposed to investigate, see https://phabricator.wikimedia.org/T225296
[12:13:48] XioNoX: ouch, thanks for filing the task!
[12:56:23] elukey: you around?
[12:58:47] I'm merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/514988 since it has +1 x3
[13:23:41] arturo: sorry, just seen the ping, please go ahead
[13:29:51] jynus: I can't find signs of network congestion; the google doc seems to point to a single user overloading the system, maybe causing the 503s?
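As for the "did anyone run some mtr?" pointer earlier in the investigation, a sketch of what such a run could look like from an affected client, reusing the hostname from the user-reported timeout (flags assume a reasonably recent mtr):

    # TCP probes to port 443 exercise the same path as the timing-out HTTPS requests;
    # report mode with 100 cycles gives per-hop loss and latency over a few minutes.
    mtr --report --report-cycles 100 --tcp --port 443 no.wikipedia.org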