[08:24:28] there are a lot of "WARNING: opcache free space is below 100 MB" in Icinga
[08:32:52] dunno if it should still be a warning if 7 hosts are alerting? maybe _joe_ ^ ?
[08:33:14] <_joe_> yes XioNoX I'm writing a runbook for that *right now* :D
[08:33:34] haha perfect!
[08:33:35] <_joe_> those alerts are all from codfw (also warnings) so it's not /that/ urgent
[08:58:03] XioNoX: there was some network instability yesterday evening
[08:58:36] jynus: ah?
[08:58:42] did you see that, and could you help us see if it could happen again? It happened at peak traffic, so maybe there is some saturation?
[09:00:34] approximately between 18 and 23 UTC
[09:00:36] jynus: do you have links or pointers?
[09:00:47] what were the symptoms?
[09:00:50] the initial pointer was user complaints
[09:00:58] which I also experienced myself
[09:01:04] timeouts, high latency
[09:01:30] monitoring corroborated that with several alerts: https://grafana.wikimedia.org/d/000000366/network-performances-global?panelId=21&fullscreen&orgId=1&from=now-24h&to=now
[09:01:33] did anyone run some mtr?
[09:02:18] XioNoX: we didn't know it was the network, we assumed at first it was some kind of traffic or app issue
[09:02:28] it took us too much time to understand what was happening
[09:02:41] ok
[09:02:58] by looking at https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=84&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=cache_text&var-instance=All
[09:03:27] that is of course not an issue in itself, but it showed something strange was going on
[09:04:04] I don't expect you to know what happened, but pointers like running mtr or similar would be welcome
[09:04:31] or if you can see any kind of flapping or saturation on a link at those times
[09:05:04] XioNoX: I think this is not normal: https://grafana.wikimedia.org/d/000000562/network-errors-by-cluster?panelId=2&fullscreen&orgId=1&from=now-7d&to=now
[09:05:12] we got quite a spike of traffic around 16:00 UTC
[09:05:40] yeah, I saw that, but do you think it could be enough to affect the network?
[09:06:37] it depends on many factors, including how it was distributed
[09:06:48] but it doesn't match the 18-22 UTC window
[09:08:06] that spike of retransmits is not normal, but it was on Monday; it seems related to a spike of outbound traffic as well
[09:08:32] so could you spend some time trying to understand what happened yesterday, and/or give any advice on how to proceed?
[09:08:48] or even rule out that it was a network issue
[09:09:15] yeah, I'm digging
[09:09:32] thanks, XioNoX
[09:10:03] regarding traffic, we have seen lower availability since yesterday, but that didn't correlate with the issues
[09:10:11] (on varnish)
[09:10:30] unrelated, but thumbor started sending a lot of ICMP dest unreach https://grafana.wikimedia.org/d/000000366/network-performances-global?panelId=20&fullscreen&orgId=1&from=now-3h&to=now
[09:19:38] localhost > localhost: ICMP localhost udp port syslog unreachable
[09:19:38] localhost.46156 > localhost.syslog
[09:19:38] haproxy is trying to send syslog messages to udp/514 but nothing is listening
[09:20:22] herron/godog ^ ?
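A minimal sketch of how that diagnosis could be double-checked on one of the thumbor hosts; the commands are illustrative (not ones actually run in the channel) and assume iproute2's ss and tcpdump are installed:

    # Nothing should be listening on udp/514 if the ICMP port-unreachables are
    # haproxy's syslog datagrams bouncing off a closed port.
    sudo ss -lunp 'sport = :514'
    # Watch the unreachables themselves on loopback, matching the capture pasted above.
    sudo tcpdump -i lo 'icmp and icmp[icmptype] = icmp-unreach'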
[09:21:40] XioNoX: https://docs.google.com/document/d/1Z7ZVdtGTKGgCsGCr9FXJ19LmeBV3jIDq0-B7wvP0gYk/edit
[09:22:18] jynus: thanks!
[09:23:05] people were complaining with things like: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='no.wikipedia.org', port=443): Read timed out.
[09:23:22] and I myself had issues with CSS not loading properly several times
[09:23:45] XioNoX: indeed, haproxy should be using the local syslog unix socket, not UDP syslog
[09:25:50] godog: should I open a task or do you have it covered?
[09:26:35] XioNoX: please open a task! #thumbor and #observability
[09:31:00] godog: https://phabricator.wikimedia.org/T225284
[09:32:37] thanks XioNoX! appreciate it
[09:33:23] random find of the day :)
[09:33:34] now back to the esams issue
[09:33:55] indeed, monitoring is like that, the closer you look the more broken stuff you find :)
[09:41:23] <_joe_> XioNoX: as promised https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#PHP7_opcache_health
[09:51:28] _joe_: thank you, that is very useful!
[09:59:57] thanks!
[10:08:43] <_joe_> It will now be linked to the alerts
[10:10:14] akosiaris: sent a couple of etcd patches your way for review cc fsero
[10:13:35] <_joe_> arturo: you should mainly refer to me regarding etcd
[10:13:44] ack _joe_
[10:13:54] let me add you as a reviewer then
[10:14:01] <_joe_> it's not just used for kubernetes, it's also our poor man's Chubby for production
[10:14:09] <_joe_> without it, mediawiki doesn't work.
[10:14:25] <_joe_> or rather, an application server can't start without it
[10:44:36] XioNoX: already in Dublin? :-)
[10:51:42] yup :)
[11:01:11] godog: you're right, I'm finding issues unrelated to the one I'm supposed to investigate, see https://phabricator.wikimedia.org/T225296
[12:13:48] XioNoX: ouch, thanks for filing the task!
[12:56:23] elukey: you around?
[12:58:47] I'm merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/514988 since it has +1 x3
[13:23:41] arturo: sorry, just seen the ping, please go ahead
[13:29:51] jynus: I can't find signs of network congestion; the google doc seems to point to a single user overloading the system, maybe causing the 503s?
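As for the "did anyone run some mtr?" pointer earlier in the investigation, a sketch of what such a run could look like from an affected client, reusing the hostname from the user-reported timeout (flags assume a reasonably recent mtr):

    # TCP probes to port 443 exercise the same path as the timing-out HTTPS requests;
    # report mode with 100 cycles gives per-hop loss and latency over a few minutes.
    mtr --report --report-cycles 100 --tcp --port 443 no.wikipedia.org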