[00:21:18] 10SRE, 10observability, 10serviceops, 10Datacenter-Switchover: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (10Legoktm) [01:07:02] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10Legoktm) Maybe this is a silly question, but why does PuppetDB need to b... [01:38:15] Platonides: bd808 I've documented x2 at https://wikitech.wikimedia.org/wiki/MariaDB#Extension_storage [01:39:01] it's currently empty and not yet known by MW. It will be backend for MainStash, read-write and bi-di replicated. [01:42:05] 10SRE, 10Wikimedia-Mailing-lists: Fix Mailman3 log rotate - https://phabricator.wikimedia.org/T285376 (10Legoktm) 05Open→03Resolved [02:07:11] 10SRE, 10Wikimedia-Mailing-lists: Fix Mailman3 log rotate - https://phabricator.wikimedia.org/T285376 (10Legoktm) >>! In T285376#7174385, @Legoktm wrote: > I'll file a bug upstream in Debian too. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=990336 [02:10:03] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:17:14] (03CR) 10Subramanya Sastry: [C: 03+1] "You will need to schedule this in a swat window after the train is deployed .. so thursday evening on https://wikitech.wikimedia.org/wiki/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [05:02:33] 10SRE, 10Wikimedia-Mailing-lists: Wikimedia-l Digests no longer received as of June 18, 2021 - https://phabricator.wikimedia.org/T285486 (10Legoktm) >>! In T285486#7175992, @Legoktm wrote: > And I think there's a Mailman bug here that if for whatever reason digests get disabled, members should be switched to r... 
[07:31:53] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:35:33] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:03:01] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:06:41] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:54:27] PROBLEM - Check systemd state on mw1367 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:41] (03CR) 10Hashar: "recheck after https://gerrit.wikimedia.org/r/c/integration/config/+/701597" [software/statograph] - 10https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis) [09:10:55] (03CR) 10jerkins-bot: [V: 04-1] statograph: Initial commit [software/statograph] - 10https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis) [09:26:08] weird, mw1367's restart service says failed for status=4/NOPERMISSION [09:26:12] that I have never seen [09:26:15] trying to restart [09:27:36] ah the script itself returns 4 [09:30:45] interesting, with set -x the script ends up in [09:30:45] APCU_FRAGMENTATION='parse error: Invalid numeric literal at line 1, column 10' [09:31:52] ah ok php7adm /apcu-frag returns an error page [09:33:03] and on php fpms' logs I see [09:33:05] PHP Fatal error: Allowed memory size of 524288000 bytes exhausted etc.. 
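The failure above is what T285593 (filed a few lines below) ends up being about: the restart check shells out to php7adm /apcu-frag and apparently parses the result with jq (the "Invalid numeric literal" message is jq's), so when php-fpm is out of memory and php7adm returns an error page instead of JSON, the script exits 4 and the systemd unit goes degraded. The following is a minimal Python sketch of the idea of making such a check resilient to php7adm error pages; it is not the real puppet-managed script, and the JSON key name, threshold and exit-code policy are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Sketch of an APCu-fragmentation check that tolerates php7adm error pages
(the idea behind T285593). Not the real check script: the JSON key name,
threshold and exit codes below are assumptions for illustration only."""
import json
import subprocess
import sys

FRAG_THRESHOLD = 0.90  # hypothetical "fragmentation too high" threshold


def get_fragmentation():
    """Return the APCu fragmentation ratio, or None if it cannot be read."""
    try:
        proc = subprocess.run(
            ["php7adm", "/apcu-frag"],  # endpoint seen in the log above
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    try:
        # Assumption: a healthy php7adm prints a JSON object; an HTML/plain
        # error page (e.g. when php-fpm itself is out of memory) is not valid
        # JSON, which is what broke the jq-based parsing in the log above.
        return float(json.loads(proc.stdout)["fragmentation"])  # hypothetical key
    except (ValueError, KeyError, TypeError):
        return None


def main() -> int:
    frag = get_fragmentation()
    if frag is None:
        # Don't conflate "the probe is broken" with "the cache is fragmented":
        # report UNKNOWN instead of failing the unit outright.
        print("UNKNOWN: could not read APCu fragmentation from php7adm")
        return 3
    if frag >= FRAG_THRESHOLD:
        print(f"CRITICAL: APCu fragmentation {frag:.2f} >= {FRAG_THRESHOLD}")
        return 2
    print(f"OK: APCu fragmentation {frag:.2f}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```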
[09:33:40] !log restart php-fpm on mw1367 (php fatal memory errors, php7adm /apcu-frag returns errors) [09:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:13] all right better now [09:35:25] RECOVERY - Check systemd state on mw1367 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:35:36] and we have some workers with php-fpm busy (> 60%) [09:38:59] !log reboot mw1414 (not reachable via ssh, nor via mgmt console) [09:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:10] this is probably a new node [09:44:25] !log restart php-fpm on mw1354 [09:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:07] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [09:45:27] !log restart php-fpm on mw135[4-5] [09:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:46] ok the above ones were showing a very slow response time [09:47:03] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [09:52:59] avg latency for gets is around 400ms, the p99 is 2s (doubled), not great [09:53:36] err 95th [09:58:37] the % of appservers with busy workers is again at 10% [10:00:10] well goes back and forth [10:00:27] nothing seems to be really on fire, I would wait a bit to see how things are going before taking more actions [10:00:45] I don't see anything weird going on that would require paging in people [10:02:51] PROBLEM - Check systemd state on mw1372 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:46] 10SRE, 10serviceops: php7.2-fpm_check_restart should be resilient to php7adm error pages - https://phabricator.wikimedia.org/T285593 (10elukey) [10:07:30] !log restart php-fpm on mw1372 - T285593 [10:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:38] T285593: php7.2-fpm_check_restart should be resilient to php7adm error pages - https://phabricator.wikimedia.org/T285593 [10:08:04] !log restart php-fpm on mw1372 - T285593 [10:08:08] ah snap [10:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:41] RECOVERY - Check systemd state on mw1372 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:34] * elukey bbiab [11:47:30] the % of busy appservers is within range now, but the avg latency is still ~100ms higher than normal [11:48:17] and 95th percentile is ~2s, that is double than usual [12:18:33] nice, Krinkle [12:59:34] ah lovely, avg is now ~800ms, this is definitely not great [13:06:30] elukey: yo? 
I think you sent me something but victorops logged me out [13:07:16] so yesterday IIRC we had a bunch of appservers seemingly get 'stuck' with busy workers [13:07:29] elukey: I'm on mobile but if nobody is around I can get to a laptop within 15m [13:08:00] hey folks, thanks for joining, I just need some help in figuring out what's happening, nothing completely broken now [13:08:07] but mw latency is not great [13:08:10] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=54&orgId=1&from=now-24h&to=now [13:08:15] doesn't look quite as bad today so far [13:08:52] here too [13:09:20] cdanis: I restarted a couple of appservers earlier on, and decided to wait a bit to see if things improved, but after 12UTC the latency steadily increased [13:09:38] I tried to check on the apache logs to find some clue, but I didn't spot a clear trend [13:09:41] yeah [13:09:50] ciao godog [13:10:03] I wasn't digging yesterday, I think rzl was? but I don't recall there being a diagnosis for why appservers were getting wedged [13:10:03] elukey: ciao! [13:11:10] cdanis: IIRC the trigger was identified in https://phabricator.wikimedia.org/T285538 [13:11:43] given few people are around I'm noy rushing back to the laptop but feel free to ping me on VO if I'm needed please [13:11:48] *not [13:13:23] elukey: one thing I noticed both today and yesterday is a sizable increase in the wanobjectcache miss rate for both SqlBlobStore_blob and also filerepo_file_foreign_description [13:13:38] not sure what to make of that [13:15:28] yeah I recall that, not sure either [13:15:33] I see from lvs1016 a lot of [13:15:34] ERROR: Monitoring instance ProxyFetch reports server mw1333.eqiad.wmnet (enabled/up/pooled) down: Getting https://en.wikipedia.org/wiki/Special:BlankPage took longer than 5 seconds. [13:16:24] so maybe the higher latency is not exacerbated by pybal removing appservers [13:16:47] or maybe only some of them are showing this [13:17:11] should I try to restart php-fpm on a couple of them to see if they improve?
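The ProxyFetch errors quoted above are PyBal timing a fetch of Special:BlankPage against each backend and flagging anything slower than 5 seconds. When a single appserver like mw1333 looks suspect, roughly the same probe can be run by hand; the sketch below is not PyBal's actual code and makes a few assumptions: the appserver answers on 443 from where you run it, it serves en.wikipedia.org based on the Host header, and certificate verification is skipped because the backend will not present a certificate for that name.

```python
#!/usr/bin/env python3
"""Rough manual version of PyBal's ProxyFetch latency check: time a fetch of
Special:BlankPage from one specific appserver against the 5s budget."""
import sys
import time

import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

TIMEOUT = 5.0  # seconds, the same budget ProxyFetch complains about


def probe(host: str) -> None:
    url = f"https://{host}/wiki/Special:BlankPage"
    start = time.monotonic()
    try:
        resp = requests.get(
            url,
            headers={"Host": "en.wikipedia.org"},  # ask the backend for enwiki
            timeout=TIMEOUT,
            verify=False,  # backend cert won't match en.wikipedia.org (assumption)
        )
        elapsed = time.monotonic() - start
        print(f"{host}: HTTP {resp.status_code} in {elapsed:.2f}s")
    except requests.RequestException as exc:
        elapsed = time.monotonic() - start
        print(f"{host}: FAILED after {elapsed:.2f}s ({exc})")


if __name__ == "__main__":
    # e.g. ./proxyfetch_probe.py mw1333.eqiad.wmnet mw1384.eqiad.wmnet
    for host in sys.argv[1:] or ["mw1333.eqiad.wmnet"]:
        probe(host)
```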
[13:17:18] while others are digging in [13:18:28] there are several appservers that are totally saturated with 0 idle workers [13:18:45] elukey: https://w.wiki/3Ydv restart these in the table with value 0 [13:20:17] cdanis: mw1333 is the one I am looking at (got it from pybal's log), seem a good target [13:20:34] !log restart-phpfpm on mw1333 (0 idle php workers) [13:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:23] !log restart-phpfpm on mw1350 (0 idle php workers) [13:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:03] we should probably leave 1-2 as-is for investigation [13:25:09] I don't see 33/50 in the pybal logs anymore [13:25:54] but I feel that we may need to roll restart them all, probably quicker than playing whack a mole :D [13:26:35] well maybe not, I see 10 left in your list cnadis [13:26:38] *cdanis [13:26:47] I am going to roll restart 5/6 of them [13:28:00] 👍 [13:28:37] PROBLEM - Check systemd state on mw1349 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:20] !log restart php-fpm on mw1351 mw1373 mw1352 mw1349 [13:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:35] RECOVERY - Check systemd state on mw1349 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:06] ok the latency trend is better now (going down) [13:32:40] I was checking weird trends on traffic, but I see none [13:33:02] I was also checking weirdness on es (because of blob comment), pc, see none [13:33:08] !log restart phpfpm on mw1353 mw1365 mw1371 [13:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:15] lots of activity on az, commons, but nothing out of the ordinary [13:34:33] yes me too, really strange [13:37:38] cdanis: from lvs1016's logs I see mostly mw1384 right now failing the health check, should I depool it for investigation? [13:37:54] mw1370 is also flapping sometimes [13:38:46] given that things seem under control I'll resolve the incident (or it'll re-page tomorrow), sounds good ? [13:39:28] elukey: yeah please depool [13:40:17] godog: sure [13:40:33] ok! [13:43:19] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1384.eqiad.wmnet [13:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:34] !log depool mw1384 for investigation [13:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:11] pybal seems happy so far (there are still a couple of nodes with 0 idle workers from cdanis' grafana list but they seem not to fail pybal's health check) [13:46:44] latency is way better now, even if p95 still looks a little high, but I'd say that we should be good for the moment [13:47:00] if it rehappens I'd say to just roll restart them all [13:47:14] to start fresh [13:47:53] SGTM [13:48:31] I'll be afk but otherwise available [13:49:24] !log restart php-fpm on mw1368 mw1370 mw1366 mw1409 [13:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:03] cdanis: your list is basically cleared, not sure if anybody has time/ideas for mw1384 - if not I'd say to just restart php-fpm and repool it [13:52:51] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:53:17] nice :) [13:53:23] now graphs look really good [13:54:26] thanks all for the help, hope to not see/read from you again this weekend :) [13:54:57] I'll keep mw1384 depooled, and repool it say later on if nobody has better ideas [13:55:39] (03CR) 10Paladox: "Think I've got it all now." [puppet] - 10https://gerrit.wikimedia.org/r/701632 (owner: 10Paladox) [14:00:37] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:15:36] elukey: latency seems to be going back up again :( not as high as before, but going up steadily [14:37:43] PROBLEM - Check systemd state on mw1384 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:03] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10bd808) >>! In T285539#7178460, @Legoktm wrote: > Maybe this is a silly q... [15:18:46] (03PS1) 10Zabe: flood flag changes for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) [15:19:27] (03CR) 10Zabe: [C: 04-1] flood flag changes for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) (owner: 10Zabe) [15:20:02] (03CR) 10Zabe: [C: 04-1] flood flag changes for enwikibooks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) (owner: 10Zabe) [15:25:08] urbanecm: yeah we are back in the same state as before :( [15:25:45] looks so. i have to say i feel like WM projects are slower than normal (and a lot faster when i log out), but that might be just because i saw the latency graphs :D [15:25:54] (03PS2) 10Zabe: flood flag changes for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) [15:26:20] an interesting thing that I want to check is if the same hosts that I have restarted are showing problems again [15:26:39] because from a quick look in pybal's logs it seems that they are other nodes [15:26:43] (03CR) 10jerkins-bot: [V: 04-1] flood flag changes for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) (owner: 10Zabe) [15:27:25] but why would random nodes suddenly go slower and slower?
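A full roll restart of the eqiad appservers keeps coming up as the fallback ("probably quicker than playing whack a mole"). The sketch below shows what such a loop could look like done conservatively, one host at a time so enough capacity stays pooled. It assumes passwordless ssh from the operator host, the depool/pool conftool wrappers on the appservers, and the php7.2-fpm unit name; in practice this would be driven through cumin/spicerack rather than a hand-rolled script, and the restart-phpfpm wrapper seen in the !log entries may already handle part of this.

```python
#!/usr/bin/env python3
"""Rough sketch of a serial appserver roll restart: depool, restart php-fpm,
wait, repool. Assumptions: ssh access, depool/pool wrappers on the hosts,
php7.2-fpm unit name; not a production cookbook."""
import subprocess
import sys
import time

SETTLE_SECONDS = 30  # arbitrary pause to let the host warm up before repooling


def ssh(host: str, command: str) -> None:
    """Run a single command on the remote host, raising if it fails."""
    subprocess.run(["ssh", host, command], check=True)


def roll_restart(hosts):
    for host in hosts:
        print(f"--- {host}")
        ssh(host, "depool")                             # drop from the LVS pools
        time.sleep(10)                                  # let in-flight requests drain
        ssh(host, "sudo systemctl restart php7.2-fpm")  # clear wedged workers / APCu state
        time.sleep(SETTLE_SECONDS)
        ssh(host, "pool")                               # put it back in rotation
        print(f"--- {host} done")


if __name__ == "__main__":
    # e.g. ./roll_restart.py mw1351.eqiad.wmnet mw1373.eqiad.wmnet
    roll_restart(sys.argv[1:])
```

Doing this serially with a settle period trades speed for safety: the whack-a-mole restarts above suggest a fresh php-fpm clears the weird state, so there is no need to take more than one host out of rotation at a time.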
[15:27:51] (03PS3) 10Zabe: flood flag changes for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) [15:29:25] (i guess that's a question no one has an answer to yet) [15:29:33] urbanecm: my running theory is that yesterday's trouble caused some weird state in appservers, and there must be a trigger that makes them start thrashing [15:29:46] but I can confirm that the hosts that I have restarted earlier on seem fine [15:31:26] !log restart php-fpm on mw1391 mw1389 mw1403 [15:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:13] maybe a complete roll restart is what we need right now [15:33:00] i'm not able to judge on that [15:37:05] !log restart php-fpm on mw1397 mw1395 mw1411 mw1407 [15:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:35] latency dropped 200ms [15:38:38] let's hope it won't go back up in several minutes [15:38:45] thanks for working on it during the weekend elukey :) [15:39:25] !log restart php-fpm on mw1405 mw1399 mw1385 [15:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:33] urbanecm: I can say the same thing for you :) [15:39:59] ok I should have restarted the bad workers [15:40:04] and latency is dropping as expected [15:40:13] i'm not really working on the latency in any way. I only opened the graph and pinged you :) [15:40:36] but at this point we may get into the weird state again, but I'd love some confirmation from other SREs before doing a roll restart :D [15:40:57] urbanecm: you always work a ton on Mediawiki so we should say thank you every now and then :) [15:41:13] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:42:46] well, that feels nice to hear. i'm glad i'm helpful. [15:43:13] !log restart php-fpm on mw1393 [15:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:27] so afaics from the sal --^ I haven't restarted the same node two times [15:43:53] that makes me think that after the restart the weird state gets cleared [15:44:05] let's hope [15:45:03] we have 65 appservers in eqiad, lemme count how many I have restarted [15:45:32] seems 31 more or less [15:45:45] so potentially half of them are still in the weird state [15:46:19] going to check in half an hour :) [15:47:12] ttyl urbanecm! [15:47:30] ttyl elukey [16:37:08] !log restart php-fpm on mw1387 [16:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:27] (03PS1) 10Ladsgroup: mailman: Enable verp probes [puppet] - 10https://gerrit.wikimedia.org/r/701658 (https://phabricator.wikimedia.org/T285361) [16:51:43] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Ladsgroup) Also this documentation is useful https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/model/docs/bounce.html#verp-probes [16:51:50] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Ladsgroup) @Legoktm It looks easy but do you have a way to test it?
We can just enable it in production and see if it solves the issue, otherwise the only way to test it is to s... [17:33:11] PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:48:10] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Legoktm) >>! In T285361#7178885, @Ladsgroup wrote: > @Legoktm It looks easy but do you have a way to test it? We can just enable it in production and see if it solves the issue,... [18:08:37] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:31:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:33:51] RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:44:39] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:00:03] this one seems not related to what happened before --^ [19:10:02] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Ladsgroup) okay, let me give it a try [19:10:57] PROBLEM - MD RAID on mw2380 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:10:58] ACKNOWLEDGEMENT - MD RAID on mw2380 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T285603 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:11:02] 10SRE, 10ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (10ops-monitoring-bot) [19:51:54] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Ladsgroup) Got this: ` Jun 26 19:47:59 2021 (16060) Member ladsgroup+test@gmail.com on list test70.polymorphic.lists.wmcloud.org, bounce score 71 >= threshold 5, sending probe.... 
[20:00:39] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Ladsgroup) There is nothing in exim4 logs though :/ [20:19:59] PROBLEM - Device not healthy -SMART- on mw2380 is CRITICAL: cluster=jobrunner device=sda instance=mw2380 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw2380&var-datasource=codfw+prometheus/ops [20:24:32] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Ladsgroup) I got this in exim4: ` 2021-06-26 19:31:03 1lxE19-0003pA-FQ <= test4-bounces+ladsgroup+mailmanroot=gmail.com@polymorphic.lists.wmcloud.org H=localhost (mailman03.mail... [20:53:06] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.56 [software/spicerack] - 10https://gerrit.wikimedia.org/r/701663 [20:53:28] (03PS1) 10Ladsgroup: labs: Set json for metadata array and split metadata to ES when needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701664 (https://phabricator.wikimedia.org/T275268) [20:56:45] (03CR) 10Ladsgroup: [C: 03+2] "noop for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701664 (https://phabricator.wikimedia.org/T275268) (owner: 10Ladsgroup) [20:57:24] (03Merged) 10jenkins-bot: labs: Set json for metadata array and split metadata to ES when needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701664 (https://phabricator.wikimedia.org/T275268) (owner: 10Ladsgroup) [20:58:04] rebased ^ [20:59:55] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.56 [software/spicerack] - 10https://gerrit.wikimedia.org/r/701663 (owner: 10Volans) [21:05:27] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.56 [software/spicerack] - 10https://gerrit.wikimedia.org/r/701663 (owner: 10Volans) [21:08:28] (03PS1) 10Volans: Upstream release v0.0.56 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/701665 [21:12:48] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.56 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/701665 (owner: 10Volans) [21:18:14] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10faidon) Thank you @jbond for raising this topic! To noone's surprise, o... [21:23:07] !log uploaded spicerack_0.0.56 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [21:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:02] !log upgraded spicerack to v0.0.56 on the cumin hosts (includes only bug fixes for the switchdc) [21:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:02] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Ladsgroup) I'm honestly lost for know if you want to take a stab at it. [23:21:07] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook