[00:21:18] 10SRE, 10observability, 10serviceops, 10Datacenter-Switchover: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (10Legoktm) [01:07:02] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10Legoktm) Maybe this is a silly question, but why does PuppetDB need to b... [01:38:15] Platonides: bd808 I've documented x2 at https://wikitech.wikimedia.org/wiki/MariaDB#Extension_storage [01:39:01] it's currently empty and not yet known by MW. It will be backend for MainStash, read-write and bi-di replicated. [01:42:05] 10SRE, 10Wikimedia-Mailing-lists: Fix Mailman3 log rotate - https://phabricator.wikimedia.org/T285376 (10Legoktm) 05Open→03Resolved [02:07:11] 10SRE, 10Wikimedia-Mailing-lists: Fix Mailman3 log rotate - https://phabricator.wikimedia.org/T285376 (10Legoktm) >>! In T285376#7174385, @Legoktm wrote: > I'll file a bug upstream in Debian too. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=990336 [02:10:03] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:17:14] (03CR) 10Subramanya Sastry: [C: 03+1] "You will need to schedule this in a swat window after the train is deployed .. so thursday evening on https://wikitech.wikimedia.org/wiki/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701612 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [05:02:33] 10SRE, 10Wikimedia-Mailing-lists: Wikimedia-l Digests no longer received as of June 18, 2021 - https://phabricator.wikimedia.org/T285486 (10Legoktm) >>! In T285486#7175992, @Legoktm wrote: > And I think there's a Mailman bug here that if for whatever reason digests get disabled, members should be switched to r... 
[07:31:53] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:35:33] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:03:01] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:06:41] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:54:27] PROBLEM - Check systemd state on mw1367 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:41] (03CR) 10Hashar: "recheck after https://gerrit.wikimedia.org/r/c/integration/config/+/701597" [software/statograph] - 10https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis) [09:10:55] (03CR) 10jerkins-bot: [V: 04-1] statograph: Initial commit [software/statograph] - 10https://gerrit.wikimedia.org/r/701599 (https://phabricator.wikimedia.org/T285569) (owner: 10CDanis) [09:26:08] weird, mw1367's restart service says failed for status=4/NOPERMISSION [09:26:12] that I have never seen [09:26:15] trying to restart [09:27:36] ah the script itself returns 4 [09:30:45] interesting, with set -x the script ends up in [09:30:45] APCU_FRAGMENTATION='parse error: Invalid numeric literal at line 1, column 10' [09:31:52] ah ok php7adm /apcu-frag returns an error page [09:33:03] and on php fpms' logs I see [09:33:05] PHP Fatal error: Allowed memory size of 524288000 bytes exhausted etc.. 
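The failure above is what T285593 (filed a few lines below) ends up being about: the restart check shells out to php7adm /apcu-frag and apparently parses the result with jq (the "Invalid numeric literal" message is jq's), so when php-fpm is out of memory and php7adm returns an error page instead of JSON, the script exits 4 and the systemd unit goes degraded. The following is a minimal Python sketch of the idea of making such a check resilient to php7adm error pages; it is not the real puppet-managed script, and the JSON key name, threshold and exit-code policy are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Sketch of an APCu-fragmentation check that tolerates php7adm error pages
(the idea behind T285593). Not the real check script: the JSON key name,
threshold and exit codes below are assumptions for illustration only."""
import json
import subprocess
import sys

FRAG_THRESHOLD = 0.90  # hypothetical "fragmentation too high" threshold


def get_fragmentation():
    """Return the APCu fragmentation ratio, or None if it cannot be read."""
    try:
        proc = subprocess.run(
            ["php7adm", "/apcu-frag"],  # endpoint seen in the log above
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    try:
        # Assumption: a healthy php7adm prints a JSON object; an HTML/plain
        # error page (e.g. when php-fpm itself is out of memory) is not valid
        # JSON, which is what broke the jq-based parsing in the log above.
        return float(json.loads(proc.stdout)["fragmentation"])  # hypothetical key
    except (ValueError, KeyError, TypeError):
        return None


def main() -> int:
    frag = get_fragmentation()
    if frag is None:
        # Don't conflate "the probe is broken" with "the cache is fragmented":
        # report UNKNOWN instead of failing the unit outright.
        print("UNKNOWN: could not read APCu fragmentation from php7adm")
        return 3
    if frag >= FRAG_THRESHOLD:
        print(f"CRITICAL: APCu fragmentation {frag:.2f} >= {FRAG_THRESHOLD}")
        return 2
    print(f"OK: APCu fragmentation {frag:.2f}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```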
[09:33:40] !log restart php-fpm on mw1367 (php fatal memory errors, php7adm /apcu-frag returns errors) [09:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:13] all right better now [09:35:25] RECOVERY - Check systemd state on mw1367 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:35:36] and we have some workers with php-fpm busy (> 60%) [09:38:59] !log reboot mw1414 (not reachable via ssh, nor via mgmt console) [09:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:10] this is probably a new node [09:44:25] !log restart php-fpm on mw1354 [09:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:07] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [09:45:27] !log restart php-fpm on mw135[4-5] [09:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:46] ok the above ones were showing a very slow response time [09:47:03] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [09:52:59] avg latency for gets is around 400ms, the p99 is 2s (doubled), not great [09:53:36] err 95th [09:58:37] the % of appservers with busy workers is again at 10% [10:00:10] well goes back and forth [10:00:27] nothing seems to be really on fire, I would wait a bit to see how things are going before taking more actions [10:00:45] I don't see anything weird going on that would require paging in people [10:02:51] PROBLEM - Check systemd state on mw1372 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:46] 10SRE, 10serviceops: php7.2-fpm_check_restart should be resilient to php7adm error pages - https://phabricator.wikimedia.org/T285593 (10elukey) [10:07:30] !log restart php-fpm on mw1372 - T285593 [10:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:38] T285593: php7.2-fpm_check_restart should be resilient to php7adm error pages - https://phabricator.wikimedia.org/T285593 [10:08:04] !log restart php-fpm on mw1372 - T285593 [10:08:08] ah snap [10:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:41] RECOVERY - Check systemd state on mw1372 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:34] * elukey bbiab [11:47:30] the % of busy appservers is within range now, but the avg latency is still ~100ms higher than normal [11:48:17] and 95th percentile is ~2s, that is double than usual [12:18:33] nice, Krinkle [12:59:34] ah lovely, avg is now ~800ms, this is definitely not great [13:06:30] elukey: yo? 
I think you sent me something but victorops logged me out [13:07:16] so yesterday IIRC we had a bunch of appservers seemingly get 'stuck' with busy workers [13:07:29] elukey: I'm on mobile but if nobody is around I can get to a laptop within 15m [13:08:00] hey folks, thanks for joining, I just need some help in figuring out what's happening, nothing completely broken now [13:08:07] but mw latency is not great [13:08:10] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=54&orgId=1&from=now-24h&to=now [13:08:15] doesn't look quite as bad today so far [13:08:52] here too [13:09:20] cdanis: I restarted a couple of appservers earlier on, and decided to wait a bit to see if things improved, but after 12UTC the latency steadily increased [13:09:38] I tried to check on the apache logs to find some clue, but I didn't spot a clear trend [13:09:41] yeah [13:09:50] ciao godog [13:10:03] I wasn't digging yesterday, I think rzl was? but I don't recall there being a diagnosis for why appservers were getting wedged [13:10:03] elukey: ciao! [13:11:10] cdanis: IIRC the trigger was identified in https://phabricator.wikimedia.org/T285538 [13:11:43] given few people are around I'm noy rushing back to the laptop but feel free to ping me on VO if I'm needed please [13:11:48] *not [13:13:23] elukey: one thing I noticed both today and yesterday is a sizable increase in the wanobjectcache miss rate for both SqlBlobStore_blob and also filerepo_file_foreign_description [13:13:38] not sure what to make of that [13:15:28] yeah I recall that, not sure either [13:15:33] I see from lvs1016 a lot of [13:15:34] ERROR: Monitoring instance ProxyFetch reports server mw1333.eqiad.wmnet (enabled/up/pooled) down: Getting https://en.wikipedia.org/wiki/Special:BlankPage took longer than 5 seconds. [13:16:24] so maybe the higher latency is not exacerbated by pybal removing appservers [13:16:47] or maybe only some of them are showing this [13:17:11] should I try to restart php-fpm on a couple of them to see if they improve?
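The ProxyFetch errors quoted above are PyBal timing a fetch of Special:BlankPage against each backend and flagging anything slower than 5 seconds. When a single appserver like mw1333 looks suspect, roughly the same probe can be run by hand; the sketch below is not PyBal's actual code and makes a few assumptions: the appserver answers on 443 from where you run it, it serves en.wikipedia.org based on the Host header, and certificate verification is skipped because the backend will not present a certificate for that name.

```python
#!/usr/bin/env python3
"""Rough manual version of PyBal's ProxyFetch latency check: time a fetch of
Special:BlankPage from one specific appserver against the 5s budget."""
import sys
import time

import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

TIMEOUT = 5.0  # seconds, the same budget ProxyFetch complains about


def probe(host: str) -> None:
    url = f"https://{host}/wiki/Special:BlankPage"
    start = time.monotonic()
    try:
        resp = requests.get(
            url,
            headers={"Host": "en.wikipedia.org"},  # ask the backend for enwiki
            timeout=TIMEOUT,
            verify=False,  # backend cert won't match en.wikipedia.org (assumption)
        )
        elapsed = time.monotonic() - start
        print(f"{host}: HTTP {resp.status_code} in {elapsed:.2f}s")
    except requests.RequestException as exc:
        elapsed = time.monotonic() - start
        print(f"{host}: FAILED after {elapsed:.2f}s ({exc})")


if __name__ == "__main__":
    # e.g. ./proxyfetch_probe.py mw1333.eqiad.wmnet mw1384.eqiad.wmnet
    for host in sys.argv[1:] or ["mw1333.eqiad.wmnet"]:
        probe(host)
```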
[13:17:18] while others are digging in [13:18:28] there are several appservers that are totally saturated with 0 idle workers [13:18:45] elukey: https://w.wiki/3Ydv restart these in the table with value 0 [13:20:17] cdanis: mw1333 is the one I am looking at (got it from pybal's log), seem a good target [13:20:34] !log restart-phpfpm on mw1333 (0 idle php workers) [13:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:23] !log restart-phpfpm on mw1350 (0 idle php workers) [13:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:03] we should probably leave 1-2 as-is for investigation [13:25:09] I don't see 33/50 in the pybal logs anymore [13:25:54] but I feel that we may need to roll restart them all, probably quicker than playing whack a mole :D [13:26:35] well maybe not, I see 10 left in your list cnadis [13:26:38] *cdanis [13:26:47] I am going to roll restart 5/6 of them [13:28:00] 👍 [13:28:37] PROBLEM - Check systemd state on mw1349 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:20] !log restart php-fpm on mw1351 mw1373 mw1352 mw1349 [13:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:35] RECOVERY - Check systemd state on mw1349 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:06] ok the latency trend is better now (going down) [13:32:40] I was checking weird trends on traffic, but I see none [13:33:02] I was also checking weirdness on es (because of blob comment), pc, see none [13:33:08] !log restart phpfpm on mw1353 mw1365 mw1371 [13:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:15] lots of activity on az, commons, but nothing out of the ordinary [13:34:33] yes me too, really strange [13:37:38] cdanis: from lvs1016's logs I see mostly mw1384 right now failing the health check, should I depool it for investigation? [13:37:54] mw1370 is also flapping sometimes [13:38:46] given that things seem under control I'll resolve the incident (or it'll re-page tomorrow), sounds good ? [13:39:28] elukey: yeah please depool [13:40:17] godog: sure [13:40:33] ok! [13:43:19] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1384.eqiad.wmnet [13:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:34] !log depool mw1384 for investigation [13:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:11] pybal seems happy so far (there are still a couple of nodes with 0 idle workers from cdanis' grafana list but they seem not to fail pybal's health check) [13:46:44] latency is way better now, even if p95 still looks a little high, but I'd say that we should be good for the moment [13:47:00] if it rehappens I'd say to just roll restart them all [13:47:14] to start fresh [13:47:53] SGTM [13:48:31] I'll be afk but otherwise available [13:49:24] !log restart php-fpm on mw1368 mw1370 mw1366 mw1409 [13:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:03] cdanis: your list is basically cleared, not sure if anybody has time/ideas for mw1384 - if not I'd say to just restart php-fpm and repool it [13:52:51] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:53:17] nice :) [13:53:23] now graphs look really good [13:54:26] thanks all for the help, hope to not see/read from you again this weekend :) [13:54:57] I'll keep mw1384 depooled, and repool it say later on if nobody has better ideas [13:55:39] (03CR) 10Paladox: "Think I've got it all now." [puppet] - 10https://gerrit.wikimedia.org/r/701632 (owner: 10Paladox) [14:00:37] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:15:36] elukey: latency seems to be going back up again :( not as high as before, but going up steadily [14:37:43] PROBLEM - Check systemd state on mw1384 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:03] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10bd808) >>! In T285539#7178460, @Legoktm wrote: > Maybe this is a silly q... [15:18:46] (03PS1) 10Zabe: flood flag changes for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) [15:19:27] (03CR) 10Zabe: [C: 04-1] flood flag changes for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) (owner: 10Zabe) [15:20:02] (03CR) 10Zabe: [C: 04-1] flood flag changes for enwikibooks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) (owner: 10Zabe) [15:25:08] urbanecm: yeah we are back in the same state as before :( [15:25:45] looks so. i have to say i feel like WM projects are slower than normal (and a lot faster when i log out), but that might be just because i saw the latency graphs :D [15:25:54] (03PS2) 10Zabe: flood flag changes for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) [15:26:20] an interesting thing that I want to check is if the same hosts that I have restarted are showing problems again [15:26:39] because from a quick look in pybal's logs it seems that they are other nodes [15:26:43] (03CR) 10jerkins-bot: [V: 04-1] flood flag changes for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) (owner: 10Zabe) [15:27:25] but why would random nodes suddenly go slower and slower?
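A full roll restart of the eqiad appservers keeps coming up as the fallback ("probably quicker than playing whack a mole"). The sketch below shows what such a loop could look like done conservatively, one host at a time so enough capacity stays pooled. It assumes passwordless ssh from the operator host, the depool/pool conftool wrappers on the appservers, and the php7.2-fpm unit name; in practice this would be driven through cumin/spicerack rather than a hand-rolled script, and the restart-phpfpm wrapper seen in the !log entries may already handle part of this.

```python
#!/usr/bin/env python3
"""Rough sketch of a serial appserver roll restart: depool, restart php-fpm,
wait, repool. Assumptions: ssh access, depool/pool wrappers on the hosts,
php7.2-fpm unit name; not a production cookbook."""
import subprocess
import sys
import time

SETTLE_SECONDS = 30  # arbitrary pause to let the host warm up before repooling


def ssh(host: str, command: str) -> None:
    """Run a single command on the remote host, raising if it fails."""
    subprocess.run(["ssh", host, command], check=True)


def roll_restart(hosts):
    for host in hosts:
        print(f"--- {host}")
        ssh(host, "depool")                             # drop from the LVS pools
        time.sleep(10)                                  # let in-flight requests drain
        ssh(host, "sudo systemctl restart php7.2-fpm")  # clear wedged workers / APCu state
        time.sleep(SETTLE_SECONDS)
        ssh(host, "pool")                               # put it back in rotation
        print(f"--- {host} done")


if __name__ == "__main__":
    # e.g. ./roll_restart.py mw1351.eqiad.wmnet mw1373.eqiad.wmnet
    roll_restart(sys.argv[1:])
```

Doing this serially with a settle period trades speed for safety: the whack-a-mole restarts above suggest a fresh php-fpm clears the weird state, so there is no need to take more than one host out of rotation at a time.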
[15:27:51] (03PS3) 10Zabe: flood flag changes for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701651 (https://phabricator.wikimedia.org/T285594) [15:29:25] (i guess that's a question no one has an answer to yet) [15:29:33] urbanecm: my running theory is that yesterday's trouble caused some weird state in appservers, and there must be a trigger that makes them start thrashing [15:29:46] but I can confirm that the hosts that I have restarted earlier on seem fine [15:31:26] !log restart php-fpm on mw1391 mw1389 mw1403 [15:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:13] maybe a complete roll restart is what we need right now [15:33:00] i'm not able to judge on that [15:37:05] !log restart php-fpm on mw1397 mw1395 mw1411 mw1407 [15:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:35] latency dropped 200ms [15:38:38] let's hope it won't go back up in several minutes [15:38:45] thanks for working on it during the weekend elukey :) [15:39:25] !log restart php-fpm on mw1405 mw1399 mw1385 [15:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:33] urbanecm: I can say the same thing for you :) [15:39:59] ok I should have restarted the bad workers [15:40:04] and latency is dropping as expected [15:40:13] i'm not really working on the latency in any way. I only opened the graph and pinged you :) [15:40:36] but at this point we may get into the weird state again, but I'd love some confirmation from other SREs before doing a roll restart :D [15:40:57] urbanecm: you always work a ton on Mediawiki so we should say thank you every now and then :) [15:41:13] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:42:46] well, that feels nice to hear. i'm glad i'm helpful. [15:43:13] !log restart php-fpm on mw1393 [15:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:27] so afaics from the sal --^ I haven't restarted the same node two times [15:43:53] that makes me think that after the restart the weird state gets cleared [15:44:05] let's hope [15:45:03] we have 65 appservers in eqiad, lemme count how many I have restarted [15:45:32] seems 31 more or less [15:45:45] so potentially half of them are still in the weird state [15:46:19] going to check in half an hour :) [15:47:12] ttyl urbanecm! [15:47:30] ttyl elukey [16:37:08] !log restart php-fpm on mw1387 [16:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:27] (03PS1) 10Ladsgroup: mailman: Enable verp probes [puppet] - 10https://gerrit.wikimedia.org/r/701658 (https://phabricator.wikimedia.org/T285361) [16:51:43] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Ladsgroup) Also this documentation is useful https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/model/docs/bounce.html#verp-probes [16:51:50] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Ladsgroup) @Legoktm It looks easy but do you have a way to test it?
We can just enable it in production and see if it solves the issue, otherwise the only way to test it is to s... [17:33:11] PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:48:10] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Legoktm) >>! In T285361#7178885, @Ladsgroup wrote: > @Legoktm It looks easy but do you have a way to test it? We can just enable it in production and see if it solves the issue,... [18:08:37] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:31:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:33:51] RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:44:39] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:00:03] this one seems not related to what happened before --^ [19:10:02] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Ladsgroup) okay, let me give it a try [19:10:57] PROBLEM - MD RAID on mw2380 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:10:58] ACKNOWLEDGEMENT - MD RAID on mw2380 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T285603 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:11:02] 10SRE, 10ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (10ops-monitoring-bot) [19:51:54] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Ladsgroup) Got this: ` Jun 26 19:47:59 2021 (16060) Member ladsgroup+test@gmail.com on list test70.polymorphic.lists.wmcloud.org, bounce score 71 >= threshold 5, sending probe.... 
[20:00:39] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Ladsgroup) There is nothing in exim4 logs though :/ [20:19:59] PROBLEM - Device not healthy -SMART- on mw2380 is CRITICAL: cluster=jobrunner device=sda instance=mw2380 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw2380&var-datasource=codfw+prometheus/ops [20:24:32] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Ladsgroup) I got this in exim4: ` 2021-06-26 19:31:03 1lxE19-0003pA-FQ <= test4-bounces+ladsgroup+mailmanroot=gmail.com@polymorphic.lists.wmcloud.org H=localhost (mailman03.mail... [20:53:06] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.56 [software/spicerack] - 10https://gerrit.wikimedia.org/r/701663 [20:53:28] (03PS1) 10Ladsgroup: labs: Set json for metadata array and split metadata to ES when needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701664 (https://phabricator.wikimedia.org/T275268) [20:56:45] (03CR) 10Ladsgroup: [C: 03+2] "noop for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701664 (https://phabricator.wikimedia.org/T275268) (owner: 10Ladsgroup) [20:57:24] (03Merged) 10jenkins-bot: labs: Set json for metadata array and split metadata to ES when needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701664 (https://phabricator.wikimedia.org/T275268) (owner: 10Ladsgroup) [20:58:04] rebased ^ [20:59:55] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.56 [software/spicerack] - 10https://gerrit.wikimedia.org/r/701663 (owner: 10Volans) [21:05:27] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.56 [software/spicerack] - 10https://gerrit.wikimedia.org/r/701663 (owner: 10Volans) [21:08:28] (03PS1) 10Volans: Upstream release v0.0.56 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/701665 [21:12:48] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.56 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/701665 (owner: 10Volans) [21:18:14] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10faidon) Thank you @jbond for raising this topic! To noone's surprise, o... [21:23:07] !log uploaded spicerack_0.0.56 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [21:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:02] !log upgraded spicerack to v0.0.56 on the cumin hosts (includes only bug fixes for the switchdc) [21:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:02] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10Ladsgroup) I'm honestly lost for know if you want to take a stab at it. [23:21:07] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook