[00:08:56] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[00:09:25] SRE, Thumbor, Wikimedia-SVG-rendering, Patch-For-Review: Replace Liberation 1 fonts with Liberation 2 for svg rendering - https://phabricator.wikimedia.org/T253600 (Legoktm) @AntiCompositeNumber does this merit a #user-notice ahead of time or is it subtle enough that people won't notice?
[00:10:46] 'subtle enough that people won't notice' - but it's Wikipedia (tm) :) next to impossible
[00:11:47] * ryankemper laughs
[00:12:10] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814
[00:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:18] T292814: Service restarts of cloudelastic for Java security updates (Aug 2021) - https://phabricator.wikimedia.org/T292814
[00:13:01] SRE, Thumbor, Wikimedia-SVG-rendering, Patch-For-Review: Replace Liberation 1 fonts with Liberation 2 for svg rendering - https://phabricator.wikimedia.org/T253600 (Legoktm) On buster, I see liberation2 installing: ` /usr/share/fonts/truetype/liberation2/LiberationMono-Bold.ttf /usr/share/fonts/t...
[00:13:17] !log T292814 Write queue stuck at 133 events in partition 1 of topic `codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite`, will try again at another time
[00:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:41] my test case for static-bugzilla content when inside a container: curl localhost:8080/bug1.html 2>/dev/null | grep "teh suck". Why this?
It's Brion Vibber's famous words "Our docs are teh suck. Fix them up." which was true then and will probably be open forever. It's the original description of Bug 1 of original Bugzilla and the date was 2004-08-10. The bug tracker is closed but the bug
[00:20:47] is still status NEW in the dump from it.
[00:22:00] :D
[00:23:52] you know what we ended up doing.. "closed as invalid".. aww :)
[00:23:54] https://phabricator.wikimedia.org/T2001
[00:23:55] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.force-unfreeze
[00:23:55] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.force-unfreeze (exit_code=99)
[00:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:20] but only like 3 subtasks are open :p
[00:25:50] (CR) Legoktm: "The puppet change itself looks fine, I left some questions on the bug about the switch." [puppet] - https://gerrit.wikimedia.org/r/728568 (https://phabricator.wikimedia.org/T253600) (owner: AntiCompositeNumber)
[00:28:09] (PS2) Legoktm: mailman: Drop mailman module and move them to profile::lists [puppet] - https://gerrit.wikimedia.org/r/725436 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[00:28:11] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814
[00:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:17] T292814: Service restarts of cloudelastic for Java security updates (Aug 2021) - https://phabricator.wikimedia.org/T292814
[00:30:12] (CR) Legoktm: [C: +2] "PS2 was a manual rebase."
[puppet] - https://gerrit.wikimedia.org/r/725436 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[00:41:22] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:28] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:01] !log ms-be2045 - started systemd-timedated which had been killed by something
[00:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:10:19] (CR) Dzahn: "this is actually already working:(!) curl -H "Accept-Encoding: gzip" localhost:8080/bug1.html -i shows "Content-Encoding: gzip" and curl" [container/miscweb] - https://gerrit.wikimedia.org/r/698070 (owner: Dzahn)
[01:11:25] (Abandoned) Dzahn: static-bugzilla: add config to serve compressed HTML [container/miscweb] - https://gerrit.wikimedia.org/r/698070 (owner: Dzahn)
[01:12:32] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:23:34] (PS2) Dzahn: static-bugzilla: compress bug HTML with gzip and add 10k more bugs [container/miscweb] - https://gerrit.wikimedia.org/r/728668 (https://phabricator.wikimedia.org/T281538)
[01:32:16] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814
[01:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:22] T292814: Service restarts of cloudelastic for Java security updates (Aug 2021) - https://phabricator.wikimedia.org/T292814
[01:32:53] (PS3) Dzahn:
static-bugzilla: compress bug HTML with gzip and add 10k more bugs [container/miscweb] - https://gerrit.wikimedia.org/r/728668 (https://phabricator.wikimedia.org/T281538)
[01:33:46] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:44:16] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:09:42] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:14:12] SRE, Thumbor, Wikimedia-SVG-rendering, Patch-For-Review, User-notice: Replace Liberation 1 fonts with Liberation 2 for svg rendering - https://phabricator.wikimedia.org/T253600 (AntiCompositeNumber) From https://github.com/liberationfonts/liberation-sans-narrow#current-release: >Note: 2.00.0...
[02:16:02] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:19:04] (PS2) AntiCompositeNumber: mediawiki::packages::fonts: replace fonts-liberation with fonts-liberation2 [puppet] - https://gerrit.wikimedia.org/r/728568 (https://phabricator.wikimedia.org/T253600)
[02:34:16] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:46:20] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:02:32] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:08:52] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:10:58] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:20] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:33:54] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:40:18] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:42:26] RECOVERY - Check systemd state
on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:07:52] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:14:14] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:28] (CR) Effie Mouzeli: [C: +2] mwdebug: bump envoy CPU [deployment-charts] - https://gerrit.wikimedia.org/r/728625 (owner: Effie Mouzeli)
[04:19:53] (Merged) jenkins-bot: mwdebug: bump envoy CPU [deployment-charts] - https://gerrit.wikimedia.org/r/728625 (owner: Effie Mouzeli)
[04:20:06] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:22:14] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:28:56] !log jiji@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[04:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:50:18] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:01:32] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:22] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:10:28] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:14:30] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:24:34] PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:51:49] (CR) Effie Mouzeli: [C: -1] "A few nits" [cookbooks] - https://gerrit.wikimedia.org/r/727605 (owner: Legoktm)
[05:52:58] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:03:12] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:09:14] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:17:18] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:42] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:47:42] RECOVERY - Check systemd state on ms-be2045 is OK:
OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:46] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:04:18] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:26:48] RECOVERY - SSH on thumbor1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:27:28] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:50:48] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:57:12] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:22:46] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:28:38] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:34:38] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:59:58] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:02:08] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:08:30] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:23:24] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:29:48] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:04] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:36:58] Emperor: if you're around, can you downtime ^
[09:40:26] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:55:22] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:10:14] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:23] RhinosF1: will do, sorry
[10:13:58] done (I presume side-effect of running stress)
[10:20:50] Emperor: np, assume so too. It started just after. Have a good weekend.
[10:33:38] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:43:24] (should be in downtime 'til Monday)
[11:06:10] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:08:18] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:16:47] foks: You have a sec? I have a question for you, if you're able to answer. If not, maybe you'll know someone who does.
[16:17:31] Bsadowski1: Maybe you might be able to answer my question?
[16:17:49] I know that Seddon probably could ;-)
[16:19:21] Oh what the hell? It's working now?
[16:19:58] I was unable to log into the developer single sign-on. My "Oshwah" account was denied due to "missing privileges". I tried again just now, and suddenly, it let me in....
[16:20:24] Oshwah: what are you trying to do?
[16:20:34] Oh I get it now...
[16:20:49] majavah: I'm suddenly denied access to log into Grafana.
[16:20:56] I used to be able to log in just fine...
[16:21:56] as far as I remember grafana login (=editing, not browsing) has required membership in the cn=wmf or cn=nda ldap groups, https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups#NDA_group
[16:23:06] Interesting. Maybe I once was in this group, perhaps accidentally, and they fixed it and yanked my access?
[16:23:11] ¯\_(ツ)_/¯
[16:23:37] Yeah you shouldn't be able to log in but shouldn't need to
[16:26:38] Weird. Maybe I'm just slowly going crazy or something... I dunno.
[16:41:51] Oshwah: you only just found that out?
[16:42:35] RhinosF1: HA no, I've known that I'm crazy for a long time. It's the level of craziness that I'm slowly advancing to over time....
[16:42:53] Oshwah: oh definitely
[16:43:09] I'm same too
[16:50:54] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:51:54] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:10:33] (CR) Zabe: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/728767 (https://phabricator.wikimedia.org/T292912) (owner: Rafael)
[18:11:23] (CR) jerkins-bot: [V: -1] add extendedconfimed for autoreview group on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/728767 (https://phabricator.wikimedia.org/T292912) (owner: Rafael)
[18:32:52] (CR) Juan90264: [C: -1] "Fix the code following what Zabe said" [mediawiki-config] - https://gerrit.wikimedia.org/r/728767 (https://phabricator.wikimedia.org/T292912) (owner: Rafael)
[19:00:45] (PS4) Rafael: add extendedconfimed for autoreview group on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/728767 (https://phabricator.wikimedia.org/T292912)
[19:03:33] (CR) Majavah: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/728767 (https://phabricator.wikimedia.org/T292912) (owner: Rafael)
[19:10:04] (CR) Juan90264: [C: +1] "Great, now just schedule the deployment at https://wikitech.wikimedia.org/wiki/Deployments, in the table that contains "backport" making s" [mediawiki-config] - https://gerrit.wikimedia.org/r/728767 (https://phabricator.wikimedia.org/T292912) (owner: Rafael)
[20:15:47] if anyone is around, https://phabricator.wikimedia.org/T292914 looks worrisome. I can't investigate myself.
[20:41:24] hey Nikerabbit, I'm around, but I'm not sure I understand it properly -- the worrisome issue is the "page does not exist" thing?
[20:42:11] (CR) RhinosF1: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/728774 (https://phabricator.wikimedia.org/T292915) (owner: Rafael)
[20:42:19] (CR) RhinosF1: [C: +1] Set autoconfirmedextended and confirmedextended for ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/728774 (https://phabricator.wikimedia.org/T292915) (owner: Rafael)
[20:43:29] (CR) RhinosF1: [C: +1] "Noting that on deploy the old group should be emptied. Test of auto promotion can be done by anyone who meets criteria and should be promo" [mediawiki-config] - https://gerrit.wikimedia.org/r/728774 (https://phabricator.wikimedia.org/T292915) (owner: Rafael)