[00:01:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:16] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:34] 10SRE, 10Wikimedia-Mailing-lists: Part of the Mailman interface is appearing in Dutch, even though its set to English - https://phabricator.wikimedia.org/T290903 (10Legoktm) [00:07:57] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: mailman3: Let users choose the UI language - https://phabricator.wikimedia.org/T281747 (10Legoktm) [00:11:40] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Niharika) @sbassett My bad -- sounds like there has been a mix-up of threads here. I was talking about the next steps for getting MaxMi... 
[00:19:40] PROBLEM - SSH on ms-fe2006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:21:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:41:46] (03CR) 10Legoktm: irc: Split long !log lines (033 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/720816 (https://phabricator.wikimedia.org/T285709) (owner: 10Legoktm) [00:41:50] PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The following units failed: session-195901.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:41:51] (03PS3) 10Legoktm: irc: Split long !log lines [software/pywmflib] - 10https://gerrit.wikimedia.org/r/720816 (https://phabricator.wikimedia.org/T285709) [00:45:32] (03CR) 10jerkins-bot: [V: 04-1] irc: Split long !log lines [software/pywmflib] - 10https://gerrit.wikimedia.org/r/720816 (https://phabricator.wikimedia.org/T285709) (owner: 10Legoktm) [00:53:12] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:58:08] RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:03:30] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:09:06] PROBLEM - Check systemd state on ms-be2050 is CRITICAL: CRITICAL - degraded: The following units failed: session-195948.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:12:25] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops: Update MaxMind GeoIP2 
license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10sbassett) >>! In T288844#7350579, @Niharika wrote: > @sbassett My bad -- sounds like there has been a mix-up of threads here. I was tal... [01:14:34] RECOVERY - Check systemd state on ms-be2050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:04] Deploy window Branching MediaWiki, extensions, skins, and vendor – See Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210914T0200) [02:03:05] Whee. [02:03:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:52] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.37.0-wmf.23 [core] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/720834 [02:06:54] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.37.0-wmf.23 [core] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/720834 (owner: 10TrainBranchBot) [02:07:00] !log wmf/1.37.0-wmf.23 was branched at ea72c9b690c2159a12beec2f518b61cc499ed521 for T281164 [02:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:05] T281164: 1.37.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T281164 [02:07:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[02:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:18] RECOVERY - SSH on ms-fe2006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:31:12] (03Merged) 10jenkins-bot: Branch commit for wmf/1.37.0-wmf.23 [core] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/720834 (owner: 10TrainBranchBot) [02:39:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:52] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:00:01] (03PS6) 10Rishabhbhat: Add $wgSitename and $wgMetaNamespace for kswiki and kswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720320 (https://phabricator.wikimedia.org/T289752) [03:21:28] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:28:46] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 37 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:48:28] (03CR) 10Cwhite: alerts: copy 'stat' for alert rules on deploy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/720243 (owner: 10Filippo Giunchedi) 
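Note on reading the parsoid alert text above: "109 gt 100" says the measured value (109) is greater than the critical threshold (100), and "(C)100 gt (W)50 gt 37" says the value (37) is below both the warning threshold (50) and the critical threshold (100). A minimal Python sketch of that comparison follows. The function name `classify` and the WARNING branch are assumptions for illustration only; this is not the code of the actual monitoring check.

```python
def classify(value: float, warning: float = 50, critical: float = 100) -> str:
    """Map a metric value to an alert state (hypothetical helper).

    Thresholds default to the (W)50 / (C)100 values seen in the log.
    A strict "greater than" comparison is assumed from the "gt" wording.
    """
    if value > critical:
        return "CRITICAL"
    if value > warning:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Values taken from the 03:21:28 and 03:28:46 entries.
    for sample in (109, 37):
        print(sample, classify(sample))
```

Under these assumptions, the 03:21:28 reading of 109 maps to CRITICAL and the 03:28:46 reading of 37 maps to OK.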
[04:21:32] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:24:22] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: ms-be1062, ms-be1051, an-web1001, an-worker1096, labstore1006 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [04:25:14] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 423, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:32:24] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 49 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:06:34] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: an-web1001, labstore1006, ms-be1051, an-worker1096, ms-be1062 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [05:14:26] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 106 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:31:16] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 120 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:44:26] PROBLEM - Check systemd state on ms-be2056 
is CRITICAL: CRITICAL - degraded: The following units failed: session-196013.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:46:30] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:59:56] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 121 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:03:26] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: ms-be1051, ms-be1062, an-worker1096, labstore1006, an-web1001 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [06:04:28] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add tests to exercise uses of the php symlink in operations/mediawiki-config [puppet] - 10https://gerrit.wikimedia.org/r/720817 (https://phabricator.wikimedia.org/T285298) (owner: 10Ahmon Dancy) [06:07:36] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 30 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:10:19] from a quick look parsoid was emitting timeouts for a crhwiki URL [06:11:26] elukey: and always the same URL [06:11:35] hola :) [06:11:47] hola caracola [06:12:06] PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The following units failed: session-196039.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:12:44] I am a little concerned about the 
elastic alarms, I pinged the Search team [06:12:59] is it a fallout of yesterday's restarts? [06:13:21] I don't know [06:14:13] going to check [06:16:54] it may be a config issue, I see a lot of warnings in the logs [06:30:13] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add comment for usage of underscores/spaces [software/benchmw] - 10https://gerrit.wikimedia.org/r/720685 (owner: 10Alexandros Kosiaris) [06:34:20] RECOVERY - Check systemd state on ms-be2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:36:16] 10SRE, 10cloud-services-team (Kanban): Puppet on labstore1006 seems to update html files on every run - https://phabricator.wikimedia.org/T290943 (10elukey) [06:39:43] (03PS3) 10Alexandros Kosiaris: Update bench urls and improve url labels [software/benchmw] - 10https://gerrit.wikimedia.org/r/720061 (owner: 10Krinkle) [06:40:15] (03PS4) 10Alexandros Kosiaris: Update bench urls and improve url labels [software/benchmw] - 10https://gerrit.wikimedia.org/r/720061 (owner: 10Krinkle) [06:41:29] (03PS5) 10Marostegui: wmf-config: Wikitech migration from s10 to s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) [06:41:34] (03CR) 10Alexandros Kosiaris: Update bench urls and improve url labels (031 comment) [software/benchmw] - 10https://gerrit.wikimedia.org/r/720061 (owner: 10Krinkle) [06:42:34] (03PS5) 10Alexandros Kosiaris: Update bench urls and improve url labels [software/benchmw] - 10https://gerrit.wikimedia.org/r/720061 (owner: 10Krinkle) [06:43:04] (03PS3) 10Marostegui: Conftool-sections: farewell s10 [puppet] - 10https://gerrit.wikimedia.org/r/708631 (https://phabricator.wikimedia.org/T167973) (owner: 10RhinosF1) [06:51:31] (03CR) 10Muehlenhoff: sre.experimental.reimage: improve unmask message (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/720794 (owner: 10Volans) [06:52:13] (03CR) 10Muehlenhoff: [C: 03+1] 
"Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/720793 (owner: 10Volans) [06:58:43] (03PS2) 10Volans: sre.experimental.reimage: improve unmask message [cookbooks] - 10https://gerrit.wikimedia.org/r/720794 [06:59:04] (03CR) 10Volans: "Addressed comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/720794 (owner: 10Volans) [06:59:54] RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:06:42] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T290764 (10jcrespo) Hey, @cmooney, I edited LDAP wmf group after getting an alarm, "uid\3Djjbk" (unknown user had been added to wmf group). I changed that to "jjbk". Cheers. [07:06:46] 10SRE, 10MW-on-K8s, 10serviceops, 10MW-1.37-notes (1.37.0-wmf.20; 2021-08-23): Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (10Joe) I think it would be interesting to actually do what @TK-999 suggested and actually intercept all HTTP requests and route the... [07:13:49] 10SRE, 10cloud-services-team (Kanban): Puppet on labstore1006 seems to update html files on every run - https://phabricator.wikimedia.org/T290943 (10ArielGlenn) Let me have a look to see what puppet thinks it needs to change. What about labstore1007, the same or are things ok there? [07:18:08] 10SRE, 10cloud-services-team (Kanban): Puppet on labstore1006 seems to update html files on every run - https://phabricator.wikimedia.org/T290943 (10elukey) It seems so yes, but I didn't see it listed in the icinga alert for some reason :( [07:27:52] (03PS1) 10JMeybohm: WIP: istio additions [deployment-charts] - 10https://gerrit.wikimedia.org/r/720906 [07:29:33] 10SRE, 10VPS-project-Codesearch, 10HTTPS: Codesearch main page redirect uses http instead of https - https://phabricator.wikimedia.org/T290819 (10ema) codesearch is not behind our prod CDN, removing #traffic tag. 
[07:34:38] (03PS1) 10DCausse: elasticsearch: Fix cirrus_settings_check [puppet] - 10https://gerrit.wikimedia.org/r/720908 [07:35:35] 10SRE, 10cloud-services-team (Kanban): Puppet on labstore1006 seems to update html files on every run - https://phabricator.wikimedia.org/T290943 (10ArielGlenn) Old versions of these files were being rsynced regularly from dumpsdata1001; this is likely a leftover from when this was the original source. I've re... [07:40:55] (03CR) 10Volans: [C: 04-1] "This implementation will fail if any multi-byte character is around the split point and will get broken in two by the split:" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/720816 (https://phabricator.wikimedia.org/T285709) (owner: 10Legoktm) [07:43:22] (03CR) 10DCausse: "While writing this patch I realized that we do manage the crosscluster config from puppet (c.f. T213150). I can't remember if it was on pu" [puppet] - 10https://gerrit.wikimedia.org/r/720908 (owner: 10DCausse) [07:47:05] (03PS2) 10Filippo Giunchedi: alerts: copy metadata for alert rules on deploy [puppet] - 10https://gerrit.wikimedia.org/r/720243 [07:47:23] (03CR) 10Filippo Giunchedi: alerts: copy metadata for alert rules on deploy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/720243 (owner: 10Filippo Giunchedi) [07:48:45] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, nice job!" 
[alerts] - 10https://gerrit.wikimedia.org/r/720079 (https://phabricator.wikimedia.org/T288726) (owner: 10Cwhite) [07:49:43] (03CR) 10Alexandros Kosiaris: [C: 03+1] clinic-duty: Minor DOM handling clean up [software] - 10https://gerrit.wikimedia.org/r/717653 (owner: 10Krinkle) [07:50:24] !log update acme-chief to version 0.31 on acmechief hosts - T290249 [07:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:30] T290249: Support OCSP stapling from prefetched responses in HAProxy - https://phabricator.wikimedia.org/T290249 [07:50:40] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Tested, I 've generated new data and are working with that now. I 've added some minor aesthetic correction for gnuplot and now LGTM, merg" [software/benchmw] - 10https://gerrit.wikimedia.org/r/720061 (owner: 10Krinkle) [07:52:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one typo inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/720787 (owner: 10Volans) [07:52:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/720794 (owner: 10Volans) [07:53:04] (03PS1) 10Volans: pylint: fix newly reported issues [software/pywmflib] - 10https://gerrit.wikimedia.org/r/720910 [07:53:13] (03PS1) 10Alexandros Kosiaris: Set yrange for percentiles, setting lower limit to 0 [software/benchmw] - 10https://gerrit.wikimedia.org/r/720911 [07:53:15] (03PS1) 10Alexandros Kosiaris: Allow calculating multiple percentiles [software/benchmw] - 10https://gerrit.wikimedia.org/r/720912 [07:54:11] (03PS2) 10Volans: sre.experimental.reimage: check also Icinga status [cookbooks] - 10https://gerrit.wikimedia.org/r/720787 [07:54:13] (03PS2) 10Volans: sre.experimental.reimage: print results to console [cookbooks] - 10https://gerrit.wikimedia.org/r/720793 [07:54:16] (03PS3) 10Volans: sre.experimental.reimage: improve unmask message [cookbooks] - 10https://gerrit.wikimedia.org/r/720794 [07:54:22] (03PS1) 10Ema: rsyslog: config sanity check as systemd override [puppet] - 10https://gerrit.wikimedia.org/r/720913 (https://phabricator.wikimedia.org/T290870) [07:54:24] (03CR) 10Volans: "Addressed comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/720787 (owner: 10Volans) [07:59:53] (03CR) 10Ema: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31057/console" [puppet] - 10https://gerrit.wikimedia.org/r/720913 (https://phabricator.wikimedia.org/T290870) (owner: 10Ema) [08:00:38] (03CR) 10Volans: "This fixes the pylint issues you got on the other CR" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/720910 (owner: 10Volans) [08:03:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/720787 (owner: 10Volans) [08:05:07] !log wipe non-os partitions from ms-be2045 - T290881 [08:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:13] T290881: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 [08:06:36] I am running 
the preliminary train tasks like cloning repos / applying patches etc [08:07:16] (03CR) 10Vgutierrez: [C: 03+2] sslcert: Provide chained TLS cert with private key [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [08:23:10] RECOVERY - Disk space on ms-be2045 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2045&var-datasource=codfw+prometheus/ops [08:24:11] !log train: applied security patches for 1.37.0-wmf.23 # T281164 [08:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:16] T281164: 1.37.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T281164 [08:24:28] (03PS1) 10Vgutierrez: Revert "sslcert: Provide chained TLS cert with private key" [puppet] - 10https://gerrit.wikimedia.org/r/720849 [08:25:04] !log poweroff ms-be2045 and set it as failed in netbox - T290881 [08:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:09] T290881: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 [08:26:05] (03PS1) 10Volans: confctl: fix example code in docstring [software/spicerack] - 10https://gerrit.wikimedia.org/r/720914 [08:26:18] (03CR) 10Vgutierrez: [C: 03+2] Revert "sslcert: Provide chained TLS cert with private key" [puppet] - 10https://gerrit.wikimedia.org/r/720849 (owner: 10Vgutierrez) [08:27:06] (03PS1) 10Hashar: testwikis wikis to 1.37.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720915 [08:27:08] (03CR) 10Hashar: [C: 03+2] testwikis wikis to 1.37.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720915 (owner: 10Hashar) [08:27:55] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720915 (owner: 10Hashar) [08:27:59] !log hashar@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.23 [08:28:01] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2002.codfw.wmnet [08:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:36] PROBLEM - Host ms-be2045 is DOWN: PING CRITICAL - Packet loss = 100% [08:32:17] (03PS1) 10MVernon: codfw-prod: remove host ms-be2045 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/720917 (https://phabricator.wikimedia.org/T290881) [08:35:41] (03CR) 10Filippo Giunchedi: [C: 03+1] codfw-prod: remove host ms-be2045 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/720917 (https://phabricator.wikimedia.org/T290881) (owner: 10MVernon) [08:37:11] (03CR) 10Michael Große: [C: 03+1] Don’t check constraints on two property qualifiers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583407 (https://phabricator.wikimedia.org/T235292) (owner: 10Lucas Werkmeister (WMDE)) [08:39:44] (03PS1) 10Filippo Giunchedi: README.md: mention TARGETS deploy option [software/swift-ring] - 10https://gerrit.wikimedia.org/r/720920 [08:42:21] 10SRE, 10SRE Observability, 10Patch-For-Review: rsyslog service should fail on configuration errors - https://phabricator.wikimedia.org/T290870 (10ema) By going through rsyslog [[https://github.com/rsyslog/rsyslog/issues | upstream bugs ]] I found out about [[ https://github.com/rsyslog/rsyslog/issues/78 |... 
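Note on the 07:40:55 review of "irc: Split long !log lines" (change 720816): the concern raised there is that a split at a fixed byte offset can cut a multi-byte character in two. One way to avoid that, sketched below in Python, is to build chunks character by character and measure each character's UTF-8 length. The function name `split_utf8` and the byte limit in the usage example are hypothetical; this is not the pywmflib implementation.

```python
from typing import List


def split_utf8(text: str, max_bytes: int) -> List[str]:
    """Split text into chunks of at most max_bytes UTF-8 bytes each.

    Chunks are built one character at a time, so a multi-byte character
    is never divided between two chunks (hypothetical helper).
    """
    chunks: List[str] = []
    current = ""
    current_len = 0
    for ch in text:
        ch_len = len(ch.encode("utf-8"))
        if ch_len > max_bytes:
            raise ValueError("max_bytes is smaller than a single character")
        if current_len + ch_len > max_bytes:
            # The next character would overflow the limit: close the chunk.
            chunks.append(current)
            current, current_len = ch, ch_len
        else:
            current += ch
            current_len += ch_len
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # 20 bytes is an arbitrary limit chosen for the demonstration.
    for part in split_utf8("Gergő Tisza: set image recommendation API URL", 20):
        print(len(part.encode("utf-8")), part)
```

This version splits purely on byte budget. Preferring a split at whitespace would be a separate refinement and is not shown here.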
[08:42:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2002.codfw.wmnet [08:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:29] (03CR) 10MVernon: [C: 03+2] codfw-prod: remove host ms-be2045 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/720917 (https://phabricator.wikimedia.org/T290881) (owner: 10MVernon) [08:44:57] (03CR) 10MVernon: [V: 03+2 C: 03+2] codfw-prod: remove host ms-be2045 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/720917 (https://phabricator.wikimedia.org/T290881) (owner: 10MVernon) [08:45:28] (03PS1) 10Ema: rsyslog: abort on unclean config [puppet] - 10https://gerrit.wikimedia.org/r/720921 (https://phabricator.wikimedia.org/T290870) [08:46:43] (03CR) 10Ema: "Alternative to https://gerrit.wikimedia.org/r/c/operations/puppet/+/720913/." [puppet] - 10https://gerrit.wikimedia.org/r/720921 (https://phabricator.wikimedia.org/T290870) (owner: 10Ema) [08:47:11] (03PS2) 10Filippo Giunchedi: README.md: mention TARGETS deploy option [software/swift-ring] - 10https://gerrit.wikimedia.org/r/720920 [08:47:43] !log installing testvm2002 [08:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[08:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:18] (03PS1) 10Vgutierrez: sslcert: Provide chained TLS cert with private key [puppet] - 10https://gerrit.wikimedia.org/r/720924 (https://phabricator.wikimedia.org/T290005) [09:03:00] (03PS14) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) [09:03:02] (03PS12) 10Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) [09:03:20] (03CR) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [09:04:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[09:04:20] (03CR) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [09:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:25] (03PS1) 10Volans: tests: fix typo in test name [software/cumin] - 10https://gerrit.wikimedia.org/r/720926 [09:06:21] (03CR) 10Volans: [C: 03+2] "documentation-only commit, self-merging" [software/spicerack] - 10https://gerrit.wikimedia.org/r/720914 (owner: 10Volans) [09:07:59] (03CR) 10MVernon: [C: 03+1] "LGTM :)" [software/swift-ring] - 10https://gerrit.wikimedia.org/r/720920 (owner: 10Filippo Giunchedi) [09:09:10] !log swift rebalance to remove h/w faulty host ms-be2045 T290881 [09:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:16] T290881: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 [09:09:33] (03CR) 10Filippo Giunchedi: [C: 03+2] README.md: mention TARGETS deploy option [software/swift-ring] - 10https://gerrit.wikimedia.org/r/720920 (owner: 10Filippo Giunchedi) [09:09:36] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] README.md: mention TARGETS deploy option [software/swift-ring] - 10https://gerrit.wikimedia.org/r/720920 (owner: 10Filippo Giunchedi) [09:10:23] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2002.codfw.wmnet [09:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:50] (03CR) 10jerkins-bot: [V: 04-1] tests: fix typo in test name [software/cumin] - 10https://gerrit.wikimedia.org/r/720926 (owner: 10Volans) [09:13:00] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The following units failed: session-196111.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:02] (03Merged) 10jenkins-bot: confctl: fix example code in docstring 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/720914 (owner: 10Volans) [09:14:11] (03PS1) 10Kormat: debian: Fix lintian issues. [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/720927 [09:16:37] (03CR) 10Ema: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31058/console" [puppet] - 10https://gerrit.wikimedia.org/r/720924 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:17:51] (03CR) 10Ema: [V: 03+1 C: 03+1] sslcert: Provide chained TLS cert with private key [puppet] - 10https://gerrit.wikimedia.org/r/720924 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:19:23] (03PS1) 10Vgutierrez: acme_chief,cfssl: Prevent key material from being backuped by puppet [puppet] - 10https://gerrit.wikimedia.org/r/720928 [09:20:12] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:20:53] (03PS1) 10Gergő Tisza: [beta] GrowthExperiments: set image recommendation API URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720929 (https://phabricator.wikimedia.org/T290949) [09:21:00] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31059/console" [puppet] - 10https://gerrit.wikimedia.org/r/720928 (owner: 10Vgutierrez) [09:21:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2002.codfw.wmnet [09:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:16] 10SRE, 10Infrastructure-Foundations: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2002.codfw.wmnet` - testvm2002.codfw.wmnet (**WARN**) - //Host not found on Icinga... 
[09:22:49] (03PS2) 10Volans: tests: fix typo in test name [software/cumin] - 10https://gerrit.wikimedia.org/r/720926 [09:22:51] (03PS1) 10Volans: pylint: remove unnecessary disable comments [software/cumin] - 10https://gerrit.wikimedia.org/r/720930 [09:23:32] (03CR) 10Vgutierrez: [C: 03+2] sslcert: Provide chained TLS cert with private key [puppet] - 10https://gerrit.wikimedia.org/r/720924 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:24:54] (03PS1) 10Kormat: debian: Add lintian overrides [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/720931 [09:26:02] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The following units failed: session-196170.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:19] (03PS1) 10MMandere: pontoon: Add drmrs DC Site [puppet] - 10https://gerrit.wikimedia.org/r/720932 (https://phabricator.wikimedia.org/T282787) [09:26:56] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/720928 (owner: 10Vgutierrez) [09:28:41] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] acme_chief,cfssl: Prevent key material from being backuped by puppet [puppet] - 10https://gerrit.wikimedia.org/r/720928 (owner: 10Vgutierrez) [09:29:45] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2001.codfw.wmnet [09:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:57] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! 
None of this is in production so feel free to merge at will" [puppet] - 10https://gerrit.wikimedia.org/r/720932 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [09:35:39] (03CR) 10Volans: [C: 03+2] "Just comments, self-merging" [software/cumin] - 10https://gerrit.wikimedia.org/r/720930 (owner: 10Volans) [09:35:57] (03CR) 10Volans: [C: 03+2] "Trivial typo in test name, self-merging" [software/cumin] - 10https://gerrit.wikimedia.org/r/720926 (owner: 10Volans) [09:38:38] !log hashar@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.23 (duration: 70m 39s) [09:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:44] (03CR) 10MMandere: pontoon: Add drmrs DC Site (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/720932 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [09:39:49] (03CR) 10MMandere: [C: 03+2] pontoon: Add drmrs DC Site [puppet] - 10https://gerrit.wikimedia.org/r/720932 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [09:40:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2001.codfw.wmnet [09:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:29] 10SRE, 10Infrastructure-Foundations: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2001.codfw.wmnet` - testvm2001.codfw.wmnet (**PASS**) - Downtimed host on Icinga... 
[09:40:41] (03Merged) 10jenkins-bot: pylint: remove unnecessary disable comments [software/cumin] - 10https://gerrit.wikimedia.org/r/720930 (owner: 10Volans) [09:41:02] (03Merged) 10jenkins-bot: tests: fix typo in test name [software/cumin] - 10https://gerrit.wikimedia.org/r/720926 (owner: 10Volans) [09:41:29] (03PS1) 10Vgutierrez: dotls: Benefit from HAProxy support on acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/720936 [09:45:36] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31060/console" [puppet] - 10https://gerrit.wikimedia.org/r/720936 (owner: 10Vgutierrez) [09:46:30] PROBLEM - Check systemd state on dbprov1003 is CRITICAL: CRITICAL - degraded: The following units failed: session-177161.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:08] !log hashar@deploy1002 Pruned MediaWiki: 1.37.0-wmf.19 (duration: 04m 13s) [09:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:55] (03CR) 10Vgutierrez: [V: 03+1] "# ls -alh /etc/acmecerts/dotls-for-authdns/live/ec-prime256v1.chained.crt.key*" [puppet] - 10https://gerrit.wikimedia.org/r/720936 (owner: 10Vgutierrez) [09:49:06] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:52:05] (03PS1) 10MMandere: wmcs::monitoring: Add drmrs DC Site [puppet] - 10https://gerrit.wikimedia.org/r/720939 (https://phabricator.wikimedia.org/T282787) [09:52:27] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T290764 (10cmooney) @jcrespo hey thanks. sorry I must have got the syntax wrong when running the modify-ldap-group. 
Thanks for spotting and fixing :) [09:54:07] (03CR) 10Kosta Harlan: [beta] GrowthExperiments: set image recommendation API URL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720929 (https://phabricator.wikimedia.org/T290949) (owner: 10Gergő Tisza) [09:54:15] (03PS1) 10Muehlenhoff: Disable scope cleanup cron on Thanos backends [puppet] - 10https://gerrit.wikimedia.org/r/720940 (https://phabricator.wikimedia.org/T199911) [09:55:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/720940 (https://phabricator.wikimedia.org/T199911) (owner: 10Muehlenhoff) [09:55:42] (03CR) 10jerkins-bot: [V: 04-1] Disable scope cleanup cron on Thanos backends [puppet] - 10https://gerrit.wikimedia.org/r/720940 (https://phabricator.wikimedia.org/T199911) (owner: 10Muehlenhoff) [09:58:39] (03PS2) 10Muehlenhoff: Disable scope cleanup cron on Thanos backends [puppet] - 10https://gerrit.wikimedia.org/r/720940 (https://phabricator.wikimedia.org/T199911) [09:58:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/720940 (https://phabricator.wikimedia.org/T199911) (owner: 10Muehlenhoff) [10:00:10] (03CR) 10jerkins-bot: [V: 04-1] Disable scope cleanup cron on Thanos backends [puppet] - 10https://gerrit.wikimedia.org/r/720940 (https://phabricator.wikimedia.org/T199911) (owner: 10Muehlenhoff) [10:00:11] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:04:49] (03PS9) 10Vgutierrez: envoyproxy: Allow setting a global lua script [puppet] - 10https://gerrit.wikimedia.org/r/713271 (https://phabricator.wikimedia.org/T271421) [10:05:23] !log hashar@deploy1002 Pruned MediaWiki: 1.37.0-wmf.20 (duration: 01m 48s) [10:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:45] (03PS9) 10Vgutierrez: cache: Use envoy lua API to provide TLS info [puppet] - 
10https://gerrit.wikimedia.org/r/713272 (https://phabricator.wikimedia.org/T271421) [10:06:11] (03PS11) 10Elukey: WIP - kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [10:10:33] (03PS3) 10Muehlenhoff: Disable scope cleanup cron on Thanos backends [puppet] - 10https://gerrit.wikimedia.org/r/720940 (https://phabricator.wikimedia.org/T199911) [10:11:04] (03PS1) 10Elukey: Add ml_services secrets to kubernetes config [labs/private] - 10https://gerrit.wikimedia.org/r/720943 [10:11:27] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add ml_services secrets to kubernetes config [labs/private] - 10https://gerrit.wikimedia.org/r/720943 (owner: 10Elukey) [10:14:43] (03PS1) 10Giuseppe Lavagetto: Make the apple dictionary bridge work in kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720944 (https://phabricator.wikimedia.org/T288848) [10:14:59] (03PS12) 10Elukey: WIP - kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [10:16:24] (03CR) 10Majavah: [C: 04-1] "https://phabricator.wikimedia.org/T289224" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720944 (https://phabricator.wikimedia.org/T288848) (owner: 10Giuseppe Lavagetto) [10:16:37] (03CR) 10jerkins-bot: [V: 04-1] Make the apple dictionary bridge work in kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720944 (https://phabricator.wikimedia.org/T288848) (owner: 10Giuseppe Lavagetto) [10:17:46] (03PS13) 10Elukey: WIP - kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [10:19:38] (03Abandoned) 10Giuseppe Lavagetto: Make the apple dictionary bridge work in kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720944 (https://phabricator.wikimedia.org/T288848) (owner: 
10Giuseppe Lavagetto) [10:19:53] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:25:59] (03CR) 10Elukey: "pcc looks reasonable: https://puppet-compiler.wmflabs.org/compiler1001/31063/" [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [10:26:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/720940 (https://phabricator.wikimedia.org/T199911) (owner: 10Muehlenhoff) [10:30:41] RECOVERY - Check systemd state on dbprov1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2001.codfw.wmnet [10:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:09] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10Joe) >>! In T285232#7349608, @dancy wrote: > Still not working: https://foundation.wikimedia.org/static/current/sk... [10:39:03] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) Last round of urls, same configuration, with the addition of a couple more requests: [[https://gerrit.wikimedia.org/r/c/operat... [10:45:28] 10SRE-swift-storage, 10User-fgiunchedi: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10fgiunchedi) >>! In T288937#7348155, @fgiunchedi wrote: > The other problem I noticed, though not specific to thanos but rather ferm + pontoon, is that `@resolve` calls will fail in WMCS: To cl... 
[10:47:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2001.codfw.wmnet [10:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:20] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) [10:58:55] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210914T1100). [11:00:05] No Gerrit patches in the queue for this window AFAICS. [11:00:53] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:02:36] I have a config patch but it can probably wait until tomorrow [11:04:11] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 44 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:05:21] (03PS10) 10Btullis: Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) [11:05:42] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/720940 (https://phabricator.wikimedia.org/T199911) (owner: 10Muehlenhoff) [11:12:06] (03PS14) 10Elukey: WIP - kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [11:15:29] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31065/console" [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [11:16:14] (03PS2) 10Urbanecm: enwiki: Bump Growth features to 25% (mentorship limited to 20% of those users) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720825 (https://phabricator.wikimedia.org/T290927) [11:17:53] (03PS1) 10Elukey: Add fake admin token to k8s ml-services [labs/private] - 10https://gerrit.wikimedia.org/r/720955 [11:18:07] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake admin token to k8s ml-services [labs/private] - 10https://gerrit.wikimedia.org/r/720955 (owner: 10Elukey) [11:19:40] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31066/console" [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [11:22:05] (03PS11) 10Btullis: Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) [11:22:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [11:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:39] (03PS15) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [11:27:19] (03CR) 10Elukey: kubernetes: add revscoring-editquality in the services configs (032 comments) 
[puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [11:33:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [11:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:37] PROBLEM - Check systemd state on ganeti2026 is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:58] (03CR) 10Elukey: Improve the Kerberos automatic renewal service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [11:43:40] (03CR) 10Btullis: Improve the Kerberos automatic renewal service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [11:45:06] (03PS12) 10Btullis: Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) [11:46:43] PROBLEM - Check systemd state on ms-be2041 is CRITICAL: CRITICAL - degraded: The following units failed: session-196327.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:09] 10SRE, 10Analytics, 10Data-Engineering, 10Growth-Team, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10kostajh) Moving this to #growth-team's triaged column as @Mholloway is working on this; if there is something sp... 
[11:54:01] (03PS1) 10Majavah: P::toolforge::mailrelay: listen on additional ports with tls [puppet] - 10https://gerrit.wikimedia.org/r/720962 [11:55:35] (03CR) 10Arturo Borrero Gonzalez: P::toolforge::mailrelay: listen on additional ports with tls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/720962 (owner: 10Majavah) [11:58:45] (03CR) 10JMeybohm: [C: 03+1] helmfile.d/admin make tiller components configurable per environment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [11:59:37] (03CR) 10Majavah: P::toolforge::mailrelay: listen on additional ports with tls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/720962 (owner: 10Majavah) [12:00:05] Deploy window Pre-DC switch-over deploy freeze (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210914T1200) [12:01:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P::toolforge::mailrelay: listen on additional ports with tls [puppet] - 10https://gerrit.wikimedia.org/r/720962 (owner: 10Majavah) [12:11:14] (03CR) 10Jelto: [C: 03+2] helmfile.d/admin make tiller components configurable per environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [12:15:31] (03Merged) 10jenkins-bot: helmfile.d/admin make tiller components configurable per environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [12:17:37] !log jelto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:42] !log jelto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:00] !log jelto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[12:19:04] !log jelto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [12:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:12] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:16] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:24] RECOVERY - Check systemd state on ms-be2041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:32] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:32:37] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:06] 10SRE, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10MoritzMuehlenhoff) [12:39:31] 10SRE, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10MoritzMuehlenhoff) [12:43:23] (03PS1) 10Giuseppe Lavagetto: mediawiki::packages: remove libvips [puppet] - 10https://gerrit.wikimedia.org/r/720974 (https://phabricator.wikimedia.org/T290759) [12:54:55] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) a:03Joe [12:58:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/720770 (owner: 10Volans) [13:02:35] (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/720976 [13:02:42] (03CR) 10Filippo Giunchedi: "Adding WMCS folks, LGTM though" [puppet] - 10https://gerrit.wikimedia.org/r/720939 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [13:05:57] (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/720980 [13:10:34] (03PS1) 10PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/720981 [13:11:04] (03CR) 10MVernon: [C: 03+1] "LGTM, thanks :)" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/720931 (owner: 10Kormat) [13:11:30] (03CR) 10Kormat: [C: 03+2] debian: Add lintian overrides [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/720931 (owner: 10Kormat) [13:11:34] 10SRE, 10MW-on-K8s, 10serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10JMeybohm) [13:12:30] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: improve conftool support [cookbooks] - 10https://gerrit.wikimedia.org/r/720770 (owner: 10Volans) [13:12:49] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: check also Icinga status [cookbooks] - 10https://gerrit.wikimedia.org/r/720787 (owner: 10Volans) [13:13:03] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: print results to console [cookbooks] - 10https://gerrit.wikimedia.org/r/720793 (owner: 10Volans) [13:13:14] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: improve unmask message [cookbooks] - 10https://gerrit.wikimedia.org/r/720794 (owner: 10Volans) [13:14:15] (03Merged) 10jenkins-bot: debian: Add lintian overrides [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/720931 (owner: 10Kormat) [13:14:21] PROBLEM - Check systemd state on ms-be2056 is CRITICAL: CRITICAL - 
degraded: The following units failed: session-196206.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:24] (03PS1) 10KartikMistry: WIP: Add support for SectionTranslationTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720982 (https://phabricator.wikimedia.org/T290302) [13:14:52] (03PS1) 10Inductiveload: Enable change-tags for new edits' proofread status at mulWS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720983 (https://phabricator.wikimedia.org/T289140) [13:14:58] (03Merged) 10jenkins-bot: sre.experimental.reimage: improve conftool support [cookbooks] - 10https://gerrit.wikimedia.org/r/720770 (owner: 10Volans) [13:15:09] (03Merged) 10jenkins-bot: sre.experimental.reimage: check also Icinga status [cookbooks] - 10https://gerrit.wikimedia.org/r/720787 (owner: 10Volans) [13:15:42] (03Merged) 10jenkins-bot: sre.experimental.reimage: print results to console [cookbooks] - 10https://gerrit.wikimedia.org/r/720793 (owner: 10Volans) [13:15:49] (03Merged) 10jenkins-bot: sre.experimental.reimage: improve unmask message [cookbooks] - 10https://gerrit.wikimedia.org/r/720794 (owner: 10Volans) [13:16:27] PROBLEM - Check systemd state on ms-be2057 is CRITICAL: CRITICAL - degraded: The following units failed: session-78765.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:18:39] (03CR) 10MVernon: [C: 04-1] "Hi," [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/720927 (owner: 10Kormat) [13:21:37] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) Using `fatal-error.php` I determined that at the moment logging to mwlog1001 works, while it seems that we're not able to log to logstash. I strongly suspect thi... 
[13:22:09] RECOVERY - Check systemd state on ms-be2057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:30] (03PS2) 10Inductiveload: Enable change-tags for new edits' proofread status at mulWS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720983 (https://phabricator.wikimedia.org/T289140) [13:24:51] 10SRE, 10SRE Observability, 10Patch-For-Review: rsyslog service should fail on configuration errors - https://phabricator.wikimedia.org/T290870 (10fgiunchedi) Thank you for the investigation and the reviews! I 100% agree that rsyslog config deploys must be safer than they are now. I think either approach wil... [13:25:55] (03PS2) 10Kormat: debian: Fix lintian issues. [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/720927 [13:26:14] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@1ebdca4]: (no justification provided) [13:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:29] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@1ebdca4]: (no justification provided) (duration: 00m 15s) [13:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:34] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: kartotherian: restore v4 maxzoom to z15 [13:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:44] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: kartotherian: restore v4 maxzoom to z15 (duration: 00m 10s) [13:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:04] (03PS3) 10Inductiveload: Enable change-tags for new edits' proofread status at mulWS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720983 (https://phabricator.wikimedia.org/T289140) [13:28:09] (03CR) 10Kormat: debian: Fix lintian issues. 
(031 comment) [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/720927 (owner: 10Kormat) [13:30:58] (03CR) 10Herron: [V: 03+2 C: 03+2] set default slo field values and remove duplicates [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717584 (owner: 10Herron) [13:31:20] (03CR) 10Herron: [V: 03+2 C: 03+2] slo_dashboards: add cluster_label_query and set default [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717587 (https://phabricator.wikimedia.org/T289036) (owner: 10Herron) [13:33:17] RECOVERY - Check systemd state on ms-be2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:44] (03CR) 10MVernon: [C: 03+1] "Cool, thanks" [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/720927 (owner: 10Kormat) [13:42:51] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@79bc0c6]: geoshapes: update table names [13:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:05] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@79bc0c6]: geoshapes: update table names (duration: 00m 14s) [13:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:54] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: kartotherian: restore v4 maxzoom to z15 [13:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:04] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: kartotherian: restore v4 maxzoom to z15 (duration: 00m 10s) [13:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:21] (03PS3) 10Kormat: debian: Fix lintian issues. 
[debs/orchestrator] - 10https://gerrit.wikimedia.org/r/720927 [13:46:01] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Legoktm) @sgrabarczuk, @Trizek-WMF Hi, @AntiCompositeNumber pointed out that the centralnotice is up, but it seems to be pointing to {T289... [13:46:05] o/ the DC switchover is in ~15 minutes [13:46:29] (03CR) 10Kormat: "PTAL" [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/720927 (owner: 10Kormat) [13:47:49] 10SRE, 10Datacenter-Switchover, 10Patch-For-Review, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Redrose64) The notice that has just gone up at English Wikipedia reads ` Technical maintenance wi... [13:49:34] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Ciencia_Al_Poder) I'm seeing the banner right now, and it says the maintenance will be from 6:00 UTC to 6:30 UTC However, banner time is... [13:50:11] 10SRE, 10Datacenter-Switchover, 10Patch-For-Review, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Legoktm) >>! In T287539#7351934, @Redrose64 wrote: > The notice that has just gone up at English Wikipedia reads > Yeah, I think t... [13:51:31] (03CR) 10MVernon: [C: 03+1] debian: Fix lintian issues. (031 comment) [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/720927 (owner: 10Kormat) [13:52:05] 10SRE, 10Traffic, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Use Grizzly for Varnish SLO Grafana dashboard - https://phabricator.wikimedia.org/T289036 (10herron) 05Open→03Resolved >>! In T289036#7323819, @ema wrote: > the `cluster` dropdown should only list `cache_text` and `cache_upload... 
[13:52:12] (03CR) 10Kormat: [V: 03+2 C: 03+2] debian: Fix lintian issues. [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/720927 (owner: 10Kormat) [13:52:48] (03PS1) 10Legoktm: Avoid warning about undefined $wgFileBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720986 [13:53:03] To join DC switch for Mediawiki, use the following command on cumin2002 (as root): tmux attach -rt dc-switch-mw [13:53:12] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Base) NB: I've just corrected the times to 14:00 — ­14:30: https://meta.wikimedia.org/w/index.php?title=MediaWiki:Centralnotice-template-r... [13:53:55] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Base) What was set to correct times was the campaign times, but the campaign used the same banner as from the 6th, and that banner should'... 
[13:56:30] args lgtm 👍 [13:57:54] (03PS13) 10Btullis: Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) [13:59:05] (03PS4) 10Inductiveload: Add IA-Upload tool domains to Commons wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) [13:59:22] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) [14:00:01] I posted a link to the Meet screenshare in _security for non-roots who want to follow along [14:00:05] Deploy window Switch Datacenter - MediaWiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210914T1400) [14:00:11] marostegui: will the s6/wikitech situation work correctly with https://gerrit.wikimedia.org/g/operations/cookbooks/%2B/master/cookbooks/sre/switchdc/mediawiki/03-set-db-readonly.py ? 
[14:00:25] (specifically the bit about checking that the primary has caught up) [14:00:40] and I'm going to quickly deploy this MW train blocker config change so we have less logspam [14:00:51] kormat: probably not if it uses show slave status\G [14:00:54] as that returns nothing [14:00:56] non-roots with security access* [14:01:28] majavah: it gets the heartbeat [14:01:30] IIRC [14:01:35] sorry was for marostegui [14:01:43] kormat: separately - note we didn't merge https://gerrit.wikimedia.org/r/718936 so you'll need to downtime manually as last time [14:01:51] rzl: 👍 [14:01:59] volans: Then it should as there's only one heartbeat for s6 [14:01:59] The other one is from m5 [14:02:18] marostegui: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/spicerack/mysql_legacy.py#90 [14:02:21] that's the query [14:02:33] volans: yes, then it should, no changes on that front [14:02:34] volans: ok great, thanks [14:03:05] kormat: good catch though <3 [14:03:19] (03PS2) 10Legoktm: Avoid warning about undefined $wgFileBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720986 (https://phabricator.wikimedia.org/T290640) [14:03:39] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:04:14] (03CR) 10Legoktm: [C: 03+2] Avoid warning about undefined $wgFileBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720986 (https://phabricator.wikimedia.org/T290640) (owner: 10Legoktm) [14:05:09] (03Merged) 10jenkins-bot: Avoid warning about undefined $wgFileBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720986 (https://phabricator.wikimedia.org/T290640) (owner: 10Legoktm) [14:05:17] jelto: parameters look good fwiw [14:05:51] thanks! So any concerns to proceed with phase 0? 
[14:06:10] jelto: wait for a go-ahead from legoktm after that deploy [14:06:12] I thought we were waiting for legoktm with that patch [14:06:22] give me a minute, I'm testing on mwdebug2001 rn [14:06:31] <_joe_> I would assume phase0 would be ok anyways, but let's not rush it [14:06:32] no reason we couldn't disable puppet in the meantime but best to do one thing at a time [14:06:35] haha [14:07:01] <_joe_> rzl: well the warmup can be impacted [14:07:15] ack [14:08:27] (03PS14) 10Btullis: Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) [14:09:25] syncing [14:10:52] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31069/console" [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [14:10:55] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Avoid warning about undefined $wgFileBlacklist (T290640) (duration: 01m 32s) [14:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:03] T290640: Undefined variable: wgFileBlacklist - https://phabricator.wikimedia.org/T290640 [14:12:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:12:01] I have https://listen.hatnote.com up for audible RO/RW notifications as usual :D [14:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:15] hm [14:12:35] I think something is still broken [14:12:57] I don't see "PDF" on the list of permitted file types anymore https://commons.wikimedia.org/wiki/Special:Upload [14:13:14] * urbanecm neither [14:13:25] hashar: can we rollback wmf.23 from testwikis? 
[14:13:40] * volans takes advantage of the delay to spam here with few CRs [14:13:55] (03PS1) 10Volans: remote: remove RemoteHosts.init_system() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/720992 [14:13:57] (03PS1) 10Volans: remote: add support to enable/disable Cumin output [software/spicerack] - 10https://gerrit.wikimedia.org/r/720993 [14:13:59] (03PS1) 10Volans: dhcp: reduce verbosity of Cumin's output [software/spicerack] - 10https://gerrit.wikimedia.org/r/720994 [14:14:01] (03PS1) 10Volans: icinga: reduce verbosity of Cumin's output [software/spicerack] - 10https://gerrit.wikimedia.org/r/720995 [14:14:03] (03PS1) 10Volans: puppet: reduce verbosity of Cumin's output [software/spicerack] - 10https://gerrit.wikimedia.org/r/720996 [14:14:19] legoktm: FYI we should probably update the maintenance banner again -- our posted window for RO closes in 16 minutes and prep phases will take longer than that [14:14:25] {done} [14:14:34] <_joe_> legoktm: we can proceed even without the rollback though, right? [14:14:49] <_joe_> we just wanted to get rid of the logspam for now [14:15:14] (03CR) 10Elukey: [C: 03+1] pylint: fix newly reported issues [software/pywmflib] - 10https://gerrit.wikimedia.org/r/720910 (owner: 10Volans) [14:15:40] rzl: I can put the banner on again if needed. Was set to end on 14:01 UTC. Just lmk the new end :). [14:15:41] I'm trying to see if I just broke pdf uploads on commons >.> [14:16:00] because my config change was global, but I don't understand how it oculd have affected commons [14:16:04] urbanecm: ack, thanks, stand by [14:16:07] <_joe_> oh ugh [14:17:04] ok, this is insane, but Commons has JS to hide "pdf" from the list of permitted file types [14:17:08] all seems to be fine [14:17:13] ??!?! [14:17:18] !!?!? 
[14:17:23] https://commons.wikimedia.org/wiki/MediaWiki:Upload.js search for "mw-upload-permitted" [14:17:26] that’s super deployer friendly [14:17:29] :// [14:17:30] very weird JS [14:17:34] glad we have safemode=1 [14:17:38] oh, upload.js, not surprising [14:17:57] that script is like half broken anyway [14:18:04] I verified pdf was in the HTML source of the page [14:18:19] so I'm set to go, any other blockers? [14:18:33] legoktm: only the banner question above [14:18:49] RECOVERY - MegaRAID on ms-be1062 is OK: OK: optimal, 25 logical, 25 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:18:56] urbanecm: can you extend the time on the banner? [14:19:00] yes [14:19:02] just let me know how long [14:19:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:29] (previous end time was 14:01 UTC, ftr) [14:20:03] (we'll need to update both the 14:01 end time for displaying it, and the 14:30 end time in the actual message) [14:20:06] why did it end before the maintenance even properly started? [14:20:21] it's fixed time [14:20:24] i can update the message too [14:20:27] urbanecm: can you put it until 15:00 and then we can take it down if we finish earlier? [14:20:33] doing [14:20:44] no other blockers afaict [14:21:25] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Legoktm) We got a delayed start today, so I asked @Urbanecm to extend the banner time to 15:00 UTC. [14:21:32] let's start with the phase 0 them? 
[14:21:33] change saved, should start displaying soon [14:21:42] ty [14:21:54] jelto: +1 from me to start [14:22:04] Ok I'll start with phase 0 [14:22:22] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet [14:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:54] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) [14:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:47] verified puppet is disabled [14:24:07] also note that mwmaint1002 that we're switching back to is now buster, not stretch [14:24:31] ok thanks. Then I reduce the ttl ok? [14:24:38] +1 [14:24:40] +! [14:24:43] +1, too [14:24:44] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [14:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:55] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Trizek-WMF) I changed the hour on the banner. [14:25:35] meh. Reverting trizek, the change is in a wrong direction. [14:26:48] :/ ok, please just keep the task updated [14:27:11] doing [14:28:11] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Urbanecm) >>! In T287546#7352075, @Trizek-WMF wrote: > I changed the hour on the banner. I can't change [[https://meta.wikimedia.org/wiki/... [14:28:52] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1062 - https://phabricator.wikimedia.org/T290416 (10Cmjohnson) 05Open→03Resolved The disk has been replaced and is back online cmjohnson@ms-be1062:~$ sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0 Adapter 0: Created VD 22 Configured physical device a... 
[14:30:23] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [14:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:36] spot-checked a couple, ttls lgtm [14:31:46] same [14:31:54] (03PS15) 10Btullis: Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) [14:32:05] +1 on moving to warmup caches [14:32:11] ok I proceed with warming up tha caches? [14:32:17] ack [14:32:17] +1 [14:32:20] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [14:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:27] we might get eqiad appserver latency alerts, no worries [14:34:19] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10fgiunchedi) [14:34:34] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [14:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:10] looks like about the same warmup timing that we saw in the live test [14:35:54] yeah, is everyone happy with those times? [14:36:05] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31070/console" [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [14:37:23] they seem okish to me [14:38:59] I think we can stop maintenance now then? 
[14:39:09] +1 [14:39:21] I don't see anything on appserver graphs that I'm worried about [14:39:25] +1 to proceed [14:39:30] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [14:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:43] putting in the downtime for db primaries [14:39:58] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) [14:39:58] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 37 hosts with reason: DC switchover [14:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:06] is the no such file or directory error ok? [14:40:24] yes [14:40:25] 10SRE, 10Datacenter-Switchover, 10Patch-For-Review, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Redrose64) Five minutes ago it said "15:00 UTC - 16:00 UTC". It's now reading "14:00 UTC - 15: 00 UTC". I get the impression that the devs... 
[14:40:26] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 37 hosts with reason: DC switchover [14:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:52] (03CR) 10Btullis: [V: 03+1] "The value returned by the admin::kerberos_users function was inverted, such that it included only the system users, rather than the normal" [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [14:41:07] (03CR) 10Cwhite: [C: 03+1] rsyslog: abort on unclean config [puppet] - 10https://gerrit.wikimedia.org/r/720921 (https://phabricator.wikimedia.org/T290870) (owner: 10Ema) [14:41:09] * akosiaris doublechecking mwmaint [14:41:11] I don't see any php processes left on either mwmaint [14:41:20] (just php-fpm, which is expected) [14:41:34] +1 [14:42:07] so, the next steps should all happen in quick succession [14:42:10] I will proceed with 02-07 (including) quickly and will stop if you ping me or I see unexpected errors. Is that fine for everyone? [14:42:17] <_joe_> +1 [14:42:19] and dammit sessions are not in readis [14:42:19] +1 [14:42:21] redis* [14:42:27] we need to change that description text [14:42:34] +1 [14:42:35] <_joe_> akosiaris: but the cluster is still called "redis sessions" [14:42:46] let's discuss later :) [14:42:46] <_joe_> that's where the name comes from :P [14:42:46] isn't that confusing :-) [14:42:48] yeah, confusing, isn't it ? [14:43:01] anyway, one actionable for later I guess [14:43:13] <_joe_> ok, let's go, we have stopped maintenance, that's not without consequences [14:43:18] and somehow I think we 've had that discussion before. Either that or my brain is doing deja vu tricks [14:43:35] yeah 2-7 in quick succession, go go go! 
[14:43:36] I'm proceeding with 02-set-readonly all the way to 07 [14:43:40] +1 [14:43:43] 🚀 [14:43:47] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.02-set-readonly [14:43:48] !log jelto@cumin2002 MediaWiki read-only period starts at: 2021-09-14 14:43:48.272827 [14:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:05] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) [14:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:09] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [14:44:11] edits stopped on listen-to-wikipedia [14:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:19] RO confirmed [14:44:26] confirmed fiwiki [14:44:39] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) [14:44:42] same on eswiki [14:44:43] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [14:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:17] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) [14:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:23] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions [14:45:24] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (exit_code=0) [14:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:31] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite 
[14:45:34] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [14:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:38] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [14:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:41] <_joe_> I see traffic in eqiad [14:45:53] RO message is gone [14:45:54] and i hear edits! [14:45:55] rw confirmed [14:45:57] I can write on eswiki [14:46:03] <_joe_> probelms with the api cluster [14:46:18] the timeouts for siteinfo are not expected [14:46:19] personaly impression- lots of latency (not metric based) [14:46:30] !log jelto@cumin2002 MediaWiki read-only period ends at: 2021-09-14 14:46:30.570035 [14:46:31] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [14:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:36] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:46] volans: agree, I think we've seen them previously though [14:47:04] let me know once it's ok to takeoff the maint banner [14:47:05] it is getting better, again- subjective impression [14:47:10] <_joe_> ok, it's some latency during the switchover, let's see if it settle down [14:47:25] <_joe_> response times are going down [14:47:25] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 723 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:47:25] 10SRE, 10DNS, 10Traffic: Additional DNS 
config for WikiLearn (urgent, AWS gives us 48 hours for verification) - https://phabricator.wikimedia.org/T290974 (10Vgutierrez) a:03Vgutierrez [14:47:51] some kind of monitoring artifact, the RED dashboard reports 4000% 5xx for POSTs in codfw [14:48:02] but that might also mean something is still sending POSTs there [14:48:04] <_joe_> rzl: yeah it's an artifact [14:48:08] (appserver cluster, that is) [14:48:11] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:48:25] The dbs are struggling a bit, but that's expected [14:48:27] <_joe_> can I suggest we run step 8? [14:48:32] exceptions sound to be all around "read only time", ftr [14:48:34] <_joe_> that will reduce the fatals noise [14:48:35] exception.log is just job runner stuff, so we should restart envoy [14:48:42] jelto: ^ [14:48:43] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 249 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:48:53] +1 to restart-envoy [14:49:01] ok restarting envoy [14:49:04] wikidata is having latency issues as expected, checking the dbs to see if we can help with weights [14:49:05] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners [14:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:09] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:49:09] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:49:15] !log jelto@cumin2002 END (FAIL) - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners (exit_code=99) [14:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:31] failed on mw2280 [14:49:38] <_joe_> mw2280 is down [14:49:44] yeah, expected [14:49:46] <_joe_> I don't know why it's still in cumin [14:50:01] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:50:03] was decommissioned? or it's just broken? [14:50:15] if broken stays in puppetdb for 2 weeks if not decomm'ed [14:50:15] broken [14:50:25] cumin just uses A:mw-jobrunner-codfw, it doesn't know anything about that or conftool state [14:50:35] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10dancy) @Joe I propose decoupling the webserver and app images by eliminating all uses of the 'php' symlink in oper... 
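Note on the cookbook lines above: every step is logged as a START line and a matching END line carrying a status and an exit code, so per-step timing and the one failure (08-restart-envoy-on-jobrunners, exit_code=99) can be pulled out mechanically. A minimal sketch, assuming the line layout seen in this log; the regex and the `cookbook_runs` helper are illustrative names and not part of spicerack or any other tool mentioned here.

```python
import re
from datetime import datetime

# Matches the cookbook lines as they appear in this log, e.g.
# "[14:49:15] !log jelto@cumin2002 END (FAIL) - Cookbook sre.x.y (exit_code=99)"
LINE_RE = re.compile(
    r"\[(?P<ts>\d{2}:\d{2}:\d{2})\] !log (?P<who>\S+) "
    r"(?P<kind>START|END)(?: \((?P<status>\w+)\))? - Cookbook (?P<name>\S+)"
    r"(?: \(exit_code=(?P<code>\d+)\))?"
)


def cookbook_runs(lines):
    """Pair START/END lines; return (name, seconds, status, exit_code) tuples."""
    started = {}
    runs = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%H:%M:%S")
        key = (m["who"], m["name"])
        if m["kind"] == "START":
            started[key] = ts
        elif key in started:
            seconds = int((ts - started.pop(key)).total_seconds())
            code = int(m["code"]) if m["code"] is not None else None
            runs.append((m["name"], seconds, m["status"], code))
    return runs


# Four lines copied from this log (same-day timestamps assumed).
SAMPLE = [
    "[14:24:44] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl",
    "[14:30:23] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0)",
    "[14:49:05] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners",
    "[14:49:15] !log jelto@cumin2002 END (FAIL) - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners (exit_code=99)",
]

for name, seconds, status, code in cookbook_runs(SAMPLE):
    print(f"{name}: {seconds}s {status} (exit_code={code})")
```

Keying the pending STARTs on (user@host, cookbook name) keeps overlapping runs by different people apart, such as the sre.hosts.downtime and sre.dns.netbox runs interleaved with the switchover steps.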
[14:51:33] <_joe_> jelto: I'd also start maintenance, or we cause some issues for instance to the wikidata dispatch lag [14:51:38] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1021.eqiad.wmnet ` The log can be found in... [14:51:41] can we continue with maintenance? afaik all other (working) jobrunners are properly done [14:51:44] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T290764 (10iamjessklein) Thanks so much! @jcrespo @cmooney @Aklapper I'm in and I see data dashboards so I believe it's working. [14:51:53] 16:47 let me know once it's ok to takeoff the maint banner <== didn't hear any "ack", making sure someone saw this :) [14:51:55] ok starting maintenance [14:51:57] yeah, start-maintenance sgtm [14:51:59] urbanecm: not yet [14:52:01] ok [14:52:04] +1 to maint [14:52:05] let's not yet [14:52:06] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance [14:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:16] let's not start maintenance please [14:52:23] 10SRE, 10DNS, 10Traffic: Additional DNS config for WikiLearn (urgent, AWS gives us 48 hours for verification) - https://phabricator.wikimedia.org/T290974 (10Vgutierrez) we cannot provide a CNAME record on the apex of a zone, so we cannot fulfill: ` learn.wiki CNAME wkm-prod-alb-30644061.us-east-1.elb.amazona... [14:52:27] <_joe_> marostegui: oh? 
[14:52:36] cookbook is running already [14:52:37] <_joe_> we already did I think :/ [14:52:42] yeah, I see [14:52:48] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3548 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:52:49] we can rerun 01-stop-maintenance if need be [14:52:53] that would do the right thing here [14:53:09] <_joe_> marostegui: do we have dbs suffering? [14:53:14] (03PS1) 10Elukey: WIP - helmfile: add the ability to inject labels to Namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 [14:53:15] yes, s8 [14:53:16] <_joe_> rzl: +1 to that [14:53:21] jelto: ctrl+c [14:53:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Reduce db1109 load', diff saved to https://phabricator.wikimedia.org/P17269 and previous config saved to /var/cache/conftool/dbconfig/20210914-145324-marostegui.json [14:53:25] !log jelto@cumin2002 END (ERROR) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=97) [14:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:34] rzl: don't we need to run it with inverted DCs? [14:53:38] volans: no, it runs on both [14:53:40] just checked [14:53:42] ah right [14:53:53] <_joe_> yup and it makes sense [14:54:00] was checking if it was doing aanything speicific [14:54:02] <_joe_> let's re-run 01-stop-maintenance then [14:54:03] should I run 01-stop-maintenance again to make sure? 
[14:54:07] <_joe_> yeah [14:54:07] jelto: yes please [14:54:08] (once this is all done): marostegui let's talk about if there is anything on wikibase side is needed [14:54:15] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [14:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:25] <_joe_> marostegui: we'll wait for your ack to proceed further. [14:54:29] ack [14:54:43] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) [14:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:11] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1021.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudcephosd1021.eqiad.wmnet'] ` [14:55:15] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The following units failed: session-196309.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Reduce db1109 load', diff saved to https://phabricator.wikimedia.org/P17270 and previous config saved to /var/cache/conftool/dbconfig/20210914-145522-marostegui.json [14:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:07] is it Wikibase\Lib\Store\Sql\Terms\DatabaseTermInLangIdsResolver::selectTermsViaJoin ? 
[14:56:10] (03CR) 10jerkins-bot: [V: 04-1] WIP - helmfile: add the ability to inject labels to Namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 (owner: 10Elukey) [14:56:33] * Lucas_WMDE listens [14:56:34] that's termstore replacement [14:56:42] I think because cache is cold [14:56:48] (memcached for term store) [14:56:52] mine it is just a guess because it is the only query I am seeing [14:57:04] on the heavy hit servers [14:57:31] jelto _joe_ legoktm we are good to go [14:57:35] maybe for the next dc switch we should "warm it up" beforehand [14:57:35] <_joe_> ack [14:57:42] <_joe_> Amir1: yes definitely [14:57:50] I see a few fetchterms too [14:57:52] we can add some Wikidata URLs to the warmup [14:57:58] so proceed with 08-start-maintenance? [14:58:02] <_joe_> jelto: let's start maintenance yes [14:58:03] legoktm: that'd be nice indeed [14:58:05] jelto: yep [14:58:09] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance [14:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:16] if we had proportional weighting of RO traffic on the dns discovery record we could get warmups 'for free' [14:58:20] legoktm: it's a common misconception: most of reads on s8 are not coming from wikidata.org [14:58:22] Some Day(tm) [14:58:30] this needs some parsing [14:59:00] <_joe_> cdanis: if we were able to serve RO traffic from both datacenters... 
[14:59:07] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.08065 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:59:17] Amir1: "URLs that warmup the relevant Wikidata paths :)" [14:59:18] _joe_: Some Day(tm) [14:59:25] better :D [14:59:39] <_joe_> all metrics look great and healthy AFAICS [15:00:12] yeah from an appserver POV I think this is the cleanest switchover we've done [15:00:15] yeah, db land is looking better now [15:00:19] (03PS1) 10MSantos: maps: disable OSM sync maps2009.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/721000 [15:00:23] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) [15:00:24] (03CR) 10Ssingh: [C: 03+1] "I used modules/profile/files/trafficserver/tls.lua as the base comparison." [puppet] - 10https://gerrit.wikimedia.org/r/713272 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [15:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:05] <_joe_> cdanis: when we'll have that ability, we won't need warmups anymore though :P [15:01:50] (03PS2) 10Elukey: WIP - helmfile: add the ability to inject labels to Namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 [15:01:54] parsoid doesn't seem very happy. 10+% of GETs result in a 5xx [15:02:07] https://grafana-rw.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=20&orgId=1&from=now-1h&to=now&refresh=1m&forceLogin=true&var-datasource=eqiad%20prometheus%2Fops&var-cluster=parsoid&var-method=GET&var-code=200 [15:02:11] <_joe_> akosiaris: is that changed from before? 
[15:02:17] and by parsoid I mean parsoid-php [15:02:29] <_joe_> don't trust that graph, something is wrong in that count [15:02:45] yeah, it's about twice and even more what we had in codfw 1h ago [15:02:59] I could of course attribute some of it to the aftermath of the switchover [15:03:06] meaning it will quiet down eventually [15:03:07] <_joe_> no it's not, it was around 40% 1 hour ago in codfw [15:03:23] <_joe_> and again, that percentage is wrong [15:03:36] ah 3h ago it was indeed. [15:03:39] sorry, false alarm [15:03:48] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:52] in exception.log it looks like big pages hitting the 60s timeout [15:03:52] <_joe_> we should run step 09 now [15:03:58] <_joe_> yes [15:04:04] <_joe_> it was happening earlier too [15:04:47] ok, I was waiting for WIkidata's dispatch lag to recover, which I think it just did [15:04:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase db1109 load', diff saved to https://phabricator.wikimedia.org/P17271 and previous config saved to /var/cache/conftool/dbconfig/20210914-150458-marostegui.json [15:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:08] everyone fine with 09-restore-ttl? Or should we investigate some more? [15:05:20] +1 from me [15:05:26] (03CR) 10jerkins-bot: [V: 04-1] WIP - helmfile: add the ability to inject labels to Namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 (owner: 10Elukey) [15:05:30] <_joe_> +1 [15:05:35] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl [15:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:52] jelto: before you run 09-run-puppet-on-db-masters let's get an ack from DBAs [15:05:55] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:05:56] can we disable the banner now? [15:06:09] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0) [15:06:12] it was until 15:01 UTC, so...autodone now [15:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:22] ah, great [15:06:42] <_joe_> jelto: go on with the rest of the remaining tasks [15:06:53] ok proceeding [15:06:53] <_joe_> so that kormat can remove the downtime from the db masters [15:07:04] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters [15:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:41] (03PS1) 10Vgutierrez: learn.wiki: Add additional records [dns] - 10https://gerrit.wikimedia.org/r/721003 (https://phabricator.wikimedia.org/T290974) [15:10:46] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0) [15:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:07] proceeding with 09-update-tendril [15:11:20] <_joe_> +1 [15:11:25] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:27] jelto: sounds good - pcX will fail, I will fix those manually [15:11:28] <_joe_> kormat: I think you can remove the downtime if you want [15:11:36] !log jelto@cumin2002 START - Cookbook sre.switchdc.mediawiki.09-update-tendril [15:11:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:48] _joe_: not with https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=^\(db\|es\|pc\)[12]&style=detail&servicestatustypes=29 looking the way it is right now [15:11:53] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-update-tendril (exit_code=0) [15:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:02] I think the DNS records for the masters need to be updated too [15:12:04] <_joe_> kormat: oh we need to run puppet on icinga [15:12:15] (03CR) 10Vgutierrez: [C: 03+2] learn.wiki: Add additional records [dns] - 10https://gerrit.wikimedia.org/r/721003 (https://phabricator.wikimedia.org/T290974) (owner: 10Vgutierrez) [15:12:36] legoktm: yep, I will take care of that [15:12:40] ty [15:12:50] _joe_: oof, ok. doing. [15:13:10] <_joe_> maybe it should be part of the cookbook :) [15:13:12] tendril looks good, going to fix pc1, pc2 and pc3 manually [15:13:38] _joe_: yeah, we're reworking that whole section including the downtimes, it just didn't land in time for today :) [15:13:50] <_joe_> 14:43:48 -> 14:46:30 [15:14:00] <_joe_> a tad over 2 minutes, pretty ok I'd say [15:15:50] (03CR) 10Jgiannelos: [C: 03+1] maps: disable OSM sync maps2009.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/721000 (owner: 10MSantos) [15:17:38] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: Additional DNS config for WikiLearn (urgent, AWS gives us 48 hours for verification) - https://phabricator.wikimedia.org/T290974 (10Vgutierrez) 05Open→03Resolved ` vgutierrez@carrot:~/wikimedia.org/operations/dns/templates$ host -t CNAME _e8216d92d36158dd219... 
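Note on the "14:43:48 -> 14:46:30" figure above: the exact read-only window can be computed from the two "MediaWiki read-only period starts at / ends at" !log entries. A small sketch using only the timestamps logged earlier; the variable names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the two "MediaWiki read-only period" !log lines.
ro_start = datetime.strptime("2021-09-14 14:43:48.272827", FMT)
ro_end = datetime.strptime("2021-09-14 14:46:30.570035", FMT)

ro_window = ro_end - ro_start
minutes, seconds = divmod(ro_window.total_seconds(), 60)
print(f"read-only for {int(minutes)}m {seconds:.1f}s")  # read-only for 2m 42.3s
```

That puts the window at roughly 2 minutes 42 seconds, in line with the "a tad over 2 minutes" estimate.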
[15:17:47] tendril for parsercache is now fixed [15:18:30] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:19:03] !log kormat@cumin1001 START - Cookbook sre.hosts.remove-downtime for 37 hosts [15:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:16] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 37 hosts [15:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:53] the other two manual steps, updating disc_desired_state.py and noc's debug.json I'll take care of in a bit [15:20:06] other than that, are we all done? [15:20:56] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:21:20] done afaict [15:21:31] (03PS1) 10Marostegui: wmnet: Update db masters aliases [dns] - 10https://gerrit.wikimedia.org/r/721008 (https://phabricator.wikimedia.org/T287539) [15:21:47] kormat: for you with <3 ^ [15:22:47] great work everybody, this was super well done [15:23:22] \o/ [15:23:34] shoutout to jelto for running all the coobooks :) [15:23:52] and thanks to everyone else as well, this really did go pretty smoothly [15:24:04] (03PS7) 10Jgiannelos: Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) [15:24:17] thanks for all the support as well! 
[15:25:00] marostegui: looking, unenthusiastically [15:25:07] <_joe_> yeah it went pretty well, congrats all :) [15:25:19] (03CR) 10Jforrester: "Oops, thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720986 (https://phabricator.wikimedia.org/T290640) (owner: 10Legoktm) [15:25:39] <_joe_> esp the dbas, who get most of the heat in the minutes post-switchover [15:27:01] (03CR) 10Kormat: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/721008 (https://phabricator.wikimedia.org/T287539) (owner: 10Marostegui) [15:27:14] (03CR) 10Marostegui: [C: 03+2] wmnet: Update db masters aliases [dns] - 10https://gerrit.wikimedia.org/r/721008 (https://phabricator.wikimedia.org/T287539) (owner: 10Marostegui) [15:27:28] legoktm: ^ aliases merged and deployed [15:27:47] great, I think we're officially done now :) [15:28:28] congratulations 🙂 [15:28:29] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-9), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10WDoranWMF) Hi @ldelench_wmf, this is scheduled as part of our ops work. It will require some review... [15:28:52] congrats everyone. Nicely done! [15:32:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:06] (03PS1) 10Jforrester: Set wgProhibitedFileExtensions not wgFileBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721011 [15:33:08] (03PS1) 10Jforrester: Rename wmfFileBlacklist to wmgProhibitedFileExtensions part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721012 [15:33:10] (03PS1) 10Jforrester: Rename wmfFileBlacklist to wmgProhibitedFileExtensions part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721013 [15:33:12] (03PS1) 10Jforrester: Rename wmfFileBlacklist to wmgProhibitedFileExtensions part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721014 [15:34:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:44] (03CR) 10Cwhite: [C: 03+2] o11y: add rsyslog alerts [alerts] - 10https://gerrit.wikimedia.org/r/720063 (https://phabricator.wikimedia.org/T288726) (owner: 10Cwhite) [15:38:49] (03PS2) 10DCausse: elasticsearch: Fix cirrus_settings_check [puppet] - 10https://gerrit.wikimedia.org/r/720908 [15:41:07] (03PS1) 10Ssingh: Add durum hosts durum[123]00[12] to BGP anycast in eqiad, codfw, esams [homer/public] - 10https://gerrit.wikimedia.org/r/721018 (https://phabricator.wikimedia.org/T289536) [15:41:23] (03CR) 10Hnowlan: Assume default on single-instance hosts (031 comment) [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/719051 (https://phabricator.wikimedia.org/T178169) (owner: 10Hnowlan) [15:41:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:36] (03PS1) 10Hnowlan: Warn when no instance name is passed. 
[debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/721019 (https://phabricator.wikimedia.org/T178169) [15:42:26] (03CR) 10Jdlrobson: Unset logo config rather than set to false (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson) [15:42:30] (03PS7) 10Jdlrobson: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 [15:42:37] (03PS8) 10Jdlrobson: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 [15:42:39] (03Abandoned) 10Hnowlan: Assume default on single-instance hosts [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/719051 (https://phabricator.wikimedia.org/T178169) (owner: 10Hnowlan) [15:43:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:16] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1010.eqiad.wmnet [15:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:12] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on maps1010.eqiad.wmnet with reason: Resyncing from master [15:53:14] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on maps1010.eqiad.wmnet with reason: Resyncing from master [15:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:19] (03PS1) 10Ssingh: site: update role for durum[123]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/721021 [15:57:10] (03CR) 10Cwhite: [C: 03+2] o11y: add logstash alerts [alerts] - 10https://gerrit.wikimedia.org/r/720079 (https://phabricator.wikimedia.org/T288726) (owner: 10Cwhite) [15:57:14] (03PS5) 10Cwhite: o11y: add logstash alerts 
[alerts] - 10https://gerrit.wikimedia.org/r/720079 (https://phabricator.wikimedia.org/T288726) [16:00:05] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210914T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:32] (03PS2) 10Cwhite: logging: clean up legacy logstash alerts [puppet] - 10https://gerrit.wikimedia.org/r/720093 (https://phabricator.wikimedia.org/T288726) [16:00:51] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/720243 (owner: 10Filippo Giunchedi) [16:01:21] (03PS2) 10Krinkle: Early adopt wgIncludejQueryMigrate=false on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720387 (https://phabricator.wikimedia.org/T280944) [16:01:34] RECOVERY - Check systemd state on an-worker1096 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:14] RECOVERY - Hadoop NodeManager on an-worker1096 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:03:01] (03CR) 10Hnowlan: [C: 03+2] kube_env: Give usage when no arguments are passed [puppet] - 10https://gerrit.wikimedia.org/r/719562 (owner: 10Hnowlan) [16:05:16] (03CR) 10Cwhite: [C: 03+2] logging: clean up legacy logstash alerts [puppet] - 10https://gerrit.wikimedia.org/r/720093 (https://phabricator.wikimedia.org/T288726) (owner: 10Cwhite) [16:05:18] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 118 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:06:57] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1051 - 
https://phabricator.wikimedia.org/T290442 (10Cmjohnson) a:05Cmjohnson→03fgiunchedi @fgiunchedi I replaced the failed disk in slot 11, I am not certain how to add the disk back to the raid with HPs. I am assigning to you to finish and resolve the task. Thanks [16:07:46] 10SRE, 10ops-codfw, 10DBA: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10Marostegui) @Papaul let me know when you want this to be powered off and I will have it ready for you. [16:09:00] (03PS1) 10Ssingh: acme_chief: update authorized_regexes for durum hosts [puppet] - 10https://gerrit.wikimedia.org/r/721022 (https://phabricator.wikimedia.org/T289536) [16:09:50] RECOVERY - HP RAID on ms-be1051 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 3I:3:1, 3I:3:2, 3I:3:3, 3I:3:4, 4I:5:1, 4I:5:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:10:02] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10dancy) @Joe @Krinkle What's the reason php7-fatal-error.php is in /etc/php (via operations/puppet) and not in operations/mediawiki-config ? [16:10:32] (03PS2) 10Ssingh: acme_chief: update authorized_regexes for durum hosts [puppet] - 10https://gerrit.wikimedia.org/r/721022 (https://phabricator.wikimedia.org/T289536) [16:10:35] 10SRE, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1096 - https://phabricator.wikimedia.org/T290805 (10Cmjohnson) 05Open→03Resolved Replaced the disk and added back to the array cmjohnson@an-worker1096:~$ sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0 Adapter 0: Created VD 6 Config... 
[16:10:45] RECOVERY - MegaRAID on an-worker1096 is OK: OK: optimal, 24 logical, 24 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:11:48] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Cmjohnson) I've added these to cloudsw1 cloudcephosd1021 is in ports 3 and 4 and not pxe booting. cloudcephosd1022 is in ports 34 and 25 but has not been set up... [16:11:54] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31071/console" [puppet] - 10https://gerrit.wikimedia.org/r/721022 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [16:12:44] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1051 - https://phabricator.wikimedia.org/T290442 (10fgiunchedi) 05Open→03Resolved Sounds good, thank you @Cmjohnson! I've reenabled the logical drive via `ssacli`: ` root@ms-be1051:~# ssacli Smart Storage Administrator CLI 3.30.13.0 Detecting Controllers...Done.... 
[16:13:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10SRE Observability (FY2021/2022-Q1): decommission icinga1001.wikimedia.org - https://phabricator.wikimedia.org/T279601 (10Cmjohnson) [16:16:23] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:16:26] 10Puppet, 10Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be1051 - https://phabricator.wikimedia.org/T290984 (10fgiunchedi) [16:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:29] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 47 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:19:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10SRE Observability (FY2021/2022-Q1): decommission icinga1001.wikimedia.org - https://phabricator.wikimedia.org/T279601 (10Cmjohnson) [16:20:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:58] (03CR) 10Vgutierrez: [C: 03+1] acme_chief: update authorized_regexes for durum hosts [puppet] - 10https://gerrit.wikimedia.org/r/721022 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [16:24:45] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:13] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): Decommission labpuppetmaster1001 and 1002 - https://phabricator.wikimedia.org/T234462 (10Cmjohnson) [16:27:34] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): Decommission labpuppetmaster1001 and 1002 - https://phabricator.wikimedia.org/T234462 (10Cmjohnson) These servers are well out of warranty, finalized decom 
[16:28:15] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): Decommission labpuppetmaster1001 and 1002 - https://phabricator.wikimedia.org/T234462 (10Cmjohnson) [16:28:26] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): Decommission labpuppetmaster1001 and 1002 - https://phabricator.wikimedia.org/T234462 (10Cmjohnson) netbox updated, removed from rack [16:28:36] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:40] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:52] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Raish (Design Strategy) - https://phabricator.wikimedia.org/T290766 (10Ottomata) @cmooney yes, I think an account in analytics-privatedata-users with no ssh keys makes sense here. Approved. [16:29:35] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): Decommission labpuppetmaster1001 and 1002 - https://phabricator.wikimedia.org/T234462 (10Cmjohnson) 05Open→03Resolved [16:30:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10SRE Observability (FY2021/2022-Q1): decommission icinga1001.wikimedia.org - https://phabricator.wikimedia.org/T279601 (10Cmjohnson) 05Open→03Resolved removed from rack and updated netbox [16:31:04] (03CR) 10Michael Große: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/721023 (https://phabricator.wikimedia.org/T290985) (owner: 10Michael Große) [16:31:32] (03CR) 10jerkins-bot: [V: 04-1] Enable new dispatch mechanism on a few beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721023 (https://phabricator.wikimedia.org/T290985) (owner: 10Michael Große) [16:31:32] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:16] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:56] 10SRE, 10Analytics, 10Data-Engineering, 10Growth-Team, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10Ottomata) > Ottomata, in your last comment you say you're not opposed, however the patch has a -1 on it from you... [16:34:16] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:34:46] 10SRE, 10Datacenter-Switchover, 10Patch-For-Review, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Trizek-WMF) >>! In T287539#7352123, @Redrose64 wrote: > Five minutes ago it said "15:00 UTC - 16:00 UTC". It's now reading "14:00 UTC - 15... 
[16:37:13] (03PS4) 10Bartosz Dziewoński: DiscussionTools: Make 'newtopictool' available to everyone on arwiki and cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710076 (https://phabricator.wikimedia.org/T285724) [16:37:17] (03PS2) 10Bartosz Dziewoński: Offer the DiscussionTools reply tool as opt-out setting at Wikimania wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700675 (https://phabricator.wikimedia.org/T284339) [16:37:27] (03PS2) 10Bartosz Dziewoński: Offer the DiscussionTools reply tool as opt-out setting at ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700677 (https://phabricator.wikimedia.org/T285162) [16:42:12] (03PS4) 10Michael Große: Enable new dispatch mechanism on a few beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721023 (https://phabricator.wikimedia.org/T290985) [16:43:42] (03PS3) 10Bartosz Dziewoński: Offer the DiscussionTools reply tool as opt-out setting at Wikimania wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700675 (https://phabricator.wikimedia.org/T284339) [16:43:44] (03PS3) 10Bartosz Dziewoński: Offer the DiscussionTools reply tool as opt-out setting at ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700677 (https://phabricator.wikimedia.org/T285162) [16:44:37] 10ops-eqiad, 10Analytics, 10DC-Ops, 10Data-Engineering: Q1:(Need By: TBD) rack/setup/install an-presto10[06-15] - https://phabricator.wikimedia.org/T290987 (10RobH) [16:44:53] 10ops-eqiad, 10Analytics, 10DC-Ops, 10Data-Engineering: Q1:(Need By: TBD) rack/setup/install an-presto10[06-15] - https://phabricator.wikimedia.org/T290987 (10RobH) [16:46:54] PROBLEM - very high load average likely xfs on ms-be1062 is CRITICAL: CRITICAL - load average: 141.39, 113.35, 79.59 https://wikitech.wikimedia.org/wiki/Swift [16:47:44] jouncebot: nowandnext [16:47:44] For the next 0 hour(s) and 12 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210914T1600) 
[16:47:45] In 0 hour(s) and 12 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210914T1700) [16:48:07] (03CR) 10Ladsgroup: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721023 (https://phabricator.wikimedia.org/T290985) (owner: 10Michael Große) [16:49:12] (03Merged) 10jenkins-bot: Enable new dispatch mechanism on a few beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721023 (https://phabricator.wikimedia.org/T290985) (owner: 10Michael Große) [16:50:01] rebased on deploy1002 [16:55:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210914T1700). Please do the needful. [17:00:45] 10Puppet, 10Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be1051 - https://phabricator.wikimedia.org/T290984 (10Volans) So, after a quick check this is what I found: * this is happening on `ms-be105[1-9]`, `ms-be205[1-6]` and `relforge100[3-4]` * the issue was introduced... 
[17:07:03] 10SRE, 10MediaWiki-Uploading: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (10Inductiveload) The `DBQueryError` bit sounds like T278389 [17:08:10] (03PS1) 10Jforrester: Alter wgMimeTypeExclusions not wgMimeTypeBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721030 [17:09:54] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Krinkle) @dancy TLDR: It could probably be moved, and I'll ramble a bit about what I currently understand, some of which you know already, and these may or may not be... [17:11:51] (03PS6) 10Hnowlan: cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) [17:11:55] (03PS1) 10Volans: facter: fix lldp_neighbors error on empty lldp [puppet] - 10https://gerrit.wikimedia.org/r/721031 (https://phabricator.wikimedia.org/T290984) [17:12:27] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The following units failed: session-196321.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:43] RECOVERY - very high load average likely xfs on ms-be1062 is OK: OK - load average: 60.33, 67.75, 78.87 https://wikitech.wikimedia.org/wiki/Swift [17:13:43] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: error while resolving custom fact "lldp_neighbors" on ms-be1051 - https://phabricator.wikimedia.org/T290984 (10Volans) p:05Triage→03Medium [17:13:45] (03PS7) 10Hnowlan: cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) [17:21:05] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:13] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 233 probes of 624 (alerts on 
65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:31:57] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 41 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:35:05] (03PS1) 10Jforrester: Check $thumb->isError() before trying to use it [extensions/PageImages] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/720858 (https://phabricator.wikimedia.org/T290973) [17:35:16] (03PS1) 10Jforrester: Check $thumb->isError() before trying to use it [extensions/PageImages] (wmf/1.37.0-wmf.22) - 10https://gerrit.wikimedia.org/r/720859 (https://phabricator.wikimedia.org/T290973) [17:36:12] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: error while resolving custom fact "lldp_neighbors" on ms-be1051 - https://phabricator.wikimedia.org/T290984 (10Volans) All the affected hosts are HP and seems to have a `Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)` network car... [17:37:40] 10SRE, 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, and 4 others: Re-deleting a Commons file: "Error deleting file: The file "mwstore://local-multiwrite/local-deleted/..." is in an inconsistent state within the internal storage backends". - https://phabricator.wikimedia.org/T270994 (10MGA73... 
[17:37:53] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:45:41] !log reimaging mx2001 to bullseye T286911 [17:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:48] T286911: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 [17:50:01] PROBLEM - Exim SMTP on mx2001 is CRITICAL: connect to address 208.80.153.45 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [17:51:03] moritzm: that's likely you ^^ [17:51:15] PROBLEM - Check systemd state on ms-be1062 is CRITICAL: CRITICAL - degraded: The following units failed: session-100595.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:07] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 26 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:52:50] (03CR) 10jerkins-bot: [V: 04-1] Check $thumb->isError() before trying to use it [extensions/PageImages] (wmf/1.37.0-wmf.22) - 10https://gerrit.wikimedia.org/r/720859 (https://phabricator.wikimedia.org/T290973) (owner: 10Jforrester) [17:53:04] urbanecm: thanks, fixing down time [17:53:11] np [17:53:33] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: error while resolving custom fact "lldp_neighbors" on ms-be1051 - https://phabricator.wikimedia.org/T290984 (10cmooney) This may be related to this reported bug. It seems these Intel cards have an on-board LLDP agent, which if enabled cause it to p... 
[17:54:01] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: reimage [17:54:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: reimage [17:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:08] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10Volans) [17:59:24] jouncebot: now [17:59:24] For the next 0 hour(s) and 0 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210914T1700) [18:00:05] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210914T1800). [18:00:05] MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:12] yoooo [18:00:12] I can deploy today! 
[18:00:23] (03PS5) 10Urbanecm: DiscussionTools: Make 'newtopictool' available to everyone on arwiki and cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710076 (https://phabricator.wikimedia.org/T285724) (owner: 10Bartosz Dziewoński) [18:00:25] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 2992 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:00:26] (03CR) 10Urbanecm: [C: 03+2] DiscussionTools: Make 'newtopictool' available to everyone on arwiki and cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710076 (https://phabricator.wikimedia.org/T285724) (owner: 10Bartosz Dziewoński) [18:01:14] (03Merged) 10jenkins-bot: DiscussionTools: Make 'newtopictool' available to everyone on arwiki and cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710076 (https://phabricator.wikimedia.org/T285724) (owner: 10Bartosz Dziewoński) [18:03:41] MatmaRex: sorry, took longer than usual, updating muscle memory to use right servers [18:03:48] available for test at mwdebug1001 [18:04:56] (03PS1) 10Urbanecm: Revert "debug.json: List primary DC servers first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721043 [18:05:06] legoktm: if you're around, please review ^^ :) [18:05:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:27] (03CR) 10Legoktm: [C: 03+1] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721043 (owner: 10Urbanecm) [18:05:32] thanks :) [18:05:35] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:05:38] MatmaRex: let me know how your patch looks like. 
[18:05:45] I hope to automate that by next switchover [18:05:47] looking [18:06:05] legoktm: would be cool -- thanks. [18:06:13] (03CR) 10Urbanecm: [C: 03+2] Revert "debug.json: List primary DC servers first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721043 (owner: 10Urbanecm) [18:06:48] cswiki on mwdebug1001 looks as expected [18:06:55] great! [18:06:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:06:59] (03Merged) 10jenkins-bot: Revert "debug.json: List primary DC servers first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721043 (owner: 10Urbanecm) [18:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:26] (brb) [18:07:37] ack [18:09:01] !log urbanecm@deploy1002 Synchronized debug.json: Idef64e72 (duration: 01m 29s) [18:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:21] https://noc.wikimedia.org/conf/debug.json has expected state [18:10:23] (i'm back) [18:11:14] MatmaRex: ack. You said "cswiki ok" -- do you plan to continue testing with the other wiki too? [18:11:16] or should i sync? 
[18:11:39] oh, i thought you'd sync [18:11:51] okay, syncing :) [18:12:07] i only synced debug.json, to put mwdebug1001 first, as we're now eqiad again :) [18:12:43] arwiki looks good too [18:12:48] great [18:12:49] syncing [18:12:59] (03PS4) 10Urbanecm: Offer the DiscussionTools reply tool as opt-out setting at Wikimania wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700675 (https://phabricator.wikimedia.org/T284339) (owner: 10Bartosz Dziewoński) [18:13:02] (03CR) 10Urbanecm: [C: 03+2] Offer the DiscussionTools reply tool as opt-out setting at Wikimania wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700675 (https://phabricator.wikimedia.org/T284339) (owner: 10Bartosz Dziewoński) [18:13:39] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e36f4d3dcc368f0afbce3649ce72f2135ab1c76f: DiscussionTools: Make newtopictool available to everyone on arwiki and cswiki (T285724) (duration: 01m 04s) [18:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:44] T285724: Deploy config change to make New Discussion Tool available as opt-out at partner wikis - https://phabricator.wikimedia.org/T285724 [18:13:51] (03Merged) 10jenkins-bot: Offer the DiscussionTools reply tool as opt-out setting at Wikimania wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700675 (https://phabricator.wikimedia.org/T284339) (owner: 10Bartosz Dziewoński) [18:14:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:37] MatmaRex: wikimaniawiki is at mwdebug1001, please have a look [18:15:17] seems good [18:15:20] syncing [18:15:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:09] (03PS4) 10Urbanecm: Offer the DiscussionTools reply tool as opt-out setting at ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700677 (https://phabricator.wikimedia.org/T285162) (owner: 10Bartosz Dziewoński) [18:16:11] (03CR) 10Urbanecm: [C: 03+2] Offer the DiscussionTools reply tool as opt-out setting at ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700677 (https://phabricator.wikimedia.org/T285162) (owner: 10Bartosz Dziewoński) [18:16:44] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7f1de32f4b5788e92291a5448563bc61a9f561e2: Offer the DiscussionTools reply tool as opt-out setting at Wikimania wiki (T284339) (duration: 01m 05s) [18:16:47] synced [18:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:50] T284339: Offer the Reply Tool as opt-out setting at Wikimania wiki - https://phabricator.wikimedia.org/T284339 [18:16:57] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10cmooney) We attempted disabling the NICs own LLDP parser by echoing the command and it seems to ha... [18:16:58] (03Merged) 10jenkins-bot: Offer the DiscussionTools reply tool as opt-out setting at ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700677 (https://phabricator.wikimedia.org/T285162) (owner: 10Bartosz Dziewoński) [18:17:54] MatmaRex: and the last patch is at mwdebug1001 as well [18:18:33] looks good! 
[18:18:36] syncing [18:20:00] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 2982638039720107d0b6e3227f5dce5b34ce7533: Offer the DiscussionTools reply tool as opt-out setting at ptwikinews (T285162) (duration: 01m 06s) [18:20:03] and, done [18:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:05] T285162: Add DiscussionTools extension in ptwikinews - https://phabricator.wikimedia.org/T285162 [18:20:06] anything else MatmaRex ? [18:20:22] thanks [18:20:24] np [18:20:42] if that's all... [18:20:47] ...let's see what will break [18:20:49] (03PS3) 10Urbanecm: [beta] Enable CentralAuth on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717511 (https://phabricator.wikimedia.org/T205347) [18:20:57] (03CR) 10Urbanecm: [C: 03+2] "let's see how it will work in beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717511 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [18:21:39] (03Merged) 10jenkins-bot: [beta] Enable CentralAuth on foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717511 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [18:23:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:59] RECOVERY - Exim SMTP on mx2001 is OK: OK - Certificate mx1001.wikimedia.org will expire on Sun 14 Nov 2021 01:37:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [18:24:01] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:24:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:37] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:32:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:40] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Temporarily filter port 25 on mx2001 for reimage" [homer/public] - 10https://gerrit.wikimedia.org/r/720783 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [18:33:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:17] RECOVERY - Check systemd state on ms-be1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:57] !log removed filter for tcp/25 on mx2001, reimage is complete T286911 [18:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:03] T286911: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 [18:58:38] (03CR) 10Legoktm: irc: Split long !log lines (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/720816 (https://phabricator.wikimedia.org/T285709) (owner: 10Legoktm) [19:00:05] hashar and twentyafterfour: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210914T1900). 
[19:02:40] twentyafterfour I am around and sending the promote patch :) [19:02:57] fingers crossed :) [19:03:08] well [19:03:26] I'm really hoping there is no CA breakage :P [19:03:33] at least Iam not attempting to land a plane! [19:03:42] CA? [19:03:49] CentralAuth [19:04:48] hashar: ok, I can keep an eye on things also [19:05:33] (03PS1) 10Hashar: group0 wikis to 1.37.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721046 [19:05:37] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.37.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721046 (owner: 10Hashar) [19:05:55] twentyafterfour: +1 [19:06:31] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721046 (owner: 10Hashar) [19:06:45] there are like 3 risky patches: page protection moved out of Title, SyntaxHighlight running pygments from shellbox (should be easy to test) and bunch of CentralAuth changes [19:06:47] (03PS1) 10Dave Pifke: statsv: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) [19:07:06] + the usual drama of things that go in the wrong direction [19:07:26] (03CR) 10jerkins-bot: [V: 04-1] statsv: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) (owner: 10Dave Pifke) [19:08:02] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.23 [19:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:18] (03PS2) 10Dave Pifke: statsv: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) [19:08:48] (03CR) 10jerkins-bot: [V: 04-1] statsv: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) (owner: 10Dave Pifke) [19:12:33] well that looks quiet [19:13:10] (03PS3) 10Dave Pifke: statsv: add TLS support [puppet] - 
10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) [19:13:24] hashar: maybe you broke logging? :D [19:13:41] (03CR) 10jerkins-bot: [V: 04-1] statsv: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) (owner: 10Dave Pifke) [19:14:46] (03PS4) 10Dave Pifke: statsv: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) [19:16:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:50] majavah, urbanecm, DannyS712, Zabe : I don't have any errors so centralauth might be fine. [19:17:00] sounds great! [19:17:54] there are bunch of metrics at https://grafana.wikimedia.org/d/000000004/authentication-metrics?orgId=1&from=now-1h&to=now which looks like they are correct [19:19:01] we're only on group0 now, right? [19:19:12] PROBLEM - Check systemd state on ms-be1062 is CRITICAL: CRITICAL - degraded: The following units failed: session-100632.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:16] right [19:19:19] (so i wouldn't expect much to break, only handful of people log in there, i think) [19:19:40] jouncebot: nowandnext [19:19:40] For the next 1 hour(s) and 40 minute(s): MediaWiki train - European+American Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210914T1900) [19:19:40] In 3 hour(s) and 40 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210914T2300) [19:19:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:09] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10dancy) @Krinkle Timo, is what I propose above feasible? [19:28:56] (03PS2) 10Jforrester: Set wgProhibitedFileExtensions not wgFileBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721011 (https://phabricator.wikimedia.org/T290640) [19:29:18] one very minor glitch in ProofreadPage but that code got deployed a few weeks ago and is due to mishandling some erroneous user input - https://phabricator.wikimedia.org/T291005 [19:29:20] so not a blocker [19:30:54] 6 "Transaction spent 3.2702443599701 second(s) in writes, exceeding the limit of 3" [19:31:02] over a second, all for the same page on viwiki [19:31:44] I am ignoring that one [19:32:05] so that is all I guess [19:32:20] twentyafterfour: it is all quiet :] [19:32:57] (03PS1) 10Jforrester: tests: suppress API prefix uniqueness check for 'pi' [core] (wmf/1.37.0-wmf.22) - 10https://gerrit.wikimedia.org/r/720862 (https://phabricator.wikimedia.org/T290585) [19:33:50] (03PS2) 10Jforrester: Check $thumb->isError() before trying to use it [extensions/PageImages] (wmf/1.37.0-wmf.22) - 10https://gerrit.wikimedia.org/r/720859 (https://phabricator.wikimedia.org/T290973) [19:35:02] PROBLEM - Check systemd state on ms-be1043 is CRITICAL: CRITICAL - degraded: The following units failed: session-194612.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:38:21] (03PS3) 10Ryan Kemper: trafficserver: Create routing for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/720078 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [19:38:42] James_F: there is no wmf.22 D: [19:40:32] RECOVERY - Check systemd state on ms-be1062 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:00] PROBLEM - Check systemd state on ms-be1062 is CRITICAL: CRITICAL - degraded: The following units failed: session-100645.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:51:21] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10Krinkle) >>! In T285232#7350176, @dancy wrote: > Can someone give me an example of a curl command that exercises t... [19:59:38] 10SRE, 10SRE-OnFire, 10observability, 10User-jbond: statograph_post service fail on alert hosts - https://phabricator.wikimedia.org/T290425 (10CDanis) Because the only data available in the API is the timestamp of the most recent successfully-processed datapoint, this unfortunately left a gap in the data f... [20:01:41] (03PS2) 10Ebernhardson: wcqs: Set admin groups and cluster to match wdqs [puppet] - 10https://gerrit.wikimedia.org/r/719643 [20:06:36] !log T290425 ✔️ cdanis@alert1001.wikimedia.org ~ 🕓🍵 sudo /usr/bin/statograph -c /etc/statograph/config.yml erase_metric_data h5mvbny28713 [20:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:43] T290425: statograph_post service fail on alert hosts - https://phabricator.wikimedia.org/T290425 [20:06:52] !log T290425 ✔️ cdanis@alert1001.wikimedia.org ~ 🕓🍵 sudo /usr/bin/statograph -c /etc/statograph/config.yml erase_metric_data lyfcttm2lhw4 [20:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:38] 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase κ): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10Jdforrester-WMF) [20:09:26] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 127 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:09:47] no [20:12:29] exceptions and fatals seem to come from parsoid [20:12:40] That was acting up earlier today [20:12:44] which seems like "normal" [20:14:44] yeah a spike of [warn] wt2html: listItem limit exceeded from crhwiki [20:15:45] all for the same page [20:15:55] okay, so I quickly deploy a straightforward patch [20:15:55] twentyafterfour: so essentially the train is quiet [20:16:10] https://gerrit.wikimedia.org/r/720387 [20:16:12] so I am off (it is past 10pm here) [20:16:19] (03PS3) 10Ladsgroup: Early adopt wgIncludejQueryMigrate=false on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720387 (https://phabricator.wikimedia.org/T280944) (owner: 10Krinkle) [20:16:22] RECOVERY - Check systemd state on ms-be1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:47] (03CR) 10Ladsgroup: [C: 03+2] Early adopt wgIncludejQueryMigrate=false on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720387 (https://phabricator.wikimedia.org/T280944) (owner: 10Krinkle) [20:18:35] (03Merged) 10jenkins-bot: Early adopt wgIncludejQueryMigrate=false on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720387 (https://phabricator.wikimedia.org/T280944) (owner: 10Krinkle) [20:18:52] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 46 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:20:50] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:720387|Early adopt wgIncludejQueryMigrate=false on nlwiki (T280944)]] (duration: 01m 48s) [20:20:55] !log testing upcoming Scap release on beta [20:20:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:59] T280944: Phase out jQuery Migrate v3 - https://phabricator.wikimedia.org/T280944 [20:21:00] MatmaRex: Oh, hah, fair, it was branched but not deployed. [20:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:39] (03PS1) 10Jforrester: Check $thumb->isError() before trying to use it [extensions/PageImages] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/720865 (https://phabricator.wikimedia.org/T290973) [20:23:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:31] (03Abandoned) 10Jforrester: tests: suppress API prefix uniqueness check for 'pi' [core] (wmf/1.37.0-wmf.22) - 10https://gerrit.wikimedia.org/r/720862 (https://phabricator.wikimedia.org/T290585) (owner: 10Jforrester) [20:26:39] (03Abandoned) 10Jforrester: Check $thumb->isError() before trying to use it [extensions/PageImages] (wmf/1.37.0-wmf.22) - 10https://gerrit.wikimedia.org/r/720859 (https://phabricator.wikimedia.org/T290973) (owner: 10Jforrester) [20:26:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:57] (03CR) 10jerkins-bot: [V: 04-1] Check $thumb->isError() before trying to use it [extensions/PageImages] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/720865 (https://phabricator.wikimedia.org/T290973) (owner: 10Jforrester) [20:40:48] (03PS1) 10Jforrester: tests: suppress API prefix uniqueness check for 'pi' [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/721066 (https://phabricator.wikimedia.org/T290585) [20:41:02] RECOVERY - Check systemd state on ms-be1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:41:04] (03PS2) 10Jforrester: Check $thumb->isError() before trying to use it [extensions/PageImages] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/720865 (https://phabricator.wikimedia.org/T290973) [20:46:12] (03PS3) 10Ryan Kemper: wcqs: Set admin groups and cluster to match wdqs [puppet] - 10https://gerrit.wikimedia.org/r/719643 (owner: 10Ebernhardson) [20:48:10] (03PS42) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [20:48:47] (03PS1) 10Ottomata: statistics::rsync::published - push to analytics-web.discovery.wmnet cname [puppet] - 10https://gerrit.wikimedia.org/r/721061 (https://phabricator.wikimedia.org/T285355) [20:51:01] (03PS1) 10Ottomata: Point trafficserver at analytics-web cname instead of thorium hostname [puppet] - 10https://gerrit.wikimedia.org/r/721062 (https://phabricator.wikimedia.org/T285355) [20:52:05] (03CR) 10Ryan Kemper: [C: 03+1] "Looks good to me, relying on the CNAME should make the cutover process much smoother" [puppet] - 10https://gerrit.wikimedia.org/r/721061 (https://phabricator.wikimedia.org/T285355) (owner: 10Ottomata) [20:55:34] (03CR) 10Ryan Kemper: [C: 03+1] "This seems like it should work fine given that the trafficserver is talking to 
the internal hostname already, but like with all things tra" [puppet] - 10https://gerrit.wikimedia.org/r/721062 (https://phabricator.wikimedia.org/T285355) (owner: 10Ottomata) [21:06:38] (03PS43) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [21:07:48] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:07:52] 10SRE, 10MediaWiki-File-management, 10MediaWiki-extensions-VipsScaler: MediaWiki refuses to generate thumbnail URLs for large PNGs and TIFFs with "Error creating thumbnail: File with dimensions greater than 100 MP" - https://phabricator.wikimedia.org/T291010 (10AntiCompositeNumber) [21:08:31] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31074/console" [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [21:10:00] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/719643 (owner: 10Ebernhardson) [21:13:38] (03CR) 10BBlack: [C: 03+1] "Yes, I think that should work as intended!" 
[puppet] - 10https://gerrit.wikimedia.org/r/721062 (https://phabricator.wikimedia.org/T285355) (owner: 10Ottomata) [21:15:48] 10SRE, 10Commons, 10MediaWiki-File-management, 10MediaWiki-extensions-VipsScaler: MediaWiki refuses to generate thumbnail URLs for large PNGs and TIFFs with "Error creating thumbnail: File with dimensions greater than 100 MP" - https://phabricator.wikimedia.org/T291010 (10AntiCompositeNumber) [21:16:40] (03PS4) 10Ryan Kemper: wcqs: Set admin groups and cluster to match wdqs [puppet] - 10https://gerrit.wikimedia.org/r/719643 (owner: 10Ebernhardson) [21:17:26] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/719643 (owner: 10Ebernhardson) [21:24:34] (03CR) 10Ryan Kemper: [C: 03+2] wcqs: Set admin groups and cluster to match wdqs [puppet] - 10https://gerrit.wikimedia.org/r/719643 (owner: 10Ebernhardson) [21:25:16] (03CR) 10Ryan Kemper: [C: 03+1] wcqs: Set admin groups and cluster to match wdqs [puppet] - 10https://gerrit.wikimedia.org/r/719643 (owner: 10Ebernhardson) [21:26:24] (03PS5) 10Ryan Kemper: wcqs: Set admin groups to match wdqs [puppet] - 10https://gerrit.wikimedia.org/r/719643 (owner: 10Ebernhardson) [21:27:56] (03CR) 10Ryan Kemper: [C: 03+2] wcqs: Set admin groups to match wdqs [puppet] - 10https://gerrit.wikimedia.org/r/719643 (owner: 10Ebernhardson) [21:28:09] (03PS44) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [21:31:17] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31075/console" [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [21:31:26] (03PS1) 10Ebernhardson: Declare cluster for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/721089 [21:31:57] (03CR) 10Btullis: [V: 03+1] "The latest PCC run looks correct, although I cannot check the output 
of the file templates." [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [21:37:01] (03CR) 10Btullis: Install Alluxio to the test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [21:39:58] (03PS1) 10Bstorm: cloud lvm: add back an optional labs_lvm init [puppet] - 10https://gerrit.wikimedia.org/r/721090 (https://phabricator.wikimedia.org/T277078) [21:42:04] hi qchris [21:43:42] (03CR) 10Bstorm: "If we merge this without applying it to any roles, we can easily test it by applying the profile directly to a node in horizon. I suspect " [puppet] - 10https://gerrit.wikimedia.org/r/721090 (https://phabricator.wikimedia.org/T277078) (owner: 10Bstorm) [21:49:35] 10SRE, 10Commons, 10MediaWiki-File-management, 10MediaWiki-extensions-VipsScaler: MediaWiki refuses to generate thumbnail URLs for large PNGs and TIFFs with "Error creating thumbnail: File with dimensions greater than 100 MP" - https://phabricator.wikimedia.org/T291010 (10Legoktm) OK, back to Transformatio... [21:51:32] PROBLEM - Check systemd state on ms-be1062 is CRITICAL: CRITICAL - degraded: The following units failed: session-100695.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:52:33] 10SRE, 10Commons, 10MediaWiki-File-management, 10MediaWiki-extensions-VipsScaler: MediaWiki refuses to generate thumbnail URLs for large PNGs and TIFFs with "Error creating thumbnail: File with dimensions greater than 100 MP" - https://phabricator.wikimedia.org/T291010 (10Legoktm) p:05Triage→03Unbreak!... 
[21:55:26] 10SRE: Remove libvips-tools from mediawiki appservers - https://phabricator.wikimedia.org/T290802 (10Jdforrester-WMF) [21:57:38] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The following units failed: session-196497.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:04:59] (03PS1) 10Urbanecm: Revert "Add throttle rule for Czech wiki course" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721069 [22:05:11] (03PS1) 10Legoktm: Re-add VipsScaler [core] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721092 (https://phabricator.wikimedia.org/T290759) [22:05:22] jouncebot: nowandnext [22:05:22] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [22:05:22] In 0 hour(s) and 54 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210914T2300) [22:05:48] (03CR) 10Jforrester: [C: 03+1] Re-add VipsScaler [core] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721092 (https://phabricator.wikimedia.org/T290759) (owner: 10Legoktm) [22:05:56] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Re-add VipsScaler [core] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721092 (https://phabricator.wikimedia.org/T290759) (owner: 10Legoktm) [22:06:28] will likely need a full scap dance :/ [22:06:48] Will definitely need that. [22:07:07] yeah, it's because of the special page [22:07:09] First ever time we've back-ported an extension's existence into production. [22:07:25] Well, the special page itself doesn't need to work, of course. 
[22:07:38] (03PS1) 10Legoktm: Revert "Undeploy VipsScaler: IV – Don't load the i18n" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721070 [22:08:40] iirc if the special page is registered but i18n is missing, everything blows up [22:08:50] (03CR) 10Legoktm: [C: 03+2] Revert "Undeploy VipsScaler: IV – Don't load the i18n" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721070 (owner: 10Legoktm) [22:09:51] (03Merged) 10jenkins-bot: Revert "Undeploy VipsScaler: IV – Don't load the i18n" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721070 (owner: 10Legoktm) [22:10:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:00] !log legoktm@deploy1002 Started scap: Rebuild i18n for redeployment of VipsScaler (T290759) [22:11:02] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:05] T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759 [22:12:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:13] (03PS1) 10Ryan Kemper: wcqs: refactor scap_target [puppet] - 10https://gerrit.wikimedia.org/r/721094 [22:16:53] (03CR) 10jerkins-bot: [V: 04-1] wcqs: refactor scap_target [puppet] - 10https://gerrit.wikimedia.org/r/721094 (owner: 10Ryan Kemper) [22:29:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[22:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:36] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:30:05] oh, we probably need to flip the httpbb thing now [22:30:38] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:33:25] (03PS1) 10Legoktm: httpbb::hourly_tests: Move to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/721097 [22:34:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:50] !log legoktm@deploy1002 Finished scap: Rebuild i18n for redeployment of VipsScaler (T290759) (duration: 23m 49s) [22:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:54] T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759 [22:35:58] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31078/console" [puppet] - 10https://gerrit.wikimedia.org/r/721097 (owner: 10Legoktm) [22:37:34] (03PS1) 10Legoktm: Revert "Undeploy VipsScaler: I, II & III" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721098 (https://phabricator.wikimedia.org/T290759) [22:39:23] James_F: ^ want to double check? I'll sync IS and then CS. 
But I just squashed all 3 reverts into one for simplicity [22:40:20] RECOVERY - Check systemd state on ms-be1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:01] (03CR) 10Legoktm: [V: 03+2 C: 03+2] httpbb::hourly_tests: Move to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/721097 (owner: 10Legoktm) [22:42:42] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [22:43:01] (03CR) 10Bstorm: [C: 03+2] "After discussing a bit with Andrew, merging this to test directly on a cloud instance using the same flavor config as the integration host" [puppet] - 10https://gerrit.wikimedia.org/r/721090 (https://phabricator.wikimedia.org/T277078) (owner: 10Bstorm) [22:43:50] !log legoktm@cumin2001:~$ sudo systemctl reset-failed # clear httpbb_hourly_tests failure, moved to cumin1001 [22:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:06] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:44:32] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [22:46:47] Hmm. [22:47:45] legoktm: Should be OK. I’d prefer to have done the revert of part I separately. [22:48:17] thanks, and why? [22:48:26] (03CR) 10Legoktm: [C: 03+2] Revert "Undeploy VipsScaler: I, II & III" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721098 (https://phabricator.wikimedia.org/T290759) (owner: 10Legoktm) [22:49:16] So that the load ramp up is slow rather than instant. 
[22:49:17] (03Merged) 10jenkins-bot: Revert "Undeploy VipsScaler: I, II & III" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721098 (https://phabricator.wikimedia.org/T290759) (owner: 10Legoktm) [22:49:39] In case something has changed or is missing and fatals a bunch. [22:50:24] (03CR) 10Ryan Kemper: "we thought we needed to do something like this to get things working, but our problem was actually elsewhere. we still want to do somethin" [puppet] - 10https://gerrit.wikimedia.org/r/721094 (owner: 10Ryan Kemper) [22:50:34] (03Abandoned) 10Ryan Kemper: wcqs: refactor scap_target [puppet] - 10https://gerrit.wikimedia.org/r/721094 (owner: 10Ryan Kemper) [22:53:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:34] on mwdebug1001 [22:55:37] (03PS1) 10Ryan Kemper: wcqs: tell puppet solver we need this vars.yaml [puppet] - 10https://gerrit.wikimedia.org/r/721099 [22:55:48] hmm, AntiComposite, do you know how to purge a thumbnail? 
[22:56:07] (CR) jerkins-bot: [V: -1] wcqs: tell puppet solver we need this vars.yaml [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[22:56:17] try ?action=purge on the file description page
[22:56:45] not working :/
[22:57:01] using https://commons.wikimedia.org/wiki/File:107_DeGroat_Street_(House),_107_DeGroat_Street,_La_Grange,_Troup_County,_GA_HAER_GA,143-LAGR,23-_(sheet_1_of_1).png with mwdebug1001 enabled
[22:57:20] no wait I'm dumb
[22:57:40] (PS2) Ryan Kemper: wcqs: tell puppet solver we need this vars.yaml [puppet] - https://gerrit.wikimedia.org/r/721099
[22:57:56] there we go
[22:58:09] (CR) jerkins-bot: [V: -1] wcqs: tell puppet solver we need this vars.yaml [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[22:58:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[22:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:59:06] (CR) Ryan Kemper: "recheck" [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[22:59:13] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[22:59:39] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Re-enable VipsScaler (1 of 2) (duration: 01m 05s)
[22:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:05] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210914T2300).
[23:00:05] No Gerrit patches in the queue for this window AFAICS.
[23:01:28] (PS3) Ebernhardson: Declare cluster for wcqs [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:01:28] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Re-enable VipsScaler (2 of 2) (duration: 01m 04s)
[23:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:03] (CR) jerkins-bot: [V: -1] Declare cluster for wcqs [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:03:02] SRE, Commons, MediaWiki-File-management, MediaWiki-extensions-VipsScaler: MediaWiki refuses to generate thumbnail URLs for large PNGs and TIFFs with "Error creating thumbnail: File with dimensions greater than 100 MP" - https://phabricator.wikimedia.org/T291010 (Legoktm) Open→Resolved a:...
[23:03:04] (PS4) Ebernhardson: Declare cluster for wcqs [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:03:38] (CR) jerkins-bot: [V: -1] Declare cluster for wcqs [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:04:46] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:05:20] (PS5) Ebernhardson: Declare cluster for wcqs [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:05:56] (CR) jerkins-bot: [V: -1] Declare cluster for wcqs [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:06:08] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:06:33] (PS6) Ebernhardson: wcqs: tell puppet solver we need this vars.yaml [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:07:09] (CR) jerkins-bot: [V: -1] wcqs: tell puppet solver we need this vars.yaml [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:07:44] (PS7) Ebernhardson: wcqs: tell puppet solver we need this vars.yaml [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:08:21] (CR) jerkins-bot: [V: -1] wcqs: tell puppet solver we need this vars.yaml [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:08:44] (PS8) Ebernhardson: wcqs: tell puppet solver we need this vars.yaml [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:09:06] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:09:15] (CR) jerkins-bot: [V: -1] wcqs: tell puppet solver we need this vars.yaml [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:12:26] (PS9) Ebernhardson: wcqs: tell puppet solver we need this vars.yaml [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:13:02] (CR) jerkins-bot: [V: -1] wcqs: tell puppet solver we need this vars.yaml [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:20:55] (CR) Legoktm: [C: +2] mailman: Remove mailman2 config file [puppet] - https://gerrit.wikimedia.org/r/720374 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[23:21:42] PROBLEM - Check systemd state on ms-be1062 is CRITICAL: CRITICAL - degraded: The following units failed: session-100720.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:25:47] (PS10) Ebernhardson: wcqs: tell puppet solver we need this vars.yaml [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:26:49] (CR) Ebernhardson: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:28:14] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[23:34:06] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[23:35:02] (CR) Ryan Kemper: [C: +2] wcqs: tell puppet solver we need this vars.yaml [puppet] - https://gerrit.wikimedia.org/r/721099 (owner: Ryan Kemper)
[23:37:21] SRE, ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (wiki_willy) a:Papaul Just a heads up - Papaul is on paternity leave for a couple weeks, but let me know if this becomes urgent and we need to involve smart hands on anything. Thanks, Willy
[23:40:34] RECOVERY - Check systemd state on ms-be1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:46:27] (PS1) Ebernhardson: create common location for query service vars [puppet] - https://gerrit.wikimedia.org/r/721102
[23:53:24] (PS2) Ebernhardson: create common location for query service vars [puppet] - https://gerrit.wikimedia.org/r/721102
[23:53:59] (CR) jerkins-bot: [V: -1] create common location for query service vars [puppet] - https://gerrit.wikimedia.org/r/721102 (owner: Ebernhardson)
[23:55:48] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01222 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[23:57:59] (PS3) Ebernhardson: create common location for query service vars [puppet] - https://gerrit.wikimedia.org/r/721102