[00:00:01] Jdlrobson: no problem, thanks for doing the hard work -- I just push the buttons :)
[00:00:05] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T0000).
[00:01:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:15] PROBLEM - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:34:29] PROBLEM - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:36:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 28 Oct 2021 09:00:44 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:36:35] RECOVERY - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 26 Dec 2021 07:00:29 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:38:25] RECOVERY - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 26 Dec 2021 07:00:29 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:38:35] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:44:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 28 Oct 2021 09:00:44 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:46:53] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:48:49] PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:52:43] RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 26 Dec 2021 07:00:29 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:08:29] PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:12:25] RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 26 Dec 2021 07:00:29 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:16:23] PROBLEM - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:16:32] SRE, serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (Legoktm)
[01:20:21] RECOVERY - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 26 Dec 2021 07:00:29 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:32:14] SRE, serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (Legoktm) Open→Resolved All packages listed have been uploaded, I ran core's parser tests and PHPUnit tests against the new PHP 7.4 packages and all passed. Note that for everything, you need to...
[01:37:59] PROBLEM - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:39:57] RECOVERY - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 26 Dec 2021 07:00:29 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:43:59] (CR) Jeena Huneidi: [C: +2] "agreed to merge via slack" [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/732454 (owner: Jeena Huneidi)
[01:44:42] (Merged) jenkins-bot: Update wikiversions-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/732454 (owner: Jeena Huneidi)
[02:30:17] PROBLEM - Check systemd state on dumpsdata1003 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_tmpdumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:34:43] PROBLEM - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:36:47] RECOVERY - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 26 Dec 2021 07:00:29 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:57:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 28 Oct 2021 09:00:44 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:01:27] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:14:07] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 28 Oct 2021 09:00:44 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:16:13] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:30:41] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 28 Oct 2021 09:00:44 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:30:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 28 Oct 2021 09:00:44 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:32:55] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:39:01] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:53:37] PROBLEM - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Search%23Administration
[03:53:45] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 28 Oct 2021 09:00:44 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:57:11] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:57:47] RECOVERY - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 26 Dec 2021 07:00:29 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[03:57:57] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:00:09] PROBLEM - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:01:41] SRE, Discovery-Search, Traffic, observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (Legoktm) p:Low→High
[04:01:47] SRE, Discovery-Search, Traffic, observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (Legoktm) lists.wm.o is serving both the old and new cert, just like the blog post mentioned: {P17556}
[04:04:02] !log restarted apache on lists1001 so it only uses new TLS cert (T293826)
[04:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:04:09] T293826: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826
[04:04:19] RECOVERY - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 26 Dec 2021 07:00:29 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:06:06] SRE, Discovery-Search, Traffic, observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (Legoktm) After restarting apache, and ~50 requests later, I'm only getting the new certificate. Marking as high because the monitoring is pic...
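Editor's note: the T293826 comments above describe making repeated requests to see whether a host still serves both the old and the new certificate. A minimal sketch of that kind of check follows, assuming only Python's standard `ssl` and `socket` modules. The function names (`parse_not_after`, `fetch_not_after`, `summarize`, `sample`) are hypothetical and are not part of any Wikimedia or Icinga tooling; the log does not say how the original check was actually run.

```python
import socket
import ssl
from collections import Counter
from datetime import datetime, timezone


def parse_not_after(cert_time: str) -> datetime:
    """Convert a certificate 'notAfter' string (e.g. 'Oct 27 19:00:23 2021 GMT') to UTC."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(cert_time), tz=timezone.utc)


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> datetime:
    """Open one fresh TLS connection and return the expiry of the certificate served."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return parse_not_after(cert["notAfter"])


def summarize(expiries) -> Counter:
    """Count how often each expiry was observed; more than one key means mixed certs."""
    return Counter(expiries)


def sample(host: str, attempts: int = 50) -> Counter:
    """Hypothetical helper: connect `attempts` times and tally the expiries seen."""
    return summarize(fetch_not_after(host) for _ in range(attempts))
```

If `sample(host)` returns a counter with two distinct expiry dates, some server workers are presumably still holding the old certificate in memory, which would match the flapping PROBLEM/RECOVERY pairs in the log. A single key after a restart or reload suggests only the new certificate is being served.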
[04:16:00] SRE, Discovery-Search, Traffic, observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (Legoktm) Looking in syslog doesn't show much interesting AFAICT, except it seems like puppet/acme-chief is reloading apache every 2-3 days (n...
[04:32:51] (PS1) Marostegui: Revert "db2112: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/732339
[04:37:15] !log Deploy schema change on s6 codfw - T291719
[04:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:37:22] T291719: Remove abuse_filter_log.afl_filter column and adjust schema consequently from Wikimedia production - https://phabricator.wikimedia.org/T291719
[04:41:33] PROBLEM - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:43:33] RECOVERY - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 26 Dec 2021 07:00:29 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:47:14] !log Deploy schema change on s5 codfw - T291719
[04:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:47:20] T291719: Remove abuse_filter_log.afl_filter column and adjust schema consequently from Wikimedia production - https://phabricator.wikimedia.org/T291719
[04:48:23] (PS2) Marostegui: Revert "db2112: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/732339
[05:14:49] PROBLEM - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:18:59] RECOVERY - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 26 Dec 2021 07:00:29 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:19:02] (CR) Marostegui: [C: +2] Revert "db2112: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/732339 (owner: Marostegui)
[05:27:44] (PS1) Marostegui: dbproxy1018: Depool clouddb1020 [puppet] - https://gerrit.wikimedia.org/r/732472 (https://phabricator.wikimedia.org/T290868)
[05:28:56] (CR) Marostegui: [C: +2] dbproxy1018: Depool clouddb1020 [puppet] - https://gerrit.wikimedia.org/r/732472 (https://phabricator.wikimedia.org/T290868) (owner: Marostegui)
[05:32:41] (PS1) Marostegui: Revert "dbproxy1018: Depool clouddb1020" [puppet] - https://gerrit.wikimedia.org/r/732340
[05:33:31] (CR) Marostegui: [C: +2] Revert "dbproxy1018: Depool clouddb1020" [puppet] - https://gerrit.wikimedia.org/r/732340 (owner: Marostegui)
[05:43:11] PROBLEM - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:45:11] RECOVERY - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 26 Dec 2021 07:00:29 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:09:55] (CR) Giuseppe Lavagetto: [C: +2] Handle non dict YAML documents as well [software/service-checker] - https://gerrit.wikimedia.org/r/717249 (owner: Alexandros Kosiaris)
[06:13:11] (Merged) jenkins-bot: Handle non dict YAML documents as well [software/service-checker] - https://gerrit.wikimedia.org/r/717249 (owner: Alexandros Kosiaris)
[06:19:32] PROBLEM - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:22:46] (PS1) Elukey: hive: bump hive server's heap settings to 10g [puppet] - https://gerrit.wikimedia.org/r/732573
[06:23:20] RECOVERY - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 26 Dec 2021 07:00:29 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:29:32] (CR) Giuseppe Lavagetto: [C: +2] "LGTM; merging the chart now, we will need to create a production deployment now." [deployment-charts] - https://gerrit.wikimedia.org/r/726933 (https://phabricator.wikimedia.org/T289224) (owner: Majavah)
[06:34:02] (Merged) jenkins-bot: apple-search: New chart [deployment-charts] - https://gerrit.wikimedia.org/r/726933 (https://phabricator.wikimedia.org/T289224) (owner: Majavah)
[06:34:17] (CR) Giuseppe Lavagetto: [C: +2] "Thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/731917 (https://phabricator.wikimedia.org/T247963) (owner: Filippo Giunchedi)
[06:35:55] !log `systemctl reload nginx` on cloudelastic100[5,6] to pick up the new TLS certificate and clear alerts - T293826
[06:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:03] T293826: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826
[06:36:14] (CR) Ayounsi: "One small ordering issue then it's all good to me!" [homer/public] - https://gerrit.wikimedia.org/r/732321 (https://phabricator.wikimedia.org/T267194) (owner: Majavah)
[06:38:28] (Merged) jenkins-bot: mwdebug: fix statsd network policy [deployment-charts] - https://gerrit.wikimedia.org/r/731917 (https://phabricator.wikimedia.org/T247963) (owner: Filippo Giunchedi)
[06:41:54] (PS3) Majavah: cr-cloud: add tls ports for openstack services [homer/public] - https://gerrit.wikimedia.org/r/732321 (https://phabricator.wikimedia.org/T267194)
[06:42:12] (CR) Majavah: cr-cloud: add tls ports for openstack services (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/732321 (https://phabricator.wikimedia.org/T267194) (owner: Majavah)
[06:45:25] (PS1) Marostegui: Revert "db1118: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/732341
[06:47:13] (CR) Marostegui: [C: +2] Revert "db1118: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/732341 (owner: Marostegui)
[06:48:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17558 and previous config saved to /var/cache/conftool/dbconfig/20211021-064812-root.json
[06:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:49] (PS1) Elukey: Absent the Analytics hdfs-cleaner-gobblin timer [puppet] - https://gerrit.wikimedia.org/r/732610 (https://phabricator.wikimedia.org/T287084)
[06:55:08] (CR) Joal: [C: +1] "LGTM! Thanks Luca" [puppet] - https://gerrit.wikimedia.org/r/732610 (https://phabricator.wikimedia.org/T287084) (owner: Elukey)
[06:55:24] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31788/console" [puppet] - https://gerrit.wikimedia.org/r/732610 (https://phabricator.wikimedia.org/T287084) (owner: Elukey)
[06:55:50] (CR) Elukey: [V: +1 C: +2] Absent the Analytics hdfs-cleaner-gobblin timer [puppet] - https://gerrit.wikimedia.org/r/732610 (https://phabricator.wikimedia.org/T287084) (owner: Elukey)
[07:00:11] (CR) Giuseppe Lavagetto: [C: +2] "LGTM, let's merge this first version and we can iterate on it if needed." [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/723663 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[07:01:45] (Merged) jenkins-bot: Minimal version of the image catalog [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/723663 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[07:03:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17559 and previous config saved to /var/cache/conftool/dbconfig/20211021-070315-root.json
[07:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:27] (CR) Ayounsi: "LGTM." [puppet] - https://gerrit.wikimedia.org/r/732298 (https://phabricator.wikimedia.org/T265435) (owner: Filippo Giunchedi)
[07:07:13] (CR) Ayounsi: [C: +1] "LGTM!" [homer/public] - https://gerrit.wikimedia.org/r/732321 (https://phabricator.wikimedia.org/T267194) (owner: Majavah)
[07:08:20] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia, AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:10:32] (PS4) Majavah: cr-cloud: add tls ports for openstack services [homer/public] - https://gerrit.wikimedia.org/r/732321 (https://phabricator.wikimedia.org/T267194)
[07:18:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17560 and previous config saved to /var/cache/conftool/dbconfig/20211021-071819-root.json
[07:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:33] (PS1) Elukey: tlsproxy::localssl: acme_chief should notify nginx [puppet] - https://gerrit.wikimedia.org/r/732611 (https://phabricator.wikimedia.org/T293826)
[07:24:37] (CR) Elukey: "Hey Valentin! No idea if this is truly right or not, I see the same pattern applied for other services. I am a little unsure if the notify" [puppet] - https://gerrit.wikimedia.org/r/732611 (https://phabricator.wikimedia.org/T293826) (owner: Elukey)
[07:28:04] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 94.79% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[07:29:19] SRE, Discovery-Search, Traffic, observability, Patch-For-Review: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (elukey) Maybe it is totally off, but I saw that the cloudelastic nodes use the `tlsproxy::localssl` define (via `elasti...
[07:31:03] (PS1) Jbond: C:package_builder: drop superfluous dependencies and pre_condition [puppet] - https://gerrit.wikimedia.org/r/732613 (https://phabricator.wikimedia.org/T293912)
[07:31:51] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31789/console" [puppet] - https://gerrit.wikimedia.org/r/732613 (https://phabricator.wikimedia.org/T293912) (owner: Jbond)
[07:32:35] (CR) Jbond: [V: +1 C: +2] C:package_builder: drop superfluous dependencies and pre_condition [puppet] - https://gerrit.wikimedia.org/r/732613 (https://phabricator.wikimedia.org/T293912) (owner: Jbond)
[07:33:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17561 and previous config saved to /var/cache/conftool/dbconfig/20211021-073323-root.json
[07:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:03] (CR) Jbond: [C: -1] pbuilder: test edit for T293912 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732393 (https://phabricator.wikimedia.org/T293912) (owner: Dzahn)
[07:35:16] (CR) Jbond: package_builder: Add hook for building PHP 7.4 packages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732097 (https://phabricator.wikimedia.org/T293449) (owner: Legoktm)
[07:36:02] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: package_builder puppet tests failing - https://phabricator.wikimedia.org/T293912 (jbond) Open→Resolved This was caused by a pre_condition in the rspec test, fixed now
[07:39:23] (CR) Jbond: cumin: add an alias for new pki roles and add to misc-others (2 comments) [puppet] - https://gerrit.wikimedia.org/r/732425 (owner: Dzahn)
[07:41:33] (CR) Jbond: [C: +1] global: remove all "filtertags" lines [puppet] - https://gerrit.wikimedia.org/r/732439 (owner: Dzahn)
[07:48:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17562 and previous config saved to /var/cache/conftool/dbconfig/20211021-074826-root.json
[07:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:07] (CR) Filippo Giunchedi: [C: -1] "I agree it shouldn't have gotten to this point, profile::puppetmaster::pontoon already enforces enable_geoip => false. It might have been " [puppet] - https://gerrit.wikimedia.org/r/732405 (owner: Dzahn)
[07:56:01] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host graphite1004.eqiad.wmnet with OS bullseye
[07:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:14] (PS48) Jbond: P:base: move production specific code to their own profile [puppet] - https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661)
[08:00:26] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:03:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17563 and previous config saved to /var/cache/conftool/dbconfig/20211021-080330-root.json
[08:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:44] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:22] (CR) Filippo Giunchedi: [C: +2] librenms: add snmp v3 dummy section (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732298 (https://phabricator.wikimedia.org/T265435) (owner: Filippo Giunchedi)
[08:19:43] (CR) Elukey: "For context:" [puppet] - https://gerrit.wikimedia.org/r/732611 (https://phabricator.wikimedia.org/T293826) (owner: Elukey)
[08:20:44] (PS3) Elukey: api-gateway: allow HTTP host header rewrite for discovery endpoints [deployment-charts] - https://gerrit.wikimedia.org/r/730966 (https://phabricator.wikimedia.org/T288789)
[08:21:26] (PS4) Elukey: hemlfile.d: add the inference service to api-gateway [deployment-charts] - https://gerrit.wikimedia.org/r/730965 (https://phabricator.wikimedia.org/T288789)
[08:24:23] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host graphite1004.eqiad.wmnet with OS bullseye
[08:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:32] !log cp3062: revert vsl_space experiment T293879
[08:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:37] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879
[08:26:03] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3062.esams.wmnet,service=(varnish-fe|ats-tls)
[08:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:09] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3062.esams.wmnet,service=(varnish-fe|ats-tls)
[08:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:19] (CR) Volans: [C: +2] sre.hosts.reimage: don't fail on new DC [cookbooks] - https://gerrit.wikimedia.org/r/732373 (owner: Volans)
[08:35:31] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (cmooney) @papaul @dzahn I had a go at enumerating the iDrac firmware version on o...
[08:35:59] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/731920 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[08:36:26] (PS1) Ayounsi: Remove GRE tunnel between cr4-ulsfo and cr2-eqsin [homer/public] - https://gerrit.wikimedia.org/r/732616 (https://phabricator.wikimedia.org/T273308)
[08:37:23] (CR) Ayounsi: "Example diff on the ulsfo side:" [homer/public] - https://gerrit.wikimedia.org/r/732616 (https://phabricator.wikimedia.org/T273308) (owner: Ayounsi)
[08:38:16] (Merged) jenkins-bot: sre.hosts.reimage: don't fail on new DC [cookbooks] - https://gerrit.wikimedia.org/r/732373 (owner: Volans)
[08:43:10] (CR) MMandere: [C: +2] resolving: Add drmrs DC site [puppet] - https://gerrit.wikimedia.org/r/731920 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[08:49:15] (PS1) Volans: CHANGELOG: add changelogs for release v1.0.6 [software/spicerack] - https://gerrit.wikimedia.org/r/732617
[08:50:57] jouncebot: nowandnext
[08:50:57] No deployments scheduled for the next 1 hour(s) and 9 minute(s)
[08:50:57] In 1 hour(s) and 9 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T1000)
[08:56:18] !log urbanecm@deploy1002 Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 01m 03s)
[08:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:46] * urbanecm done
[08:57:59] (CR) Btullis: [C: +1] "Yes, seems very sensible. I can confirm the spiky pattern in the heap usage that you mention and have checked that both an-coord100[1-2] r" [puppet] - https://gerrit.wikimedia.org/r/732573 (owner: Elukey)
[09:00:42] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 183): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31790/console" [puppet] - https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: Jbond)
[09:01:25] (CR) Btullis: [C: +2] Remove the alluxio user and group [puppet] - https://gerrit.wikimedia.org/r/732296 (https://phabricator.wikimedia.org/T266641) (owner: Btullis)
[09:14:22] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v1.0.6 [software/spicerack] - https://gerrit.wikimedia.org/r/732617 (owner: Volans)
[09:15:56] (CR) Jbond: [V: +1] "PCC cloud: https://puppet-compiler.wmflabs.org/compiler1003/31791/" [puppet] - https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: Jbond)
[09:16:18] (CR) Jbond: [V: +1] "Follow up actions" [puppet] - https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: Jbond)
[09:17:55] SRE, Observability-Logging, Traffic, User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (ema) By giving a very large amount - 3G instead of the default 80M - of `vsl_space` to cp3062, the issue happens less often but still...
[09:19:57] (PS49) Jbond: P:base: move production specific code to their own profile [puppet] - https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661)
[09:20:14] (CR) Jbond: "PCC beta: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31792/" [puppet] - https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: Jbond)
[09:20:16] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v1.0.6 [software/spicerack] - https://gerrit.wikimedia.org/r/732617 (owner: Volans)
[09:32:02] (CR) Cathal Mooney: [C: +1] "LGTM! Should we maybe hold off a week or so just to make sure the new transport is stable?" [homer/public] - https://gerrit.wikimedia.org/r/732616 (https://phabricator.wikimedia.org/T273308) (owner: Ayounsi)
[09:32:47] (PS1) Majavah: wstypes.python: Revert version check [software/tools-webservice] - https://gerrit.wikimedia.org/r/732619
[09:33:45] (CR) Cathal Mooney: [C: +1] "LGTM!" [homer/public] - https://gerrit.wikimedia.org/r/732346 (https://phabricator.wikimedia.org/T283050) (owner: Ayounsi)
[09:33:47] (CR) Majavah: [C: +2] wstypes.python: Revert version check [software/tools-webservice] - https://gerrit.wikimedia.org/r/732619 (owner: Majavah)
[09:34:32] (Merged) jenkins-bot: wstypes.python: Revert version check [software/tools-webservice] - https://gerrit.wikimedia.org/r/732619 (owner: Majavah)
[09:35:22] (PS1) Majavah: d/changelog: Prepare for 0.79 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/732620
[09:36:06] (CR) Majavah: [C: +2] d/changelog: Prepare for 0.79 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/732620 (owner: Majavah)
[09:37:26] (Merged) jenkins-bot: d/changelog: Prepare for 0.79 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/732620 (owner: Majavah)
[09:38:36] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[09:38:55] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM, thanks for working on this!" [puppet] - https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: Jbond)
[09:42:40] (CR) Arturo Borrero Gonzalez: [C: +2] cr-cloud: add tls ports for openstack services [homer/public] - https://gerrit.wikimedia.org/r/732321 (https://phabricator.wikimedia.org/T267194) (owner: Majavah)
[09:48:20] (PS1) Filippo Giunchedi: fail on yml files [alerts] - https://gerrit.wikimedia.org/r/732621
[09:48:47] (PS1) Giuseppe Lavagetto: admin: temporarily revoke ladsgroup's ssh key for production [puppet] - https://gerrit.wikimedia.org/r/732622
[09:49:36] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[09:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:32] (CR) jerkins-bot: [V: -1] fail on yml files [alerts] - https://gerrit.wikimedia.org/r/732621 (owner: Filippo Giunchedi)
[09:51:18] <_joe_> the one time -1 is what you hope for :P
[09:52:14] hehehe indeed
[09:54:14] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:24] (CR) Giuseppe Lavagetto: [C: +2] admin: temporarily revoke ladsgroup's ssh key for production [puppet] - https://gerrit.wikimedia.org/r/732622 (owner: Giuseppe Lavagetto)
[09:55:26] (PS8) Jgiannelos: Configure event stream for map tiles state change [mediawiki-config] - https://gerrit.wikimedia.org/r/730848 (https://phabricator.wikimedia.org/T289771)
[09:56:36] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[09:57:23] (PS50) Jbond: P:base: move production specific code to their own profile [puppet] - https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661)
[09:58:19] (PS51) Jbond: P:base: move production specific code to their own profile [puppet]
- 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [09:59:35] (03PS4) 10Elukey: api-gateway: generalize pathing_map [deployment-charts] - 10https://gerrit.wikimedia.org/r/730966 (https://phabricator.wikimedia.org/T288789) [10:00:05] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T1000). [10:00:28] (03PS1) 10Btullis: Correct the team-data-engineering file names [alerts] - 10https://gerrit.wikimedia.org/r/732623 (https://phabricator.wikimedia.org/T293399) [10:02:30] (03CR) 10jerkins-bot: [V: 04-1] Correct the team-data-engineering file names [alerts] - 10https://gerrit.wikimedia.org/r/732623 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [10:03:49] (03PS1) 10Arturo Borrero Gonzalez: cloud: drop NAT exceptions (dmz_cidr) for wiki-replicas [puppet] - 10https://gerrit.wikimedia.org/r/732628 (https://phabricator.wikimedia.org/T293897) [10:09:47] (03PS1) 10Arturo Borrero Gonzalez: cr-cloud/ drop firewall exception for wiki-replicas [homer/public] - 10https://gerrit.wikimedia.org/r/732633 (https://phabricator.wikimedia.org/T293897) [10:11:09] (03PS2) 10Arturo Borrero Gonzalez: cr-cloud/ drop firewall exception for wiki-replicas [homer/public] - 10https://gerrit.wikimedia.org/r/732633 (https://phabricator.wikimedia.org/T293897) [10:11:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: drop NAT exceptions (dmz_cidr) for wiki-replicas [puppet] - 10https://gerrit.wikimedia.org/r/732628 (https://phabricator.wikimedia.org/T293897) (owner: 10Arturo Borrero Gonzalez) [10:14:02] (03CR) 10Ayounsi: [C: 03+1] cr-cloud/ drop firewall exception for wiki-replicas [homer/public] - 10https://gerrit.wikimedia.org/r/732633 (https://phabricator.wikimedia.org/T293897) (owner: 10Arturo Borrero Gonzalez) [10:14:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cr-cloud/ drop firewall exception for wiki-replicas [homer/public] - 
10https://gerrit.wikimedia.org/r/732633 (https://phabricator.wikimedia.org/T293897) (owner: 10Arturo Borrero Gonzalez) [10:14:32] !log merging refactor of P:base Gerrit:714975 [10:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:38] (03CR) 10Jbond: P:base: move production specific code to their own profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:14:41] (03CR) 10Jbond: [C: 03+2] P:base: move production specific code to their own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:19:02] (03CR) 10Jbond: [C: 03+2] P:base: move production specific code to their own profile (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:19:08] (03PS5) 10Elukey: api-gateway: generalize pathing_map [deployment-charts] - 10https://gerrit.wikimedia.org/r/730966 (https://phabricator.wikimedia.org/T288789) [10:20:18] (03CR) 10Elukey: "Hugh: tried to generalize pathing_map, lemme know your thoughts!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/730966 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [10:20:58] (03CR) 10Filippo Giunchedi: [C: 03+1] fail on yml files [alerts] - 10https://gerrit.wikimedia.org/r/732621 (owner: 10Filippo Giunchedi) [10:21:03] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] fail on yml files [alerts] - 10https://gerrit.wikimedia.org/r/732621 (owner: 10Filippo Giunchedi) [10:29:03] jouncebot: nowandnext [10:29:03] For the next 0 hour(s) and 30 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T1000) [10:29:04] In 0 hour(s) and 30 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T1100) [10:29:15] ohh, a training? 
[10:29:32] <_joe_> that's some euphemism [10:29:50] <_joe_> "bootcamp" or "torture" would be more descriptive of the process [10:30:52] PROBLEM - Disk space on ms-be2028 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdc1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2028&var-datasource=codfw+prometheus/ops [10:31:41] (03PS1) 10MMandere: prometheus: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/732635 (https://phabricator.wikimedia.org/T282787) [10:40:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ayounsi) a:05ayounsi→03aborrero Thanks for the doc, some follow up questions to make sure I understand it properly. > However,... [10:41:46] PROBLEM - HP RAID on ms-be2028 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 1I:1:5 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [10:41:48] ACKNOWLEDGEMENT - HP RAID on ms-be2028 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 1I:1:5 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T294001 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [10:41:58] 10SRE, 10ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T294001 (10ops-monitoring-bot) [10:43:01] (03PS1) 10Jbond: P:base::production: update production roles to use P::base::production [puppet] - 10https://gerrit.wikimedia.org/r/732638 
(https://phabricator.wikimedia.org/T289661) [10:43:03] (03PS1) 10Jbond: P:base::labs: update labs base profile to include profile::base [puppet] - 10https://gerrit.wikimedia.org/r/732639 (https://phabricator.wikimedia.org/T289661) [10:43:29] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/732638 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:43:40] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/732639 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:45:36] (03PS1) 10Jbond: P:idp::standalon: remove unused profile [puppet] - 10https://gerrit.wikimedia.org/r/732640 [10:46:48] (03CR) 10Jbond: [C: 03+2] P:idp::standalon: remove unused profile [puppet] - 10https://gerrit.wikimedia.org/r/732640 (owner: 10Jbond) [10:47:57] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [10:47:58] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [10:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:10] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [10:48:13] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [10:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:44] PROBLEM - Device not healthy -SMART- on ms-be2028 is CRITICAL: cluster=swift device=None instance=ms-be2028 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2028&var-datasource=codfw+prometheus/ops [10:58:06] (03PS2) 10Jbond: P:base::production: update production roles to use P::base::production [puppet] - 10https://gerrit.wikimedia.org/r/732638 (https://phabricator.wikimedia.org/T289661) [10:59:54] (03PS2) 10Jbond: 
P:base::labs: update labs base profile to include profile::base [puppet] - 10https://gerrit.wikimedia.org/r/732639 (https://phabricator.wikimedia.org/T289661) [11:00:05] Amir1, Lucas_WMDE, and apergos: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T1100). [11:00:05] nemo-yiannis and Lucas_WMDE: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] o/ [11:00:27] there are no trainees signed up today, there are two patches in the window, one of which does not have a gerrit changeset linked [11:00:30] * apergos looks at Lucas_WMDE [11:00:36] (03PS1) 10Giuseppe Lavagetto: php-fpm: Allow changing location of the log files [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/732641 (https://phabricator.wikimedia.org/T288851) [11:00:38] (03PS1) 10Giuseppe Lavagetto: Add php 7.4 on buster images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/732642 (https://phabricator.wikimedia.org/T293996) [11:00:40] yeah it’s a bit unconventional [11:00:45] mainly a reminder to myself I guess ^^ [11:00:56] nemo-yiannis: do you want to self-serve your config change? 
[11:01:09] let's see if nemo-yiannis is around [11:01:11] yeah i can merge and deploy [11:01:15] ah very good [11:01:58] ok [11:02:02] (03CR) 10Jgiannelos: [C: 03+2] Configure event stream for map tiles state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730848 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [11:02:06] then I’ll wait for you before running my maintenance script [11:02:37] sounds good [11:02:51] I have one user report of being unable to connect to Wikidata btw (ERR_ADDRESS_UNREACHABLE) [11:03:00] is there a task you can link from your entry on the calendar, Lucas_WMDE? [11:03:02] I’ve pointed them to the “reporting a connectivity issue” page, let’s see if that leads to something [11:03:09] apergos: I don’t think so but let me ask [11:03:18] (03Merged) 10jenkins-bot: Configure event stream for map tiles state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730848 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [11:04:31] (no patchset link and no task link means anyone later wondering why that was run will have no idea, that's why I ask) [11:04:31] (03PS1) 10Jbond: P:standard: drop profile::standard as its no longer used [puppet] - 10https://gerrit.wikimedia.org/r/732644 [11:05:18] apergos: yeah, created https://phabricator.wikimedia.org/T294008 [11:05:42] perfect!! 👍 [11:06:06] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:27] !log jgiannelos@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:730848|Configure event stream for map tiles state change (T289771)]] (duration: 01m 04s) [11:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:32] T289771: Add kafka support for tile-pregeneration events - https://phabricator.wikimedia.org/T289771 [11:07:39] (03PS1) 10Volans: Upstream release v1.0.6 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/732647 [11:08:04] Lucas_WMDE: I am done with deploying my patch [11:08:10] ack, thank you [11:08:13] * Lucas_WMDE SSHs into mwmaint [11:09:04] * nemo-yiannis waits for the canary events from eventplatform [11:09:59] where are you looking for those? [11:10:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:36] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/ResubmitChanges.php wikidatawiki --minimum-age $((60*60*12)) # T294008 [11:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:41] T294008: Run ResubmitChanges.php to resubmit stuck change - https://phabricator.wikimedia.org/T294008 [11:10:55] apergos: I am using kafkacat in one of the maps nodes [11:11:17] oh! never poked around in there.... gtk [11:11:30] Also it looks like kafka topics are created so we should be OK [11:11:40] (03PS3) 10Jbond: P:base::labs: update labs base profile to include profile::base [puppet] - 10https://gerrit.wikimedia.org/r/732639 (https://phabricator.wikimedia.org/T289661) [11:11:48] I’m done with my maintenance script [11:12:59] so... that's the end of the window? darn fast today [11:13:07] not even 15 minutes! 
[11:13:42] I guess so [11:13:51] !log UTC morning backport+config window done [11:13:53] did that unstick the change btw? [11:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:58] it did \o/ [11:14:01] nice! [11:14:40] I was secretly hoping that last week's trainee would come back today but maybe they didn't get their +2/deploy rights sorted yet [11:14:46] maybe next week! [11:14:52] hopefully! [11:15:00] (03CR) 10Filippo Giunchedi: "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/732635 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [11:15:10] I might do some code backport later today but not right now [11:15:31] I guess that would be after the mediawiki train window, to not disturb that [11:15:35] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: add logstash common profile [puppet] - 10https://gerrit.wikimedia.org/r/727626 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [11:16:09] (03CR) 10Filippo Giunchedi: [C: 03+1] role: add logging::opensearch::collector role [puppet] - 10https://gerrit.wikimedia.org/r/721397 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [11:18:20] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: add minimal logstash-beta-next hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/723619 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [11:18:47] (03CR) 10Filippo Giunchedi: [C: 03+1] role: add logging::opensearch::data role [puppet] - 10https://gerrit.wikimedia.org/r/721400 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [11:19:00] (03CR) 10Jbond: "PCC https://puppet-compiler.wmflabs.org/compiler1002/31793/" [puppet] - 10https://gerrit.wikimedia.org/r/732638 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:19:02] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:19:04] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: fork kibana profile into opensearch::dashboards [puppet] - 10https://gerrit.wikimedia.org/r/721391 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [11:20:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:06] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:25:20] (03PS4) 10Jbond: P:base::labs: update labs base profile to include profile::base [puppet] - 10https://gerrit.wikimedia.org/r/732639 (https://phabricator.wikimedia.org/T289661) [11:25:26] gotcha [11:25:35] hopefully the train goes nice and smoothly [11:34:33] (03PS2) 10JMeybohm: Add basic ingress support to chart scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 [11:35:37] 10ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (10ayounsi) p:05Triage→03Medium [11:35:43] (03PS1) 10Lucas Werkmeister (WMDE): Enable dispatching via jobs by default [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732666 (https://phabricator.wikimedia.org/T291828) [11:35:49] (03PS1) 10Lucas Werkmeister (WMDE): Remove dispatchViaJobsAllowedClients repo setting [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732667 (https://phabricator.wikimedia.org/T292604) [11:35:55] (03PS1) 10Lucas Werkmeister (WMDE): Remove dispatchViaJobsPruneChangesTableInJobEnabled repo setting [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732668 (https://phabricator.wikimedia.org/T292604) [11:36:01] (03PS1) 10Lucas 
Werkmeister (WMDE): Remove dispatchViaJobs repo setting [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732669 (https://phabricator.wikimedia.org/T292604) [11:37:07] jouncebot: nowandnext [11:37:07] For the next 0 hour(s) and 22 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T1100) [11:37:07] In 0 hour(s) and 22 minute(s): Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T1200) [11:37:15] hrrrm [11:37:28] nah, 22 minutes probably isn’t enough for a Wikibase gate-and-submit [11:37:33] those changes will just have to wait [11:39:06] (03PS3) 10Jbond: P:base::production: update production roles to use P::base::production [puppet] - 10https://gerrit.wikimedia.org/r/732638 (https://phabricator.wikimedia.org/T289661) [11:39:08] (03PS5) 10Jbond: P:base::labs: update labs base profile to include profile::base [puppet] - 10https://gerrit.wikimedia.org/r/732639 (https://phabricator.wikimedia.org/T289661) [11:39:10] (03PS1) 10Jbond: P:base::production: add parameters to disable this profile [puppet] - 10https://gerrit.wikimedia.org/r/732670 (https://phabricator.wikimedia.org/T289661) [11:46:27] (03CR) 10Jbond: "PCC: labs https://puppet-compiler.wmflabs.org/compiler1003/31799/" [puppet] - 10https://gerrit.wikimedia.org/r/732639 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:47:11] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Schema change s6 T278619 [11:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:17] T278619: ipb_timestamp is varbinary(14) in old wikis while being binary(14) in the code since 2007 - https://phabricator.wikimedia.org/T278619 [11:47:18] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Schema change s6 T278619 [11:47:24] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:39] (03CR) 10Jbond: "PCC is no-op" [puppet] - 10https://gerrit.wikimedia.org/r/732638 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:52:24] (03CR) 10Jbond: "PCC deployment-prep noop https://puppet-compiler.wmflabs.org/compiler1003/31799/" [puppet] - 10https://gerrit.wikimedia.org/r/732639 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:54:57] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Schema change s5 T278619 [11:54:57] (03PS2) 10Jbond: P:standard: drop profile::standard as its no longer used [puppet] - 10https://gerrit.wikimedia.org/r/732644 [11:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:03] T278619: ipb_timestamp is varbinary(14) in old wikis while being binary(14) in the code since 2007 - https://phabricator.wikimedia.org/T278619 [11:55:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Schema change s5 T278619 [11:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:36] (03PS2) 10Jbond: P:base::production: add parameters to disable this profile [puppet] - 10https://gerrit.wikimedia.org/r/732670 (https://phabricator.wikimedia.org/T289661) [11:57:51] (03PS4) 10Jbond: P:base::production: update production roles to use P::base::production [puppet] - 10https://gerrit.wikimedia.org/r/732638 (https://phabricator.wikimedia.org/T289661) [11:58:02] (03PS6) 10Jbond: P:base::labs: update labs base profile to include profile::base [puppet] - 10https://gerrit.wikimedia.org/r/732639 (https://phabricator.wikimedia.org/T289661) [11:58:07] (03PS3) 10Jbond: P:standard: drop profile::standard as its no longer used [puppet] - 10https://gerrit.wikimedia.org/r/732644 [11:58:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31803/console" [puppet] - 10https://gerrit.wikimedia.org/r/732644 (owner: 10Jbond) [11:59:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31804/console" [puppet] - 10https://gerrit.wikimedia.org/r/732644 (owner: 10Jbond) [12:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T1200) [12:00:44] (03CR) 10jerkins-bot: [V: 04-1] P:base::production: update production roles to use P::base::production [puppet] - 10https://gerrit.wikimedia.org/r/732638 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:01:34] (03CR) 10jerkins-bot: [V: 04-1] Remove dispatchViaJobsPruneChangesTableInJobEnabled repo setting [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732668 (https://phabricator.wikimedia.org/T292604) (owner: 10Lucas Werkmeister (WMDE)) [12:01:45] 10SRE, 10ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (10Papaul) My Google calendar is showing (Equinix -eqord UPS Maintenance-loss of redundancy from 10pm to 11pm CT on Oct 20th) maybe that is the cause. 
[12:01:55] (03CR) 10jerkins-bot: [V: 04-1] Enable dispatching via jobs by default [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732666 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [12:02:06] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:26] oops, my changes need one more backport actually [12:03:13] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/732638 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:03:15] (03PS2) 10Lucas Werkmeister (WMDE): Enable dispatching via jobs by default [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732666 (https://phabricator.wikimedia.org/T291828) [12:03:18] (03PS2) 10Lucas Werkmeister (WMDE): Remove dispatchViaJobsAllowedClients repo setting [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732667 (https://phabricator.wikimedia.org/T292604) [12:03:21] (03PS2) 10Lucas Werkmeister (WMDE): Remove dispatchViaJobsPruneChangesTableInJobEnabled repo setting [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732668 (https://phabricator.wikimedia.org/T292604) [12:03:24] (03PS2) 10Lucas Werkmeister (WMDE): Remove dispatchViaJobs repo setting [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732669 (https://phabricator.wikimedia.org/T292604) [12:03:27] (03PS1) 10Lucas Werkmeister (WMDE): Fix ExternalUserNames service wiring for local database [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732674 [12:06:02] (03CR) 10MMandere: prometheus: Add drmrs DC site (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732635 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [12:10:58] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 94.38% of data under the 
critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [12:14:46] (03PS2) 10MMandere: prometheus: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/732635 (https://phabricator.wikimedia.org/T282787) [12:29:32] (03CR) 10Jbond: [C: 03+2] P:base::production: add parameters to disable this profile [puppet] - 10https://gerrit.wikimedia.org/r/732670 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:29:41] (03CR) 10Jbond: [C: 03+2] P:base::production: update production roles to use P::base::production [puppet] - 10https://gerrit.wikimedia.org/r/732638 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:29:48] (03CR) 10Jbond: [C: 03+2] P:base::labs: update labs base profile to include profile::base [puppet] - 10https://gerrit.wikimedia.org/r/732639 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:29:55] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:standard: drop profile::standard as its no longer used [puppet] - 10https://gerrit.wikimedia.org/r/732644 (owner: 10Jbond) [12:30:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:48] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 92.99% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [12:34:16] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 11 hosts with reason: Schema change s7 T278619 [12:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:23] T278619: ipb_timestamp is varbinary(14) in old wikis while being binary(14) in the code since 2007 - https://phabricator.wikimedia.org/T278619 [12:34:25] !log kormat@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 11 hosts with reason: Schema change s7 T278619 [12:34:29] (03CR) 10jerkins-bot: [V: 04-1] Fix ExternalUserNames service wiring for local database [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732674 (owner: 10Lucas Werkmeister (WMDE)) [12:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:47] (03CR) 10jerkins-bot: [V: 04-1] Remove dispatchViaJobs repo setting [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732669 (https://phabricator.wikimedia.org/T292604) (owner: 10Lucas Werkmeister (WMDE)) [12:36:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:57] 10SRE, 10Observability-Logging, 10Traffic, 10User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (10ema) I've tried using a separate mtail instance with a subset of the scripts used by the production instance, namely: - varnisherror... 
[12:43:17] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Schema change s2 T278619 [12:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:24] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Schema change s2 T278619 [12:43:25] T278619: ipb_timestamp is varbinary(14) in old wikis while being binary(14) in the code since 2007 - https://phabricator.wikimedia.org/T278619 [12:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:35] (03PS1) 10Elukey: Add network policies for the kserve-inference chart deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/732677 (https://phabricator.wikimedia.org/T289834) [12:48:33] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 13 hosts with reason: Schema change s4 T278619 [12:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:40] T278619: ipb_timestamp is varbinary(14) in old wikis while being binary(14) in the code since 2007 - https://phabricator.wikimedia.org/T278619 [12:48:43] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 13 hosts with reason: Schema change s4 T278619 [12:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:10] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 14 hosts with reason: Schema change s1 T278619 [12:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:21] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 14 hosts with reason: Schema change s1 T278619 [12:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:12] (03CR) 10Jbond: "I tried rebasing this but it turned out easier to just re-do it see: CR starting 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/7326" [puppet] - 10https://gerrit.wikimedia.org/r/714983 (owner: 10David Caro) [12:55:24] (03Abandoned) 10Jbond: global: use p:b:production the main entry point [puppet] - 10https://gerrit.wikimedia.org/r/714983 (owner: 10David Caro) [12:56:41] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 7 hosts with reason: Schema change s3 T278619 [12:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:47] T278619: ipb_timestamp is varbinary(14) in old wikis while being binary(14) in the code since 2007 - https://phabricator.wikimedia.org/T278619 [12:56:47] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 7 hosts with reason: Schema change s3 T278619 [12:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:56] good afternoon, it is train time! [13:00:04] hashar and dancy: Time to snap out of that daydream and deploy MediaWiki train - Utc-0+Utc-7 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T1300). 
[13:00:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:16] fingers crossed then :) [13:01:20] lets review the blocker task first [13:01:44] looks clear [13:01:53] (CR) Volans: [C: +2] Upstream release v1.0.6 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/732647 (owner: Volans) [13:02:13] (PS1) Hashar: all wikis to 1.38.0-wmf.5 refs T281169 [mediawiki-config] - https://gerrit.wikimedia.org/r/732678 [13:02:15] (CR) Hashar: [C: +2] all wikis to 1.38.0-wmf.5 refs T281169 [mediawiki-config] - https://gerrit.wikimedia.org/r/732678 (owner: Hashar) [13:02:56] (Merged) jenkins-bot: all wikis to 1.38.0-wmf.5 refs T281169 [mediawiki-config] - https://gerrit.wikimedia.org/r/732678 (owner: Hashar) [13:04:14] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.5 refs T281169 [13:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:20] T281169: 1.38.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T281169 [13:05:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[13:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [13:06:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: gobblin-event_default.service,produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:05] unsurprisingly it is all quiet ;) [13:08:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:08:09] (Merged) jenkins-bot: Upstream release v1.0.6 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/732647 (owner: Volans) [13:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:20] (CR) Lucas Werkmeister (WMDE): "Random failure. Let’s not bother with a recheck – the important thing will be the gate-and-submit." [extensions/Wikibase] (wmf/1.38.0-wmf.5) - https://gerrit.wikimedia.org/r/732674 (owner: Lucas Werkmeister (WMDE)) [13:09:44] (CR) Lucas Werkmeister (WMDE): "Random failure, let’s just see if the gate-and-submit fares better."
[extensions/Wikibase] (wmf/1.38.0-wmf.5) - https://gerrit.wikimedia.org/r/732669 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE)) [13:10:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [13:15:17] SRE-swift-storage, User-fgiunchedi: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 (MatthewVernon) a:MatthewVernon [13:16:52] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.14% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [13:17:35] (PS2) Btullis: Correct the team-data-engineering alerts [alerts] - https://gerrit.wikimedia.org/r/732623 (https://phabricator.wikimedia.org/T293399) [13:20:42] well train looks quiet [13:23:54] (CR) Giuseppe Lavagetto: [V: +2 C: +2] php-fpm: Allow changing location of the log files [docker-images/production-images] - https://gerrit.wikimedia.org/r/732641 (https://phabricator.wikimedia.org/T288851) (owner: Giuseppe Lavagetto) [13:24:51] (CR) Elukey: Add network policies for the kserve-inference chart deployments (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/732677 (https://phabricator.wikimedia.org/T289834) (owner: Elukey) [13:25:02] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [13:26:03] SRE-swift-storage: Swift-recon -d overstates disk capacity and usage - https://phabricator.wikimedia.org/T294016 (MatthewVernon) [13:28:45] SRE-swift-storage, Data-Persistence: Swift-recon -d
overstates disk capacity and usage - https://phabricator.wikimedia.org/T294016 (MatthewVernon) [13:30:19] SRE-swift-storage, Data-Persistence: Swift-recon -d overstates disk capacity and usage - https://phabricator.wikimedia.org/T294016 (MatthewVernon) Stick the patch here just in case... ` diff --git a/swift/cli/recon.py b/swift/cli/recon.py index cd0952875..304a75a90 100644 --- a/swift/cli/recon.py +++ b/s... [13:31:28] (CR) Volans: [C: -1] "Sorry I did a full grep of all generated files and found another couple" [dns] - https://gerrit.wikimedia.org/r/732380 (https://phabricator.wikimedia.org/T282787) (owner: Ayounsi) [13:34:18] (PS2) Lucas Werkmeister (WMDE): maintain-meta_p: stop reading VariantSettings.php [puppet] - https://gerrit.wikimedia.org/r/665116 [13:34:45] !log uploaded spicerack_1.0.6 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [13:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:41] (CR) Marostegui: [WIP] Make profile::mariadb::dbstore_multiinstance more generic (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732369 (owner: Ottomata) [13:38:14] SRE-swift-storage, Data-Persistence: Monitoring (?+alerting) for Swift capacity - https://phabricator.wikimedia.org/T294019 (MatthewVernon) [13:45:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:51] (CR) Herron: [C: +2] kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers [puppet] - https://gerrit.wikimedia.org/r/731976 (https://phabricator.wikimedia.org/T293439) (owner: Herron) [13:49:29] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [13:49:29] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[13:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:56] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [13:55:56] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [13:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:49] (CR) Kormat: [WIP] Make profile::mariadb::dbstore_multiinstance more generic (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732369 (owner: Ottomata) [14:01:14] hashar: since you closed the train task, do you mind if I start some wmf.5 backports before the window is over? [14:01:26] they should™ all be no-ops, enabling something by default that’s already enabled via mediawiki-config [14:01:58] Lucas_WMDE: yes definitely! though I am no more around I have to commute around [14:02:09] but for a backport I guess you can handle it just fine ? [14:02:16] yeah, I’ll just be careful :) [14:02:18] thanks!
[14:02:26] (PS2) Elukey: Add network policies for the kserve-inference chart deployments [deployment-charts] - https://gerrit.wikimedia.org/r/732677 (https://phabricator.wikimedia.org/T289834) [14:03:53] (CR) jerkins-bot: [V: -1] Add network policies for the kserve-inference chart deployments [deployment-charts] - https://gerrit.wikimedia.org/r/732677 (https://phabricator.wikimedia.org/T289834) (owner: Elukey) [14:05:34] * hashar waves [14:06:29] (CR) Lucas Werkmeister (WMDE): [C: +2] "This should be a safe backport – in production, the entity source database name is always a string, even when it refers to the local wiki." [extensions/Wikibase] (wmf/1.38.0-wmf.5) - https://gerrit.wikimedia.org/r/732674 (owner: Lucas Werkmeister (WMDE)) [14:07:35] (CR) Lucas Werkmeister (WMDE): [C: +2] "We already set this in mediawiki/config.git for all repo wikis (see Ie7dd58a6c9), so it should be safe to backport." [extensions/Wikibase] (wmf/1.38.0-wmf.5) - https://gerrit.wikimedia.org/r/732666 (https://phabricator.wikimedia.org/T291828) (owner: Lucas Werkmeister (WMDE)) [14:07:52] let’s start with those two for now [14:07:56] * Lucas_WMDE waits for CI [14:10:19] (PS28) Jbond: (WIP) monitoring: refactor class [puppet] - https://gerrit.wikimedia.org/r/725045 [14:10:53] (CR) jerkins-bot: [V: -1] (WIP) monitoring: refactor class [puppet] - https://gerrit.wikimedia.org/r/725045 (owner: Jbond) [14:14:37] (PS29) Jbond: (WIP) monitoring: refactor class [puppet] - https://gerrit.wikimedia.org/r/725045 [14:15:43] (CR) Filippo Giunchedi: [C: +1] Correct the team-data-engineering alerts [alerts] - https://gerrit.wikimedia.org/r/732623 (https://phabricator.wikimedia.org/T293399) (owner: Btullis) [14:19:51] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
[14:19:51] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:05] (PS1) Herron: kafka-jumbo: permit centrallog2002 via ferm [puppet] - https://gerrit.wikimedia.org/r/732711 (https://phabricator.wikimedia.org/T292196) [14:20:12] SRE, ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (cmooney) I spent too long trying to find how to monitor the supply voltage but it doesn't seem to be possible? The PSU output voltage does show in "show chassis env pem", on the MX204s anywa... [14:22:20] (PS2) Volans: install_server: uniform DHCP snippet automation [puppet] - https://gerrit.wikimedia.org/r/730416 (https://phabricator.wikimedia.org/T269855) [14:22:29] (PS30) Jbond: (WIP) monitoring: refactor class [puppet] - https://gerrit.wikimedia.org/r/725045 [14:23:33] (CR) Volans: [C: +2] install_server: uniform DHCP snippet automation [puppet] - https://gerrit.wikimedia.org/r/730416 (https://phabricator.wikimedia.org/T269855) (owner: Volans) [14:23:53] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31809/console" [puppet] - https://gerrit.wikimedia.org/r/725045 (owner: Jbond) [14:23:56] (CR) jerkins-bot: [V: -1] (WIP) monitoring: refactor class [puppet] - https://gerrit.wikimedia.org/r/725045 (owner: Jbond) [14:24:05] SRE, ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (Papaul) p:Medium→Lowest [14:24:29] (PS2) Herron: kafka-jumbo: permit centrallog2002 via ferm [puppet] - https://gerrit.wikimedia.org/r/732711 (https://phabricator.wikimedia.org/T292196) [14:26:16] !log otto@deploy1002 helmfile
[eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:26:16] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [14:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:39] (PS3) Andrew Bogott: openStack:haproxy add tls termination for openstack in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/732397 (https://phabricator.wikimedia.org/T267194) [14:32:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:51] SRE, ops-codfw, DC-Ops, observability, and 2 others: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (fgiunchedi) I did some more work on Eaton PDU and librenms today. From the `snmpwalk` output at P17569 I think all the data we need (per-phase data, environmental data... [14:33:49] !log volans@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [14:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:01] (Merged) jenkins-bot: Fix ExternalUserNames service wiring for local database [extensions/Wikibase] (wmf/1.38.0-wmf.5) - https://gerrit.wikimedia.org/r/732674 (owner: Lucas Werkmeister (WMDE)) [14:34:04] (Merged) jenkins-bot: Enable dispatching via jobs by default [extensions/Wikibase] (wmf/1.38.0-wmf.5) - https://gerrit.wikimedia.org/r/732666 (https://phabricator.wikimedia.org/T291828) (owner: Lucas Werkmeister (WMDE)) [14:34:18] ok!
[14:35:13] pulling the first change to mwdebug1001 to test… [14:36:29] works fine as far as I can tell [14:36:42] (CR) Andrew Bogott: [C: +2] openStack:haproxy add tls termination for openstack in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/732397 (https://phabricator.wikimedia.org/T267194) (owner: Andrew Bogott) [14:36:59] (the one “undefined method getExternalUserName()” in mwdebug logstash is just a typo by me in eval.php, harmless) [14:38:13] (PS3) Andrew Bogott: openstack:haproxy add tls for nova metadata service [puppet] - https://gerrit.wikimedia.org/r/732398 (https://phabricator.wikimedia.org/T267194) [14:38:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:27] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/Wikibase/client/: Backport: [[gerrit:732674|Fix ExternalUserNames service wiring for local database]] (duration: 00m 57s) [14:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:48] (CR) Andrew Bogott: [C: -1] "this needs a bit more thought -- VMs have this port hard-coded in cloud-init so adding a new tls port doesn't really help. I'm not sure i" [puppet] - https://gerrit.wikimedia.org/r/732398 (https://phabricator.wikimedia.org/T267194) (owner: Andrew Bogott) [14:41:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[14:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:28] (Abandoned) Ottomata: [WIP] Make profile::mariadb::dbstore_multiinstance more generic [puppet] - https://gerrit.wikimedia.org/r/732369 (owner: Ottomata) [14:42:50] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/Wikibase/repo/config/Wikibase.default.php: Backport: [[gerrit:732666|Enable dispatching via jobs by default (T291828)]] (duration: 00m 55s) [14:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:56] T291828: Remove transitionary Dispatch Config - https://phabricator.wikimedia.org/T291828 [14:43:40] (CR) Lucas Werkmeister (WMDE): [C: +2] "Deploy in this order:" [extensions/Wikibase] (wmf/1.38.0-wmf.5) - https://gerrit.wikimedia.org/r/732667 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE)) [14:56:00] (PS1) Jbond: profile::rsyslog::kafka_destination_clusters: add default to cloud [puppet] - https://gerrit.wikimedia.org/r/732715 [14:56:30] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster [14:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:09] (PS2) Jbond: profile::rsyslog::kafka_destination_clusters: add default to cloud [puppet] - https://gerrit.wikimedia.org/r/732715 [15:00:27] (PS1) Marostegui: backups: Send db2098:3318 to dbprov2002 [puppet] - https://gerrit.wikimedia.org/r/732717 (https://phabricator.wikimedia.org/T290868) [15:00:49] (CR) Marostegui: [C: -2] "Wait for the manual run to fully finish" [puppet] - https://gerrit.wikimedia.org/r/732717 (https://phabricator.wikimedia.org/T290868) (owner: Marostegui) [15:03:55] (CR) Marostegui: [C: -2] "The manual run finished successfully:" [puppet] - https://gerrit.wikimedia.org/r/732717 (https://phabricator.wikimedia.org/T290868) (owner: Marostegui)
[15:05:50] (Merged) jenkins-bot: Remove dispatchViaJobsAllowedClients repo setting [extensions/Wikibase] (wmf/1.38.0-wmf.5) - https://gerrit.wikimedia.org/r/732667 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE)) [15:08:19] (CR) Marostegui: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/31813/" [puppet] - https://gerrit.wikimedia.org/r/732717 (https://phabricator.wikimedia.org/T290868) (owner: Marostegui) [15:08:21] (CR) Herron: [C: +1] profile::rsyslog::kafka_destination_clusters: add default to cloud [puppet] - https://gerrit.wikimedia.org/r/732715 (owner: Jbond) [15:08:33] RECOVERY - snapshot of s8 in codfw on alert1001 is OK: Last snapshot for s8 at codfw (db2098.codfw.wmnet:3318) taken on 2021-10-21 13:36:25 (1305 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:08:47] (PS3) Elukey: Add network policies for the kserve-inference chart deployments [deployment-charts] - https://gerrit.wikimedia.org/r/732677 (https://phabricator.wikimedia.org/T289834) [15:08:49] (CR) Btullis: [C: +2] Correct the team-data-engineering alerts [alerts] - https://gerrit.wikimedia.org/r/732623 (https://phabricator.wikimedia.org/T293399) (owner: Btullis) [15:10:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:36] I’m syncing that Wikibase backport [15:11:02] (Merged) jenkins-bot: Correct the team-data-engineering alerts [alerts] - https://gerrit.wikimedia.org/r/732623 (https://phabricator.wikimedia.org/T293399) (owner: Btullis) [15:11:46] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/Wikibase/repo/includes/: Backport: [[gerrit:732667|Remove dispatchViaJobsAllowedClients repo setting (T292604)]] (1/3) (duration: 00m 56s) [15:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:52] T292604: Clean up old change dispatching code - https://phabricator.wikimedia.org/T292604 [15:12:49] !log my next message accidentally says 1/3 again but it’s 2/3, sorry [15:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:06] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/Wikibase/repo/config/: Backport: [[gerrit:732667|Remove dispatchViaJobsAllowedClients repo setting (T292604)]] (1/3) (duration: 00m 54s) [15:13:10] (don’t want to risk Ctrl+Cing scap just to fix the message) [15:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:28] (CR) Kormat: [C: +1] "Looks plausible" [puppet] - https://gerrit.wikimedia.org/r/732717 (https://phabricator.wikimedia.org/T290868) (owner: Marostegui) [15:13:42] (CR) Marostegui: [C: +2] backups: Send db2098:3318 to dbprov2002 [puppet] - https://gerrit.wikimedia.org/r/732717 (https://phabricator.wikimedia.org/T290868) (owner: Marostegui) [15:13:44] "plausible" [15:13:52] :D [15:13:56] elukey: that's the best you can get from her [15:14:14] marostegui: yes yes you are so lucky, never ever got near that [15:14:28] (CR) Andrew Bogott: [C: +2] profile::rsyslog::kafka_destination_clusters: add default to cloud [puppet] - https://gerrit.wikimedia.org/r/732715 (owner: Jbond) [15:14:38] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/Wikibase/repo/tests/: Backport: [[gerrit:732667|Remove dispatchViaJobsAllowedClients repo setting (T292604)]] (3/3) (duration: 00m 56s) [15:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:17] (PS1) Btullis: Remove all remaining references to alluxio [puppet] - https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) [15:15:29] (CR) Lucas Werkmeister (WMDE): [C: +2] "Deploy in this order:" [extensions/Wikibase] (wmf/1.38.0-wmf.5) - https://gerrit.wikimedia.org/r/732668 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE)) [15:15:37] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:25] (PS1) Andrew Bogott: Designate: add tls firewall rules to the api firewall class [puppet] - https://gerrit.wikimedia.org/r/732720 (https://phabricator.wikimedia.org/T267194) [15:20:36] (CR) Andrew Bogott: [C: +2] Designate: add tls firewall rules to the api firewall class
[puppet] - https://gerrit.wikimedia.org/r/732720 (https://phabricator.wikimedia.org/T267194) (owner: Andrew Bogott) [15:20:50] (PS1) Ahmon Dancy: First rev of WMF docker-resource-monitor/docker-gc images [docker-images/production-images] - https://gerrit.wikimedia.org/r/732722 [15:21:18] !log robh@cumin1001 START - Cookbook sre.dns.netbox [15:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:21] (CR) jerkins-bot: [V: -1] Remove dispatchViaJobsPruneChangesTableInJobEnabled repo setting [extensions/Wikibase] (wmf/1.38.0-wmf.5) - https://gerrit.wikimedia.org/r/732668 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE)) [15:30:51] (CR) Lucas Werkmeister (WMDE): [C: +2] "Random failure, try again." [extensions/Wikibase] (wmf/1.38.0-wmf.5) - https://gerrit.wikimedia.org/r/732668 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE)) [15:37:22] (PS1) Herron: cloud.yaml add kafka_clusters, populate with kafka-logging brokers [puppet] - https://gerrit.wikimedia.org/r/732725 [15:38:41] (PS2) Ahmon Dancy: First rev of WMF docker-resource-monitor/docker-gc images [docker-images/production-images] - https://gerrit.wikimedia.org/r/732722 (https://phabricator.wikimedia.org/T294034) [15:39:04] (CR) Btullis: "This check has now been migrated to Alertmanager."
[puppet] - https://gerrit.wikimedia.org/r/731921 (https://phabricator.wikimedia.org/T293399) (owner: Btullis) [15:39:10] (CR) Btullis: [C: +2] Remove HDFS Capacity Remaining check from Icinga [puppet] - https://gerrit.wikimedia.org/r/731921 (https://phabricator.wikimedia.org/T293399) (owner: Btullis) [15:40:14] (CR) Lucas Werkmeister (WMDE): [C: +2] "Deploy in this order:" [extensions/Wikibase] (wmf/1.38.0-wmf.5) - https://gerrit.wikimedia.org/r/732669 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE)) [15:42:49] (CR) Herron: [C: +2] cloud.yaml add kafka_clusters, populate with kafka-logging brokers [puppet] - https://gerrit.wikimedia.org/r/732725 (owner: Herron) [15:43:50] (CR) Ahmon Dancy: "The intention is to run the resource monitor and docker-gc containers on relevant nodes (e.g., Jenkins and gitlab runners). To be puppeti" [docker-images/production-images] - https://gerrit.wikimedia.org/r/732722 (https://phabricator.wikimedia.org/T294034) (owner: Ahmon Dancy) [15:43:58] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:41] (PS1) Herron: Revert "kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers" [puppet] - https://gerrit.wikimedia.org/r/732692 [15:51:10] (CR) jerkins-bot: [V: -1] Revert "kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers" [puppet] - https://gerrit.wikimedia.org/r/732692 (owner: Herron) [15:53:50] (PS2) Herron: Revert "kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers" [puppet] - https://gerrit.wikimedia.org/r/732692 (https://phabricator.wikimedia.org/T293439) [15:55:26] (Merged) jenkins-bot: Remove dispatchViaJobsPruneChangesTableInJobEnabled repo setting [extensions/Wikibase] (wmf/1.38.0-wmf.5) - https://gerrit.wikimedia.org/r/732668
(https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE)) [15:55:31] (CR) Herron: [C: +2] Revert "kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers" [puppet] - https://gerrit.wikimedia.org/r/732692 (https://phabricator.wikimedia.org/T293439) (owner: Herron) [15:55:39] alright, deploying that wmf.5 backport [15:58:11] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/Wikibase/repo/includes/: Backport: [[gerrit:732668|Remove dispatchViaJobsPruneChangesTableInJobEnabled repo setting (T292604)]] (1/3) (duration: 00m 57s) [15:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:17] T292604: Clean up old change dispatching code - https://phabricator.wikimedia.org/T292604 [15:59:07] SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (MGerlach) [15:59:12] (CR) Elukey: [C: +2] Add network policies for the kserve-inference chart deployments [deployment-charts] - https://gerrit.wikimedia.org/r/732677 (https://phabricator.wikimedia.org/T289834) (owner: Elukey) [15:59:55] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/Wikibase/repo/config/: Backport: [[gerrit:732668|Remove dispatchViaJobsPruneChangesTableInJobEnabled repo setting (T292604)]] (2/3) (duration: 00m 55s) [16:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:05] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:35] oh no, I didn’t get done in time with my backports [16:00:45] jbond, rzl: I still have one backport in gate-and-submit, is it okay if I finish that?
[16:01:11] zuul says it’ll merge any moment now [16:01:14] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/Wikibase/repo/tests/: Backport: [[gerrit:732668|Remove dispatchViaJobsPruneChangesTableInJobEnabled repo setting (T292604)]] (3/3) (duration: 00m 56s) [16:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:11] (Merged) jenkins-bot: Remove dispatchViaJobs repo setting [extensions/Wikibase] (wmf/1.38.0-wmf.5) - https://gerrit.wikimedia.org/r/732669 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE)) [16:02:19] ^ deploying that one (will be three scaps) [16:02:25] Lucas_WMDE: no rush, puppet window is a no-op today :) [16:02:31] ok phew :) [16:02:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:02] SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (MGerlach) Hi. we have a new formal collaborator onboard: @Effeietsanders . They need access to HDFS and stat machines for a new research project. Let me know i... [16:03:39] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[16:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:52] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/Wikibase/repo/includes/: Backport: [[gerrit:732669|Remove dispatchViaJobs repo setting (T292604)]] (1/3) (duration: 00m 56s) [16:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:58] T292604: Clean up old change dispatching code - https://phabricator.wikimedia.org/T292604 [16:05:10] (PS1) Elukey: Fix the kserve-inference chart version [deployment-charts] - https://gerrit.wikimedia.org/r/732730 [16:05:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:05] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/Wikibase/repo/config/: Backport: [[gerrit:732669|Remove dispatchViaJobs repo setting (T292604)]] (2/3) (duration: 00m 54s) [16:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:14] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/Wikibase/repo/tests/: Backport: [[gerrit:732669|Remove dispatchViaJobs repo setting (T292604)]] (3/3) (duration: 00m 56s) [16:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:24] alright, I’m done \o/ [16:07:40] cc rzl just in case you want to puppetize something after all [16:07:56] (well, I guess you do that normally anyways, but in case someone else needs something puppetized in the window? idk) [16:10:10] (CR) Elukey: [C: +2] Fix the kserve-inference chart version [deployment-charts] - https://gerrit.wikimedia.org/r/732730 (owner: Elukey) [16:10:18] haha appreciate it! [16:12:15] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[16:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:52] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:13:26] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [16:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:38] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:14:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [16:17:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [16:24:19] (PS1) Andrew Bogott: OpenStack: remove obsolete 'train' config files [puppet] - https://gerrit.wikimedia.org/r/732736 [16:24:29] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:25:26] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (RhinosF1) [16:26:28] (CR) Andrew Bogott: [C: +2] OpenStack: remove obsolete 'train' config files [puppet] - https://gerrit.wikimedia.org/r/732736 (owner: Andrew Bogott) [16:26:31] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:27:00] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (RhinosF1) Will need @Ottomata or @odimitrijevic's approval I believe [16:34:20] akosiaris: you around (or anyone that can access the secret keys in prod?) [16:34:40] I want to try rotating a key to fix https://phabricator.wikimedia.org/T294010 [16:34:58] mutante: ^ are you're on clinic duty :) [16:35:14] *as [16:35:41] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (Ottomata) Approved. This means ssh access + kerberos. 
[16:36:17] (PS1) Ahmon Dancy: php-fpm: Add settings to control debuggability [docker-images/production-images] - https://gerrit.wikimedia.org/r/732737 [16:36:20] Thanks Reedy :) [16:37:09] (PS1) Andrew Bogott: OpenStack: Always use the tls port (25357) for keystone admin endpoints [puppet] - https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) [16:38:32] (CR) Majavah: [C: -1] "protocol needs to be changed to https if you're using the tls port" [puppet] - https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) (owner: Andrew Bogott) [16:39:33] (PS1) Cwhite: httpd-fcgi: add cee_ecs_accesslog_170 log format [docker-images/production-images] - https://gerrit.wikimedia.org/r/732739 [16:39:54] (PS1) Samuel (WMF): maintain-views.yaml: Restrict `localuser` table to prevent disclosure [puppet] - https://gerrit.wikimedia.org/r/732740 (https://phabricator.wikimedia.org/T292594) [16:42:50] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (aborrero) I take it the main concern here is allocating a public IPv4 address, which is a scarce resource, no? It seems we have a r... 
[16:43:30] (PS2) Andrew Bogott: OpenStack: Always use the tls port (25357) for keystone admin endpoints [puppet] - https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) [16:47:18] (CR) Majavah: OpenStack: Always use the tls port (25357) for keystone admin endpoints (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) (owner: Andrew Bogott) [16:49:01] (CR) Reedy: [C: +1] maintain-views.yaml: Restrict `localuser` table to prevent disclosure [puppet] - https://gerrit.wikimedia.org/r/732740 (https://phabricator.wikimedia.org/T292594) (owner: Samuel (WMF)) [16:49:22] (Abandoned) Cwhite: profile: improve kafka_shipper rsyslog output ssl options [puppet] - https://gerrit.wikimedia.org/r/711741 (https://phabricator.wikimedia.org/T288618) (owner: Cwhite) [16:50:04] (CR) Majavah: [C: -1] maintain-views.yaml: Restrict `localuser` table to prevent disclosure (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732740 (https://phabricator.wikimedia.org/T292594) (owner: Samuel (WMF)) [16:50:59] (CR) Legoktm: package_builder: Add hook for building PHP 7.4 packages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732097 (https://phabricator.wikimedia.org/T293449) (owner: Legoktm) [16:58:47] (PS3) Andrew Bogott: OpenStack: Always use the tls ports (5000, 25357) for openstack services [puppet] - https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) [17:00:04] chrisalbon and accraze: Time to snap out of that daydream and deploy Services – Graphoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T1700). 
[17:05:27] (PS1) Herron: Revert "profile::rsyslog::kafka_destination_clusters: add default to cloud" [puppet] - https://gerrit.wikimedia.org/r/732697 [17:05:49] (PS2) Herron: Revert "profile::rsyslog::kafka_destination_clusters: add default to cloud" [puppet] - https://gerrit.wikimedia.org/r/732697 [17:06:09] (CR) jerkins-bot: [V: -1] Revert "profile::rsyslog::kafka_destination_clusters: add default to cloud" [puppet] - https://gerrit.wikimedia.org/r/732697 (owner: Herron) [17:06:37] (PS1) Herron: Revert "cloud.yaml add kafka_clusters, populate with kafka-logging brokers" [puppet] - https://gerrit.wikimedia.org/r/732698 [17:09:03] (CR) Samuel (WMF): maintain-views.yaml: Restrict `localuser` table to prevent disclosure (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732740 (https://phabricator.wikimedia.org/T292594) (owner: Samuel (WMF)) [17:09:35] (PS2) Samuel (WMF): maintain-views.yaml: Restrict `localuser` table to prevent disclosure [puppet] - https://gerrit.wikimedia.org/r/732740 (https://phabricator.wikimedia.org/T292594) [17:10:38] (CR) Herron: [C: +2] Revert "cloud.yaml add kafka_clusters, populate with kafka-logging brokers" [puppet] - https://gerrit.wikimedia.org/r/732698 (owner: Herron) [17:11:35] (PS3) Herron: Revert "profile::rsyslog::kafka_destination_clusters: add default to cloud" [puppet] - https://gerrit.wikimedia.org/r/732697 [17:12:26] (CR) Herron: [C: +2] Revert "profile::rsyslog::kafka_destination_clusters: add default to cloud" [puppet] - https://gerrit.wikimedia.org/r/732697 (owner: Herron) [17:12:55] (PS4) Andrew Bogott: OpenStack: Always use the tls ports (5000, 25357) for openstack services [puppet] - https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) [17:13:32] (CR) jerkins-bot: [V: -1] OpenStack: Always use the tls ports (5000, 25357) for openstack services [puppet] - https://gerrit.wikimedia.org/r/732738 
(https://phabricator.wikimedia.org/T267194) (owner: Andrew Bogott) [17:15:33] (PS1) Majavah: kubeadm::kubectl: install kubectl-sudo [puppet] - https://gerrit.wikimedia.org/r/732747 [17:16:13] (CR) jerkins-bot: [V: -1] kubeadm::kubectl: install kubectl-sudo [puppet] - https://gerrit.wikimedia.org/r/732747 (owner: Majavah) [17:20:07] (PS2) Majavah: kubeadm::kubectl: install kubectl-sudo [puppet] - https://gerrit.wikimedia.org/r/732747 [17:24:18] (PS5) Andrew Bogott: OpenStack: Always use the tls ports (25000, 25357) for openstack services [puppet] - https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) [17:24:20] (CR) Dzahn: cumin: add an alias for new pki roles and add to misc-others (2 comments) [puppet] - https://gerrit.wikimedia.org/r/732425 (owner: Dzahn) [17:24:54] (PS2) Dzahn: cumin: add an alias for new pki roles and add to misc-others [puppet] - https://gerrit.wikimedia.org/r/732425 [17:25:42] (CR) Dzahn: pontoon: disable puppetmaster trying to pull _private_ geoip databases (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732405 (owner: Dzahn) [17:25:53] SRE, Analytics, SRE Observability (FY2021/2022-Q2): statsd and gunicorn metrics for superset - https://phabricator.wikimedia.org/T293761 (odimitrijevic) p:Triage→Medium [17:25:55] (Abandoned) Dzahn: pontoon: disable puppetmaster trying to pull _private_ geoip databases [puppet] - https://gerrit.wikimedia.org/r/732405 (owner: Dzahn) [17:29:22] Reedy: yea, I'll look at that ticket now [17:32:45] (PS1) Btullis: Add an alert for HDFS corrupt blocks [alerts] - https://gerrit.wikimedia.org/r/732748 (https://phabricator.wikimedia.org/T293399) [17:36:31] mvolz: hi, do you know what the variable is called in the private repo? I tried searching for "worldcat" so far, but no [17:36:48] looking now [17:36:57] mutante: wskey [17:37:54] mvolz: alright, I see that. ACK. so, you have a different key somewhere? 
[17:38:17] do you have that on deploy1002 or something? [17:38:27] I can reissue it and we can see if that does the trick [17:38:34] ok [17:38:41] I have it in an online dashboard, how do I get it to you? [17:38:47] I will make a backup of the old one before we change it [17:38:57] do you have shell access to any wikimedia servers? [17:39:07] I have shell access to deploy [17:39:38] ok, can you just dump it into a file in your home dir, not world-readable? I will take it as root and paste it to the private repo [17:40:30] looks like the key needs to be in 2 places, scb cluster and deployment server, but it's the same one [17:41:14] mutante: done [17:42:02] (CR) Andrew Bogott: [C: +1] "This looks basically right to me." [puppet] - https://gerrit.wikimedia.org/r/732740 (https://phabricator.wikimedia.org/T292594) (owner: Samuel (WMF)) [17:42:03] ok, on it [17:47:09] replaced in the repo in 4 places, 1 for scb and 3 for deployment server (staging, eqiad and codfw could theoretically use separate keys but it's the same in all places for now) [17:47:15] running puppet on deploy1002 [17:48:04] scb isn't a thing anymore but remnants in the private repo [17:48:40] And the Beta Cluster. :-( [17:48:58] mvolz: it's replaced in the helmfile, who usually deploys it now? [17:49:04] I can redeploy [17:49:18] ok, so as I said, I replaced it in all 3, staging, eqiad and codfw [17:49:22] old key before new key after [17:49:50] can revert if needed but it's already broken ..so yea [17:49:53] right? [17:50:11] !log mvolz@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [17:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:52] right, already broken. Also when the key gets reissued the old one is dead anyway so! 
[17:51:07] ah, *nod*, yea [17:51:45] looks like it's working in staging, I'll do the other two now [17:51:56] nice [17:52:16] !log mvolz@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [17:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:41] !log mvolz@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . [17:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:50] !log citoid - replaced "wskey" for worldcat in private repo as requested on T294010 (is in 4 places, 3 for deployment_server/k8s and one remnant for scb) [17:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:56] T294010: Automatic citation generation using ISBN on Wikipedia doesn't work - https://phabricator.wikimedia.org/T294010 [17:55:40] !log that's a key for https://www.worldcat.org/whatis/default.jsp btw for those wondering [17:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:30] ok, everything looks fixed! Thanks for your help! 
[17:56:42] yay:) glad it fixed it [17:56:59] always nice to close an UBN [17:58:50] SRE, ops-eqiad, Analytics, DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (Cmjohnson) a:Jclark-ctr [17:59:03] (PS2) Legoktm: package_builder: Add hook to stop rebuilding man-db [puppet] - https://gerrit.wikimedia.org/r/732383 (https://phabricator.wikimedia.org/T276632) [17:59:05] (PS2) Legoktm: [WIP] package_builder: Refactor PHP hook into a template [puppet] - https://gerrit.wikimedia.org/r/732098 [17:59:22] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (Cmjohnson) The part shipped, hopefully arrives today [17:59:25] (CR) Samuel (WMF): maintain-views.yaml: Restrict `localuser` table to prevent disclosure (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732740 (https://phabricator.wikimedia.org/T292594) (owner: Samuel (WMF)) [17:59:38] (PS1) Jbond: kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers [puppet] - https://gerrit.wikimedia.org/r/732700 (https://phabricator.wikimedia.org/T293439) [17:59:40] (PS1) Herron: Revert "Revert "kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers"" [puppet] - https://gerrit.wikimedia.org/r/732701 [17:59:55] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31830/console" [puppet] - https://gerrit.wikimedia.org/r/732098 (owner: Legoktm) [18:00:04] RoanKattouw and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T1800). [18:00:04] No Gerrit patches in the queue for this window AFAICS. 
[18:01:05] SRE, ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (Dzahn) p:Lowest→Low I think we can live without it and it's right to lower prio. ACK [18:01:44] (PS3) Legoktm: package_builder: Refactor PHP hook into a template [puppet] - https://gerrit.wikimedia.org/r/732098 [18:02:46] SRE, Infrastructure-Foundations, Packaging, Patch-For-Review: Disable man-db in pbuilder in package_builder on deneb - https://phabricator.wikimedia.org/T276632 (Legoktm) a:Legoktm [18:04:21] (CR) Dzahn: [C: +2] global: remove all "filtertags" lines [puppet] - https://gerrit.wikimedia.org/r/732439 (owner: Dzahn) [18:04:27] (PS3) Dzahn: global: remove all "filtertags" lines [puppet] - https://gerrit.wikimedia.org/r/732439 [18:05:13] going to merge a puppet change that touches a LOT of files at once but just removes a special comment from each of them. that "filtertags" stuff (if you ever wondered) isn't doing anything anymore [18:05:21] stealing the B&C for a security bugfix [18:05:31] (unless mutante wants me to wait?) 
[18:05:56] nah, you can do it [18:06:02] thanks [18:06:02] jouncebot: now [18:06:02] For the next 0 hour(s) and 53 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T1800) [18:06:25] (CR) Jbond: package_builder: Add hook for building PHP 7.4 packages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732097 (https://phabricator.wikimedia.org/T293449) (owner: Legoktm) [18:08:06] (Abandoned) Dzahn: apt: use ensure_resource for exec[apt-get update] to avoid duplicate defs [puppet] - https://gerrit.wikimedia.org/r/732391 (https://phabricator.wikimedia.org/T293912) (owner: Dzahn) [18:08:42] (Abandoned) Dzahn: pbuilder: test edit for T293912 [puppet] - https://gerrit.wikimedia.org/r/732393 (https://phabricator.wikimedia.org/T293912) (owner: Dzahn) [18:08:55] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:10:47] (testing the secpatch, conversation happens in _security) [18:11:57] ack, ty. i'm actually waiting, all yours [18:13:06] (PS2) Herron: kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers [puppet] - https://gerrit.wikimedia.org/r/732700 (https://phabricator.wikimedia.org/T293439) (owner: Jbond) [18:14:05] thanks mutante, I'll ping you when done then [18:14:59] (PS3) Herron: kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers [puppet] - https://gerrit.wikimedia.org/r/732700 (https://phabricator.wikimedia.org/T293439) (owner: Jbond) [18:17:58] SRE, MW-on-K8s, serviceops, MW-1.37-notes (1.37.0-wmf.20; 2021-08-23), and 2 others: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (Legoktm) Thanks to other activity, I realized my patch didn't cover MultiHttpClient, which is used by Echo for cros... 
[18:19:55] (CR) Herron: "cloud PCC: https://puppet-compiler.wmflabs.org/compiler1001/31833/" [puppet] - https://gerrit.wikimedia.org/r/732700 (https://phabricator.wikimedia.org/T293439) (owner: Jbond) [18:20:25] (PS1) Dzahn: conftool-data: remove mw2280.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/732756 (https://phabricator.wikimedia.org/T290708) [18:22:27] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (MGerlach) [18:24:29] (PS2) Dzahn: cumin: drop tor_relay from aliases [puppet] - https://gerrit.wikimedia.org/r/732410 (https://phabricator.wikimedia.org/T243288) [18:25:30] (Abandoned) Herron: Revert "Revert "kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers"" [puppet] - https://gerrit.wikimedia.org/r/732701 (owner: Herron) [18:25:53] (CR) Dzahn: [C: +2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/566687/" [puppet] - https://gerrit.wikimedia.org/r/732410 (https://phabricator.wikimedia.org/T243288) (owner: Dzahn) [18:27:44] SRE, ops-codfw, Patch-For-Review: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (Dzahn) a:Papaul→Dzahn [18:27:50] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Papaul) [18:28:00] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (Effeietsanders) [18:28:22] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (Effeietsanders) [18:29:02] SRE, SRE-Access-Requests: Requesting access to 
analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (Effeietsanders) Thanks @MGerlach . I added the information. [18:31:33] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn) >>! In T283582#7447115, @cmooney wrote: > There are many more in eqiad, bu... [18:32:52] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn) @cmooney many thanks for the txt :) [18:36:10] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn) So the ones alerting in eqiad are one case of 2.30.30.30 and one case of "... 
[18:36:48] ACKNOWLEDGEMENT - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582#7449164 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:36:48] ACKNOWLEDGEMENT - SSH on kubernetes1003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582#7449164 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:38:13] ACKNOWLEDGEMENT - Device not healthy -SMART- on ms-be2028 is CRITICAL: cluster=swift device=None instance=ms-be2028 job=node site=codfw daniel_zahn https://phabricator.wikimedia.org/T294001 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2028&var-datasource=codfw+prometheus/ops [18:38:13] ACKNOWLEDGEMENT - Disk space on ms-be2028 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdc1 is not accessible: Input/output error daniel_zahn https://phabricator.wikimedia.org/T294001 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2028&var-datasource=codfw+prometheus/ops [18:38:36] SRE, SRE-swift-storage, ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T294001 (Dzahn) [18:41:26] (CR) Herron: kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732700 (https://phabricator.wikimedia.org/T293439) (owner: Jbond) [18:41:36] SRE, Thumbor, Wikimedia-SVG-rendering, Patch-For-Review, User-notice: Replace Liberation 1 fonts with Liberation 2 for svg rendering - https://phabricator.wikimedia.org/T253600 (Quiddity) Hi @AntiCompositeNumber. If I understand correctly, this should be included in the next issue of Tech New... 
[18:41:58] (PS4) Herron: kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers [puppet] - https://gerrit.wikimedia.org/r/732700 (https://phabricator.wikimedia.org/T293439) (owner: Jbond) [18:44:35] (PS1) Legoktm: Update $wgTimelineFonts for new path to unifont in Shellbox container [mediawiki-config] - https://gerrit.wikimedia.org/r/732761 (https://phabricator.wikimedia.org/T293050) [18:44:49] SRE, SRE-swift-storage, ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T294001 (Dzahn) p:Triage→Medium @Papaul It reports another failed disk in a RAID on a swift machine but this is HP and I am not even sure I see in the output (same as above) what I would have to... [18:47:05] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (Dzahn) Stalled→In progress [18:47:22] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (Dzahn) a:DAbad→Dzahn Alright, thank you. 
Will proceed :) [18:50:16] (PS5) Herron: kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers [puppet] - https://gerrit.wikimedia.org/r/732700 (https://phabricator.wikimedia.org/T293439) (owner: Jbond) [18:51:33] (CR) jerkins-bot: [V: -1] kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers [puppet] - https://gerrit.wikimedia.org/r/732700 (https://phabricator.wikimedia.org/T293439) (owner: Jbond) [18:53:17] RECOVERY - Check systemd state on dumpsdata1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:53:26] !log dumpsdata1003 - sudo systemctl reset-failed to clear Icinga alert about failed cleanup_tmpdumps.service [18:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:44] !log Deploy security patch for T285116 (wmf.4, wmf.5) [18:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:10] mutante: I'm done [18:54:25] urbanecm: ACK! thanks [18:58:54] (PS6) Herron: kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers [puppet] - https://gerrit.wikimedia.org/r/732700 (https://phabricator.wikimedia.org/T293439) (owner: Jbond) [19:00:04] hashar and dancy: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T1900). 
[19:02:53] (PS7) Jbond: kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers [puppet] - https://gerrit.wikimedia.org/r/732700 (https://phabricator.wikimedia.org/T293439) [19:02:55] (PS1) Jbond: P:base: move profile::rsyslog::kafka_shipper to production only [puppet] - https://gerrit.wikimedia.org/r/732764 [19:03:52] (CR) jerkins-bot: [V: -1] kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers [puppet] - https://gerrit.wikimedia.org/r/732700 (https://phabricator.wikimedia.org/T293439) (owner: Jbond) [19:07:27] !log ebernhardson@deploy1002 Started deploy [wdqs/wdqs@b2912b7]: (no justification provided) [19:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:36] !log ebernhardson@deploy1002 Finished deploy [wdqs/wdqs@b2912b7]: (no justification provided) (duration: 00m 08s) [19:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:58] huh, that was using --dry-run, didn't expect it to log [19:09:01] (PS1) PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/732765 [19:09:36] !log ebernhardson@deploy1002 Started deploy [wdqs/wdqs@b2912b7]: deploy 0.3.90, incl oauth, to wcqs [19:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:00] !log ebernhardson@deploy1002 Finished deploy [wdqs/wdqs@b2912b7]: deploy 0.3.90, incl oauth, to wcqs (duration: 00m 23s) [19:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:18] (PS8) Jbond: kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers [puppet] - https://gerrit.wikimedia.org/r/732700 (https://phabricator.wikimedia.org/T293439) [19:12:49] (PS1) PipelineBot: shellbox: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/732766 [19:16:22] SRE, Thumbor, Wikimedia-SVG-rendering, Patch-For-Review, 
User-notice: Replace Liberation 1 fonts with Liberation 2 for svg rendering - https://phabricator.wikimedia.org/T253600 (Legoktm) >>! In T253600#7449199, @Quiddity wrote: > Hi @AntiCompositeNumber. If I understand correctly, this should... [19:17:05] (PS1) PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/732768 [19:18:12] (CR) Legoktm: [C: +2] shellbox-timeline: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/732768 (owner: PipelineBot) [19:20:45] SRE, Analytics, Discovery, Event-Platform, Platform Team Workboards (Clinic Duty Team): Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (Ottomata) This happened today, somehow there were recentchange events with timestamps from around 2007 in... [19:22:30] (Merged) jenkins-bot: shellbox-timeline: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/732768 (owner: PipelineBot) [19:23:41] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' . 
[19:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:52] (CR) Jbond: [C: +2] P:base: move profile::rsyslog::kafka_shipper to production only [puppet] - https://gerrit.wikimedia.org/r/732764 (owner: Jbond) [19:24:54] (PS1) Ottomata: eventgate-main - Bump image version to get maps tiles.change fix [deployment-charts] - https://gerrit.wikimedia.org/r/732771 (https://phabricator.wikimedia.org/T293366) [19:25:12] (CR) Jbond: [C: +1] "pcc: https://puppet-compiler.wmflabs.org/compiler1002/31841/" [puppet] - https://gerrit.wikimedia.org/r/732700 (https://phabricator.wikimedia.org/T293439) (owner: Jbond) [19:26:12] (CR) Jbond: [V: +1 C: +2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31843/console" [puppet] - https://gerrit.wikimedia.org/r/732764 (owner: Jbond) [19:26:14] (CR) Dzahn: "Or should we ask for replacement, Wolfgang?" [puppet] - https://gerrit.wikimedia.org/r/732756 (https://phabricator.wikimedia.org/T290708) (owner: Dzahn) [19:30:20] (CR) Ottomata: [C: +2] eventgate-main - Bump image version to get maps tiles.change fix [deployment-charts] - https://gerrit.wikimedia.org/r/732771 (https://phabricator.wikimedia.org/T293366) (owner: Ottomata) [19:30:29] (CR) Herron: [C: +1] "thank you for this, very much in support of moving towards sensible defaults" [puppet] - https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) (owner: Jbond) [19:30:49] (CR) Herron: [C: +2] kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers [puppet] - https://gerrit.wikimedia.org/r/732700 (https://phabricator.wikimedia.org/T293439) (owner: Jbond) [19:31:07] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[19:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:24] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (Dzahn) a:Dzahn Thanks @Ottomata for answering that right away :) [19:33:26] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (Dzahn) [19:33:35] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (Dzahn) Open→In progress [19:35:41] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' . [19:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:11] (PS2) Legoktm: Update $wgTimelineFonts for new path to unifont in Shellbox container [mediawiki-config] - https://gerrit.wikimedia.org/r/732761 (https://phabricator.wikimedia.org/T293050) [19:36:13] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (Dzahn) [19:36:18] (CR) Legoktm: [C: +2] Update $wgTimelineFonts for new path to unifont in Shellbox container [mediawiki-config] - https://gerrit.wikimedia.org/r/732761 (https://phabricator.wikimedia.org/T293050) (owner: Legoktm) [19:36:52] SRE, SRE-swift-storage, ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T294001 (Papaul) a:Papaul [19:37:07] (Merged) jenkins-bot: Update $wgTimelineFonts for new path to unifont in Shellbox container [mediawiki-config] - https://gerrit.wikimedia.org/r/732761 (https://phabricator.wikimedia.org/T293050) (owner: Legoktm) [19:38:12] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-timeline' for 
release 'main' . [19:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:32] SRE, SRE-swift-storage, ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T294001 (Papaul) @Dzahn thank you. Yes there is a failed disk. Since the server is out of warranty, i will check when i am next onsite to replace the disk with one disk from a decom server. ` -Logical D... [19:42:41] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Update $wgTimelineFonts for new path to unifont in Shellbox container (T293050) (duration: 00m 55s) [19:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:47] T293050: Characters are missing in the chart of EasyTimeline on zhwiki - https://phabricator.wikimedia.org/T293050 [19:43:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:04] SRE, ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (Papaul) ` papaul@cr2-eqdfw> show chassis environment Power PEM 0 OK 35 degrees C / 95 degrees F PEM 1 OK 32 degrees C /... [19:51:20] SRE, ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (Papaul) ` PEM 0 status: State Online Airflow Front to Back Temperature OK 35 degrees C / 95 degrees F Temperature... [19:53:42] SRE, ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (Papaul) It was maybe just a temporary power feed issue. 
I will check the router again next week and see if all looks good. [20:02:09] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [20:02:09] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [20:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:11] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [20:04:11] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [20:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:37] (03CR) 10Andrew Bogott: OpenStack: Always use the tls ports (25000, 25357) for openstack services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) (owner: 10Andrew Bogott) [20:38:09] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Jclark-ctr) New raid card has been installed @Cmjohnson [20:52:25] !log ebernhardson@deploy1002 Started deploy [wdqs/wdqs@1309a97] (wcqs): dry run wcqs deploy [20:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:01] !log ebernhardson@deploy1002 Finished deploy [wdqs/wdqs@1309a97] (wcqs): dry run wcqs deploy (duration: 00m 35s) [20:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:23] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:53:56] !log ebernhardson@deploy1002 Started deploy [wdqs/wdqs@1309a97] (wcqs): dry run wcqs deploy [20:54:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:09] !log ebernhardson@deploy1002 Finished deploy [wdqs/wdqs@1309a97] (wcqs): dry run wcqs deploy (duration: 00m 13s) [20:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:29] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [21:00:43] (03PS6) 10Andrew Bogott: OpenStack: Always use the tls ports (25000, 25357) for openstack services [puppet] - 10https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) [21:00:45] (03PS1) 10Andrew Bogott: Openstack Keystone: pass in the auth_uri from the keystone profile [puppet] - 10https://gerrit.wikimedia.org/r/732789 [21:01:35] (03CR) 10jerkins-bot: [V: 04-1] OpenStack: Always use the tls ports (25000, 25357) for openstack services [puppet] - 10https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) (owner: 10Andrew Bogott) [21:09:33] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:14:35] (03PS2) 10Andrew Bogott: Openstack Keystone: pass in the auth_uri from the keystone profile [puppet] - 10https://gerrit.wikimedia.org/r/732789 [21:14:37] (03PS7) 10Andrew Bogott: OpenStack: Always use the tls ports (25000, 25357) for openstack services [puppet] - 10https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) [21:15:49] (03PS1) 10Legoktm: Update README for purpose of this repository, remove unused fonts [mediawiki-config/fonts] - 10https://gerrit.wikimedia.org/r/732792 [21:17:56] (03PS5) 10Legoktm: Update configuration related to disabling Score functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715194 [21:18:58] (03CR) 10Legoktm: Update configuration related to disabling Score functionality (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/715194 (owner: 10Legoktm) [21:19:07] (03PS6) 10Legoktm: Update configuration related to disabling Score functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715194 [21:21:43] (03PS3) 10Andrew Bogott: Openstack Keystone: pass in the auth_uri from the keystone profile [puppet] - 10https://gerrit.wikimedia.org/r/732789 [21:21:45] (03PS8) 10Andrew Bogott: OpenStack: Always use the tls ports (25000, 25357) for openstack services [puppet] - 10https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) [21:22:18] (03CR) 10jerkins-bot: [V: 04-1] Openstack Keystone: pass in the auth_uri from the keystone profile [puppet] - 10https://gerrit.wikimedia.org/r/732789 (owner: 10Andrew Bogott) [21:22:32] (03CR) 10jerkins-bot: [V: 04-1] OpenStack: Always use the tls ports (25000, 25357) for openstack services [puppet] - 10https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) (owner: 10Andrew Bogott) [21:23:56] (03PS4) 10Andrew Bogott: Openstack Keystone: pass in the auth_uri from the keystone profile [puppet] - 10https://gerrit.wikimedia.org/r/732789 [21:23:58] (03PS9) 10Andrew Bogott: OpenStack: Always use the tls ports (25000, 25357) for openstack services [puppet] - 10https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) [21:24:27] (03CR) 10jerkins-bot: [V: 04-1] Openstack Keystone: pass in the auth_uri from the keystone profile [puppet] - 10https://gerrit.wikimedia.org/r/732789 (owner: 10Andrew Bogott) [21:24:46] (03CR) 10jerkins-bot: [V: 04-1] OpenStack: Always use the tls ports (25000, 25357) for openstack services [puppet] - 10https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) (owner: 10Andrew Bogott) [21:29:21] (03PS5) 10Andrew Bogott: Openstack Keystone: pass in the auth_uri from the keystone profile [puppet] - 10https://gerrit.wikimedia.org/r/732789 [21:29:23] (03PS10) 10Andrew Bogott: OpenStack: Always use the tls 
ports (25000, 25357) for openstack services [puppet] - 10https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) [21:31:38] (03CR) 10jerkins-bot: [V: 04-1] OpenStack: Always use the tls ports (25000, 25357) for openstack services [puppet] - 10https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) (owner: 10Andrew Bogott) [21:31:42] (03PS6) 10Andrew Bogott: Openstack Keystone: pass in the auth_uri from the keystone profile [puppet] - 10https://gerrit.wikimedia.org/r/732789 [21:31:44] (03PS11) 10Andrew Bogott: OpenStack: Always use the tls ports (25000, 25357) for openstack services [puppet] - 10https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) [21:41:53] !log ebernhardson@deploy1002 Started deploy [wdqs/wdqs@13448f1] (wcqs): Deploy 0.3.90 to WCQS [21:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:40] !log ebernhardson@deploy1002 Finished deploy [wdqs/wdqs@13448f1] (wcqs): Deploy 0.3.90 to WCQS (duration: 02m 47s) [21:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:53] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Keystone: pass in the auth_uri from the keystone profile [puppet] - 10https://gerrit.wikimedia.org/r/732789 (owner: 10Andrew Bogott) [22:15:38] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (10Dzahn) confirmed on "NDA and MOU: Volunteer accounts with Server and LDAP-level access" doc [22:16:19] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (10Dzahn) [22:18:09] (03PS1) 10Ebernhardson: query_service: Add new oauth related configuration [puppet] - 10https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006) [22:18:43] (03CR) 10jerkins-bot: [V: 04-1] 
query_service: Add new oauth related configuration [puppet] - 10https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [22:22:00] (03PS2) 10Ebernhardson: query_service: Add new oauth related configuration [puppet] - 10https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006) [22:22:40] (03CR) 10jerkins-bot: [V: 04-1] query_service: Add new oauth related configuration [puppet] - 10https://gerrit.wikimedia.org/r/732801 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [22:38:11] (03PS1) 10Dzahn: admin: add shell account for effeietsanders, add to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/732805 (https://phabricator.wikimedia.org/T294038) [22:39:34] (03CR) 10Dzahn: [C: 03+2] "expiry date and contact taken from Google doc with NDA info, approvals on ticket" [puppet] - 10https://gerrit.wikimedia.org/r/732805 (https://phabricator.wikimedia.org/T294038) (owner: 10Dzahn) [22:42:30] !log T294038 [krb1001:~] $ sudo manage_principals.py create effeietsanders ... Principal successfully created. . 
.Successfully sent email [22:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:34] (03PS1) 10Dzahn: admin: add 'krb: present' to effeietsanders [puppet] - 10https://gerrit.wikimedia.org/r/732826 (https://phabricator.wikimedia.org/T294038) [22:45:12] (03CR) 10Dzahn: [C: 03+2] admin: add 'krb: present' to effeietsanders [puppet] - 10https://gerrit.wikimedia.org/r/732826 (https://phabricator.wikimedia.org/T294038) (owner: 10Dzahn) [22:47:24] (03PS1) 10Cwhite: profile: add enable_relay flag to statsd exporter profile [puppet] - 10https://gerrit.wikimedia.org/r/732827 (https://phabricator.wikimedia.org/T240685) [22:47:26] (03PS1) 10Cwhite: role: install statsd_exporter on canary appservers [puppet] - 10https://gerrit.wikimedia.org/r/732828 (https://phabricator.wikimedia.org/T240685) [22:47:28] (03PS1) 10Cwhite: hiera: remove duplicate fpm_workers_multipier key [puppet] - 10https://gerrit.wikimedia.org/r/732829 [22:50:20] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (10Dzahn) access granted. shell user has been created on bast1003.wikimedia.org , puppet will have created it on all other relevan... [22:52:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (10Dzahn) 05In progress→03Resolved @MGerlach Done! I took the expiry date of 2022-04-15 from the NDA doc and you are the expiry... [23:00:04] brennen: May I have your attention please! UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T2300) [23:00:05] James_F: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:28] Hey hey. [23:00:30] I can deploy. [23:00:47] (03PS6) 10Jforrester: Add new config names for CentralAuth denylist controls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720362 (https://phabricator.wikimedia.org/T277932) [23:00:51] (03CR) 10Jforrester: [C: 03+2] Add new config names for CentralAuth denylist controls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720362 (https://phabricator.wikimedia.org/T277932) (owner: 10Jforrester) [23:01:09] hey James_F we have a backport trainee, could we deploy your patches please? [23:01:22] thcipriani: Oh! Sure, happy to defer instead. [23:01:26] Go for it. [23:01:49] thcipriani: The composer dev tools one is a bit messy so feel free to leave that for me. [23:02:17] (03Merged) 10jenkins-bot: Add new config names for CentralAuth denylist controls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720362 (https://phabricator.wikimedia.org/T277932) (owner: 10Jforrester) [23:06:44] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10Dzahn) WMF user confirmed per https://meta.wikimedia.org/wiki/Special:Log?type=newusers&user=&page=EChetty+%28WMF%29&wpdate=&tagfilter=&subtype= [23:07:00] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10Dzahn) [23:07:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:54] James_F: the patch you merged is live on mwdebug1002: could you test what needs testing [23:08:10] thcipriani: If the site's up and there are no fatals, it's good. 
[23:09:26] (03PS1) 10BryanDavis: toolhub: Bump container version to 2021-10-21-224232-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/732832 (https://phabricator.wikimedia.org/T294055) [23:10:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:07] cool James_F going live [23:15:12] (03CR) 10Cwhite: [C: 03+2] profile: fork kibana profile into opensearch::dashboards [puppet] - 10https://gerrit.wikimedia.org/r/721391 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [23:15:20] (03PS8) 10Cwhite: profile: fork kibana profile into opensearch::dashboards [puppet] - 10https://gerrit.wikimedia.org/r/721391 (https://phabricator.wikimedia.org/T288618) [23:15:58] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:720362|Add new config names for CentralAuth denylist controls (T277932)]] (duration: 00m 55s) [23:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:05] T277932: Address Voice and Tone issues in CentralAuth - https://phabricator.wikimedia.org/T277932 [23:17:19] ^ IS.php live, I'll sync the tests, too just for completeness sake [23:17:25] Thanks! 
[23:17:41] (03PS2) 10Jforrester: CommonSettings: Drop legacy CentralAuth config flag, never read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730946 (https://phabricator.wikimedia.org/T277932) [23:18:32] !log thcipriani@deploy1002 Synchronized tests/multiversion/StaticSettingsTest.php: Config: [[gerrit:720362|Add new config names for CentralAuth denylist controls (T277932)]] (duration: 00m 55s) [23:18:34] (03CR) 10Thcipriani: [C: 03+2] "backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730946 (https://phabricator.wikimedia.org/T277932) (owner: 10Jforrester) [23:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:41] (03Merged) 10jenkins-bot: CommonSettings: Drop legacy CentralAuth config flag, never read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730946 (https://phabricator.wikimedia.org/T277932) (owner: 10Jforrester) [23:20:18] ^ James_F is this another one where if nothing explodes there's no problem? [23:20:32] thcipriani: Yeah, there's no trace of code reading this anywhere. [23:20:41] thcipriani: So it should Just Work™. [23:21:40] famous last words :D [23:21:42] ok, going [23:23:30] (03PS6) 10Jforrester: build: Upgrade composer testing stack to latest as used Wikimedia-wide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730038 (owner: 10Zabe) [23:24:15] (03PS1) 10Dwisehaupt: Add frpm1002, frauth1002, pay-lvs1003, pay-lvs1004 [dns] - 10https://gerrit.wikimedia.org/r/732834 (https://phabricator.wikimedia.org/T289812) [23:24:49] (03PS3) 10Juan90264: Enable talk for mobile users on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732705 [23:25:00] !log thcipriani@deploy1002 Synchronized wmf-config: Config: [[gerrit:730946|CommonSettings: Drop legacy CentralAuth config flag, never read (T277932)]] (duration: 00m 55s) [23:25:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:07] T277932: Address Voice and Tone issues in CentralAuth - https://phabricator.wikimedia.org/T277932 [23:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:17] (03CR) 10Jforrester: [C: 03+1] build: Upgrade composer testing stack to latest as used Wikimedia-wide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730038 (owner: 10Zabe) [23:25:19] (03PS4) 10Juan90264: Enable talk for mobile users on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732705 (https://phabricator.wikimedia.org/T293946) [23:25:52] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2021-10-21-224232-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/732832 (https://phabricator.wikimedia.org/T294055) (owner: 10BryanDavis) [23:26:44] (03PS1) 10Dzahn: admin: create user for echetty, privateusers-data, NO ssh, NO kerberos [puppet] - 10https://gerrit.wikimedia.org/r/732835 (https://phabricator.wikimedia.org/T293455) [23:26:59] James_F: could you take over for the last one for me? [23:27:06] thcipriani: Sure! [23:27:08] <3 [23:27:24] https://deploy-commands.toolforge.org/bacc/730038 is hilarious. [23:27:28] (03CR) 10Jforrester: [C: 03+2] build: Upgrade composer testing stack to latest as used Wikimedia-wide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730038 (owner: 10Zabe) [23:27:31] (03CR) 10jerkins-bot: [V: 04-1] admin: create user for echetty, privateusers-data, NO ssh, NO kerberos [puppet] - 10https://gerrit.wikimedia.org/r/732835 (https://phabricator.wikimedia.org/T293455) (owner: 10Dzahn) [23:27:40] (03CR) 10Dzahn: "going by similar case https://phabricator.wikimedia.org/T283190" [puppet] - 10https://gerrit.wikimedia.org/r/732835 (https://phabricator.wikimedia.org/T293455) (owner: 10Dzahn) [23:27:49] thanks for taking care of that one :) [23:28:07] We might want to point LibUp at prod config. 
[23:28:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:28:15] Not with merge rights, of course. :-) [23:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:28] (03Merged) 10jenkins-bot: build: Upgrade composer testing stack to latest as used Wikimedia-wide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730038 (owner: 10Zabe) [23:30:10] (03PS2) 10Dzahn: admin: create user for echetty, privateusers-data, NO ssh, NO kerberos [puppet] - 10https://gerrit.wikimedia.org/r/732835 (https://phabricator.wikimedia.org/T293455) [23:30:22] brennen|afk: Hello? Online? [23:31:08] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2021-10-21-224232-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/732832 (https://phabricator.wikimedia.org/T294055) (owner: 10BryanDavis) [23:31:14] the afk part makes it unlikely, but you never know here :) [23:31:41] Okay [23:31:58] jouncebot: now [23:31:58] For the next 0 hour(s) and 28 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211021T2300) [23:32:00] 10SRE, 10Traffic-Icebox, 10Sustainability (Incident Followup): upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517 (10Krinkle) 05Open→03Resolved a:03Krinkle [23:32:39] (03CR) 10Dzahn: [C: 03+2] admin: create user for echetty, privateusers-data, NO ssh, NO kerberos [puppet] - 10https://gerrit.wikimedia.org/r/732835 (https://phabricator.wikimedia.org/T293455) (owner: 10Dzahn) [23:32:42] !log bd808@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' . 
[23:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:05] !log jforrester@deploy1002 Synchronized wmf-config: Config: [[gerrit:730038|build: Upgrade composer testing stack to latest as used Wikimedia-wide]] (duration: 00m 55s) [23:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:26] Any online deployers? [23:34:51] !log jforrester@deploy1002 Synchronized docroot/noc/conf/index.php: Config: [[gerrit:730038|build: Upgrade composer testing stack to latest as used Wikimedia-wide]] (duration: 00m 54s) [23:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:04] !log jforrester@deploy1002 Synchronized multiversion/: Config: [[gerrit:730038|build: Upgrade composer testing stack to latest as used Wikimedia-wide]] (duration: 00m 55s) [23:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:08] !log jforrester@deploy1002 Synchronized w/static.php: Config: [[gerrit:730038|build: Upgrade composer testing stack to latest as used Wikimedia-wide]] (duration: 00m 54s) [23:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:31] Juan_90264: I've just finished deploying. [23:37:34] Juan_90264: Why? 
[23:38:10] !log jforrester@deploy1002 Synchronized w/fatal-error.php: Config: [[gerrit:730038|build: Upgrade composer testing stack to latest as used Wikimedia-wide]] (duration: 00m 54s) [23:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:17] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/732705 [23:38:30] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10Dzahn) 05In progress→03Resolved access granted! @EChetty @DAbad This is done. This is the "analytics-privatedata-users (no kerberos, no ssh)" o... [23:39:22] Juan_90264: https://phabricator.wikimedia.org/T293946#7449599 says the team are going to discuss it and think about whether they can implement. [23:39:41] Juan_90264: Why did you write a patch for it so quickly? I don't think that would be appropriate to deploy. [23:40:13] (03CR) 10Jforrester: [C: 04-1] "Per the task, the team are going to look at this first: T293946#7449599" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732705 (https://phabricator.wikimedia.org/T293946) (owner: 10Juan90264) [23:40:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:28] Juan_90264: Also your edit to the page was to next week's deployment slot; did you mean that? [23:46:02] OK, I'm declaring config window closed. Thanks all. 
[23:47:21] (03PS1) 10Zabe: flaggedrevs: drop legacy FlaggedRevs config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732836 [23:57:34] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/31851/data.wikipathways.eqiad1.wikimedia.cloud/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/731183 (owner: 10Dzahn) [23:58:07] (03CR) 10Dzahn: [V: 03+1] "this is used by https://openstack-browser.toolforge.org/puppetclass/role::simplelamp2" [puppet] - 10https://gerrit.wikimedia.org/r/731183 (owner: 10Dzahn) [23:58:55] (03CR) 10Dzahn: "does not appear to be used, unlike simplelamp2 (https://openstack-browser.toolforge.org/puppetclass/)" [puppet] - 10https://gerrit.wikimedia.org/r/731184 (owner: 10Dzahn)