[00:00:05] RoanKattouw and Urbanecm: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220119T0000). Please do the needful.
[00:00:05] SCardenasM, eigyan, nn1l2, zabe, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:09] hi
[00:00:19] o/
[00:00:22] Hello!
[00:00:26] the window is quite full
[00:01:09] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:39] I can deploy
[00:02:44] (y)
[00:02:45] (PS2) Catrope: fawiki: Exempt userspaces from being indexed by search engines [mediawiki-config] - https://gerrit.wikimedia.org/r/755018 (https://phabricator.wikimedia.org/T299363) (owner: 4nn1l2)
[00:02:49] (CR) Catrope: [C: +2] fawiki: Exempt userspaces from being indexed by search engines [mediawiki-config] - https://gerrit.wikimedia.org/r/755018 (https://phabricator.wikimedia.org/T299363) (owner: 4nn1l2)
[00:03:32] (Merged) jenkins-bot: fawiki: Exempt userspaces from being indexed by search engines [mediawiki-config] - https://gerrit.wikimedia.org/r/755018 (https://phabricator.wikimedia.org/T299363) (owner: 4nn1l2)
[00:04:36] nn1l2: Your fawiki patch is on mwdebug1002, please test
[00:04:50] give me a min
[00:05:11] (PS3) Catrope: azwiki: Change alias Q to QA for the draft namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/755026 (https://phabricator.wikimedia.org/T299332) (owner: 4nn1l2)
[00:05:16] (CR) Catrope: [C: +2] azwiki: Change alias Q to QA for the draft namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/755026 (https://phabricator.wikimedia.org/T299332) (owner: 4nn1l2)
[00:05:45] LGTM
[00:06:17] (Merged) jenkins-bot: azwiki: Change alias Q to QA for the draft namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/755026 (https://phabricator.wikimedia.org/T299332) (owner: 4nn1l2)
[00:07:35] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7385 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[00:07:57] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:755018|fawiki: Exempt userspaces from being indexed by search engines (T299363)]] (duration: 00m 54s)
[00:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:01] T299363: Exempt userspaces from being indexed by search engines on Farsi Wikipedia - https://phabricator.wikimedia.org/T299363
[00:08:27] nn1l2: Your azwiki patch is ready for testing now
[00:08:37] (CR) Catrope: [C: +2] Revert "commonswiki: Add peerj.com to wgCopyUploadsDomains whitelist" [mediawiki-config] - https://gerrit.wikimedia.org/r/754914 (owner: 4nn1l2)
[00:08:58] mwdebug1002?
[00:09:53] Good to go
[00:14:34] (CR) Cwhite: [C: +1] Revert "profile: move statsd writes to graphite2003" [puppet] - https://gerrit.wikimedia.org/r/754877 (https://phabricator.wikimedia.org/T299383) (owner: Filippo Giunchedi)
[00:14:49] (CR) Cwhite: [C: +1] Revert "graphite: check graphite2003 metrics" [puppet] - https://gerrit.wikimedia.org/r/754876 (https://phabricator.wikimedia.org/T299383) (owner: Filippo Giunchedi)
[00:15:10] (CR) Cwhite: [C: +1] wmnet: move writes to graphite1004 [dns] - https://gerrit.wikimedia.org/r/754875 (https://phabricator.wikimedia.org/T299383) (owner: Filippo Giunchedi)
[00:15:30] (CR) Cwhite: [C: +1] wmnet: move reads to graphite1004 [dns] - https://gerrit.wikimedia.org/r/754874 (https://phabricator.wikimedia.org/T299383) (owner: Filippo Giunchedi)
[00:15:34] (CR) Jdlrobson: [C: +1] Update config for pilot wikis: [mediawiki-config] - https://gerrit.wikimedia.org/r/755038 (https://phabricator.wikimedia.org/T298519) (owner: Clare Ming)
[00:15:49] (CR) Cwhite: [C: +1] Revert "ProductionServices: use graphite2003 for statsd" [mediawiki-config] - https://gerrit.wikimedia.org/r/754879 (https://phabricator.wikimedia.org/T299383) (owner: Filippo Giunchedi)
[00:17:06] (CR) Cwhite: [C: +1] hieradata: use / as miscweb health check [puppet] - https://gerrit.wikimedia.org/r/754881 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[00:19:27] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7102 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[00:21:49] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 11205 MB (31% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[00:28:25] RoanKattouw: still deploying?
[00:28:35] azwiki still not live
[00:29:19] Sorry, IRL distractions happened
[00:29:20] I'm back now
[00:30:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:08] didn't realize I was away for 20 mins, yikes, very sorry everyone
[00:30:26] No problem :)
[00:30:31] (CR) Catrope: [C: +2] Don't use array keys for OOUI [extensions/AbuseFilter] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/754917 (https://phabricator.wikimedia.org/T299463) (owner: Zabe)
[00:30:35] (CR) Catrope: [C: +2] Don't use array keys for OOUI in AbuseFilterViewDiff [extensions/AbuseFilter] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/754918 (https://phabricator.wikimedia.org/T299463) (owner: Zabe)
[00:30:40] (CR) Catrope: [C: +2] Enable wikis to customize the syntax used for replies [extensions/DiscussionTools] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/754915 (https://phabricator.wikimedia.org/T259864) (owner: Bartosz Dziewoński)
[00:30:44] (CR) Catrope: [C: +2] Ensure the marker appears in a reasonable place when replying with a bullet [extensions/DiscussionTools] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/754916 (https://phabricator.wikimedia.org/T259864) (owner: Bartosz Dziewoński)
[00:30:47] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:755026|azwiki: Change alias Q to QA for the draft namespace (T299332)]] (duration: 00m 53s)
[00:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:51] T299332: Add draft namespace on Azerbaijani Wikipedia - https://phabricator.wikimedia.org/T299332
[00:31:10] (PS3) Catrope: Revert "commonswiki: Add peerj.com to wgCopyUploadsDomains whitelist" [mediawiki-config] - https://gerrit.wikimedia.org/r/754914 (owner: 4nn1l2)
[00:31:15] (CR) Catrope: Revert "commonswiki: Add peerj.com to wgCopyUploadsDomains whitelist" [mediawiki-config] - https://gerrit.wikimedia.org/r/754914 (owner: 4nn1l2)
[00:31:19] (CR) Catrope: [C: +2] Revert "commonswiki: Add peerj.com to wgCopyUploadsDomains whitelist" [mediawiki-config] - https://gerrit.wikimedia.org/r/754914 (owner: 4nn1l2)
[00:31:30] SCardenasM +1
[00:32:58] (Merged) jenkins-bot: Revert "commonswiki: Add peerj.com to wgCopyUploadsDomains whitelist" [mediawiki-config] - https://gerrit.wikimedia.org/r/754914 (owner: 4nn1l2)
[00:33:33] nn1l2: commonswiki revert is ready for testing
[00:34:01] (PS2) Catrope: Use namespaced CentralAuthUser [mediawiki-config] - https://gerrit.wikimedia.org/r/752308 (https://phabricator.wikimedia.org/T298840) (owner: Zabe)
[00:34:02] Does it really need test?
[00:34:10] How should I test it?
[00:34:32] (CR) Catrope: [C: +2] Use namespaced CentralAuthUser [mediawiki-config] - https://gerrit.wikimedia.org/r/752308 (https://phabricator.wikimedia.org/T298840) (owner: Zabe)
[00:34:52] Looking at the patch I guess it can go out without, I'll sync it now
[00:35:23] Please see Lucas's comment on Phabricator: https://phabricator.wikimedia.org/T299247#7628032
[00:35:26] Thanks!
[00:35:32] (Merged) jenkins-bot: Use namespaced CentralAuthUser [mediawiki-config] - https://gerrit.wikimedia.org/r/752308 (https://phabricator.wikimedia.org/T298840) (owner: Zabe)
[00:35:56] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754914|Revert "commonswiki: Add peerj.com to wgCopyUploadsDomains whitelist"]] (duration: 00m 54s)
[00:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:31] zabe: Your namespace CentralAuthUser patch is ready for testing, but there's probably nothing to test there either, is there?
[00:36:57] no. I just keep an eye on logstash to see if stuff explodes.
[00:38:05] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[00:38:35] !log catrope@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:752308|Use namespaced CentralAuthUser (T298840)]] (duration: 00m 54s)
[00:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:39] T298840: Use namespaced CentralAuthUser - https://phabricator.wikimedia.org/T298840
[00:38:40] (PS2) Catrope: Change TheWikipediaLibrary editcount [mediawiki-config] - https://gerrit.wikimedia.org/r/754054 (https://phabricator.wikimedia.org/T288070) (owner: Scardenasmolinar)
[00:38:50] (CR) Catrope: [C: +2] Change TheWikipediaLibrary editcount [mediawiki-config] - https://gerrit.wikimedia.org/r/754054 (https://phabricator.wikimedia.org/T288070) (owner: Scardenasmolinar)
[00:43:09] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7016 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[00:43:18] (Merged) jenkins-bot: Change TheWikipediaLibrary editcount [mediawiki-config] - https://gerrit.wikimedia.org/r/754054 (https://phabricator.wikimedia.org/T288070) (owner: Scardenasmolinar)
[00:43:35] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:19] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:34] SCardenasM: Your change is on mwdebug1002, please test (or tell me to go ahead without testing)
[00:45:12] RoanKattouw: Thanks!
[00:45:39] (PS5) Catrope: [wmf-config] Deploy the cawiki test safety survey to production. [mediawiki-config] - https://gerrit.wikimedia.org/r/753543 (https://phabricator.wikimedia.org/T296657) (owner: Eigyan)
[00:47:46] RoanKattouw: LGTM!
[00:48:43] (Merged) jenkins-bot: Don't use array keys for OOUI [extensions/AbuseFilter] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/754917 (https://phabricator.wikimedia.org/T299463) (owner: Zabe)
[00:48:45] (CR) Catrope: [C: +2] [wmf-config] Deploy the cawiki test safety survey to production. [mediawiki-config] - https://gerrit.wikimedia.org/r/753543 (https://phabricator.wikimedia.org/T296657) (owner: Eigyan)
[00:48:47] (Merged) jenkins-bot: Don't use array keys for OOUI in AbuseFilterViewDiff [extensions/AbuseFilter] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/754918 (https://phabricator.wikimedia.org/T299463) (owner: Zabe)
[00:48:49] (Merged) jenkins-bot: Enable wikis to customize the syntax used for replies [extensions/DiscussionTools] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/754915 (https://phabricator.wikimedia.org/T259864) (owner: Bartosz Dziewoński)
[00:48:51] (Merged) jenkins-bot: Ensure the marker appears in a reasonable place when replying with a bullet [extensions/DiscussionTools] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/754916 (https://phabricator.wikimedia.org/T259864) (owner: Bartosz Dziewoński)
[00:49:28] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754054|Change TheWikipediaLibrary editcount (T288070)]] (duration: 00m 53s)
[00:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:33] T288070: Deploy The Wikipedia Library Echo notification with 50,000 edit count threshold - https://phabricator.wikimedia.org/T288070
[00:49:35] (Merged) jenkins-bot: [wmf-config] Deploy the cawiki test safety survey to production. [mediawiki-config] - https://gerrit.wikimedia.org/r/753543 (https://phabricator.wikimedia.org/T296657) (owner: Eigyan)
[00:51:31] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: wdqs1010, build2001, labstore1006, miscweb1002, labstore1007 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[00:54:50] eigyan: Your cawiki patch is on mwdebug1002, please test there and give me a thumbs up or down
[00:55:58] RoanKattouw thank you
[00:57:24] zabe: Your AbuseFilter change is now finally on mwdebug1002 (Jenkins took 20 minutes), please test
[00:57:46] MatmaRex: And your DiscussionTools changes too. I'm holding back the DT config change until after those go out
[00:58:21] thanks, looking
[00:58:49] RoanKattouw: the abusefilter patches lgtm, no more internal errors showing up
[00:59:52] RoanKattouw: the DT changes are no-ops without the config change, so i can't test much here
[01:00:04] i guess i can confirm i see the new code
[01:00:10] OK, I'll merge the config change and then we can test them all together
[01:00:26] (PS4) Catrope: DiscussionTools: Use bullet indentation on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/753192 (https://phabricator.wikimedia.org/T259864) (owner: Bartosz Dziewoński)
[01:00:43] !log catrope@deploy1002 Synchronized php-1.38.0-wmf.18/extensions/AbuseFilter/: Backport: [[gerrit:754917|Don't use array keys for OOUI (T299463)]] and [[gerrit:754918|Don't use array keys for OOUI in AbuseFilterViewDiff (T299463)]] (duration: 00m 54s)
[01:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:48] T299463: Viewing Abusefilter history throws "Error: Cannot unpack array with string keys" - https://phabricator.wikimedia.org/T299463
[01:00:50] (CR) Catrope: [C: +2] DiscussionTools: Use bullet indentation on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/753192 (https://phabricator.wikimedia.org/T259864) (owner: Bartosz Dziewoński)
[01:01:35] (Merged) jenkins-bot: DiscussionTools: Use bullet indentation on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/753192 (https://phabricator.wikimedia.org/T259864) (owner: Bartosz Dziewoński)
[01:02:00] (replying on testwiki still works: https://test2.wikipedia.org/w/index.php?title=Talk:Main_Page&diff=480455&oldid=473191&diffmode=source)
[01:02:33] !log catrope@deploy1002 Synchronized php-1.38.0-wmf.17/extensions/DiscussionTools: Backport: [[gerrit:754915|Enable wikis to customize the syntax used for replies (T259864)]] and [[gerrit:754916|Ensure the marker appears in a reasonable place when replying with a bullet (T259864)]] (duration: 00m 53s)
[01:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:37] T259864: Enable Reply Tool to adapt to indentation syntax used at ru.wiki - https://phabricator.wikimedia.org/T259864
[01:03:25] MatmaRex: OK your config patch is ready for testing
[01:03:34] eigyan: Any luck testing your change?
[01:04:30] RoanKattouw we are good to go
[01:04:37] RoanKattouw: thanks, looks good https://ru.wikipedia.org/w/index.php?title=Обсуждение_участника:Matma_Rex&diff=next&oldid=109445133&diffmode=source
[01:05:18] Alright, deploying both (separately), we should be done here in ~3 minutes
[01:05:46] Once again, I apologize for the 20-minute delay earlier where IRL stuff came up and made me forget I was running this deployment
[01:05:57] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:753543|[wmf-config] Deploy the cawiki test safety survey to production. (T296657)]] (duration: 00m 53s)
[01:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:01] T296657: Deploy the cawiki test safety survey to production - https://phabricator.wikimedia.org/T296657
[01:07:14] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:753192|DiscussionTools: Use bullet indentation on ruwiki (T259864)]] (duration: 00m 53s)
[01:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:23] Alright, and we're done! Thanks everyone!
[01:07:39] thanks for your help :)
[01:07:57] Thanks!
[01:08:41] thanks RoanKattouw
[01:08:55] RoanKattouw 💯
[01:33:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[01:38:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[01:40:10] (PS2) Andrew Bogott: wmcs nfsclient: remove a long-absented mount [puppet] - https://gerrit.wikimedia.org/r/754991
[01:42:11] (CR) jerkins-bot: [V: -1] wmcs nfsclient: remove a long-absented mount [puppet] - https://gerrit.wikimedia.org/r/754991 (owner: Andrew Bogott)
[02:01:37] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 11130 MB (31% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[02:15:04] SRE-swift-storage, MW-on-K8s, Shellbox, serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (tstarling) There is openssl_digest() which presumably has hardware acceleration and can do SHA-256 in 2.1 seconds per gigabyte. But its input is a single...
[02:15:56] (PS1) Andrew Bogott: Add cinder-backup role/profile for eqiad1, use on cloudbackup2002 [puppet] - https://gerrit.wikimedia.org/r/755057 (https://phabricator.wikimedia.org/T292546)
[02:17:58] (CR) jerkins-bot: [V: -1] Add cinder-backup role/profile for eqiad1, use on cloudbackup2002 [puppet] - https://gerrit.wikimedia.org/r/755057 (https://phabricator.wikimedia.org/T292546) (owner: Andrew Bogott)
[02:36:00] (PS1) Andrew Bogott: no-op patch for testing CI [puppet] - https://gerrit.wikimedia.org/r/755059
[02:39:05] (PS3) Andrew Bogott: cloud-vps nfsclient: switch to using the VM-hosted scratch NFS server [puppet] - https://gerrit.wikimedia.org/r/754043 (https://phabricator.wikimedia.org/T291405)
[02:40:55] (CR) jerkins-bot: [V: -1] cloud-vps nfsclient: switch to using the VM-hosted scratch NFS server [puppet] - https://gerrit.wikimedia.org/r/754043 (https://phabricator.wikimedia.org/T291405) (owner: Andrew Bogott)
[02:51:31] (PS4) Andrew Bogott: cloud-vps nfsclient: switch to using the VM-hosted scratch NFS server [puppet] - https://gerrit.wikimedia.org/r/754043 (https://phabricator.wikimedia.org/T291405)
[02:54:51] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:01:40] (PS1) Andrew Bogott: Add dummy password for profile::openstack::eqiad1::cinder::db_pass [labs/private] - https://gerrit.wikimedia.org/r/755060
[03:01:56] (CR) Andrew Bogott: [V: +2 C: +2] Add dummy password for profile::openstack::eqiad1::cinder::db_pass [labs/private] - https://gerrit.wikimedia.org/r/755060 (owner: Andrew Bogott)
[03:04:52] (PS1) Andrew Bogott: Another dummy password for eqiad backup services in codfw [labs/private] - https://gerrit.wikimedia.org/r/755061
[03:05:07] (CR) Andrew Bogott: [V: +2 C: +2] Another dummy password for eqiad backup services in codfw [labs/private] - https://gerrit.wikimedia.org/r/755061 (owner: Andrew Bogott)
[03:23:41] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[03:29:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[03:39:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[03:55:59] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:13:36] (PS3) Jcrespo: mediabackups: Backup s8 media files at codfw [puppet] - https://gerrit.wikimedia.org/r/754026 (https://phabricator.wikimedia.org/T262668)
[04:14:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org
[04:15:27] (CR) Jcrespo: [C: +2] mediabackups: Backup s8 media files at codfw [puppet] - https://gerrit.wikimedia.org/r/754026 (https://phabricator.wikimedia.org/T262668) (owner: Jcrespo)
[04:16:49] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:19:07] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:19:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org
[04:28:43] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:38:03] SRE, SRE-swift-storage, Data-Persistence-Backup, media-backups, and 3 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (jcrespo) Codfw first pass finished for all wikis, this is the percentage of errors: {P18787} The ones with high number of err...
[04:52:57] (PS2) Andrew Bogott: no-op patch for testing CI [puppet] - https://gerrit.wikimedia.org/r/755059
[04:54:13] (Abandoned) Andrew Bogott: no-op patch for testing CI [puppet] - https://gerrit.wikimedia.org/r/755059 (owner: Andrew Bogott)
[04:58:23] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:04:56] (CR) Andrew Bogott: "CI is broken but I'm still hoping you'll check this over for me" [puppet] - https://gerrit.wikimedia.org/r/755057 (https://phabricator.wikimedia.org/T292546) (owner: Andrew Bogott)
[05:16:19] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:33] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:42:27] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:43:29] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:47:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[05:47:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[05:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[05:47:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[05:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
[05:48:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
[05:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance
[05:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance
[05:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[05:49:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[05:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[05:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[05:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T285149)', diff saved to https://phabricator.wikimedia.org/P18788 and previous config saved to /var/cache/conftool/dbconfig/20220119-054924-marostegui.json
[05:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:28] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[05:50:07] (PS1) Marostegui: Revert "pc2011: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/754920
[05:50:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T285149)', diff saved to https://phabricator.wikimedia.org/P18789 and previous config saved to /var/cache/conftool/dbconfig/20220119-055051-marostegui.json
[05:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:10] (03CR) 10Marostegui: [C: 03+2] Revert "pc2011: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/754920 (owner: 10Marostegui) [05:58:59] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:59:29] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:01:17] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:05:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P18790 and previous config saved to /var/cache/conftool/dbconfig/20220119-060555-marostegui.json [06:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P18791 and previous config saved to /var/cache/conftool/dbconfig/20220119-062100-marostegui.json [06:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:19] (03PS1) 10Kosta Harlan: Fix TopicMenuSelectWidget after OOUI change [extensions/Flow] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754921 (https://phabricator.wikimedia.org/T299473) [06:31:03] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:33:28] (03CR) 10Gergő Tisza: [C: 03+1] GrowthExperiments: Start add image experiment for desktop users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752657 (https://phabricator.wikimedia.org/T298122) (owner: 10Kosta Harlan) [06:36:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T285149)', diff saved to 
https://phabricator.wikimedia.org/P18792 and previous config saved to /var/cache/conftool/dbconfig/20220119-063605-marostegui.json [06:36:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [06:36:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [06:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:10] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [06:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T285149)', diff saved to https://phabricator.wikimedia.org/P18793 and previous config saved to /var/cache/conftool/dbconfig/20220119-063613-marostegui.json [06:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T285149)', diff saved to https://phabricator.wikimedia.org/P18794 and previous config saved to /var/cache/conftool/dbconfig/20220119-063739-marostegui.json [06:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:30] (03PS1) 10Marostegui: mariadb: Disable notifications on a few s6 hosts [puppet] - 10https://gerrit.wikimedia.org/r/755256 (https://phabricator.wikimedia.org/T299479) [06:41:42] (03CR) 10Marostegui: [C: 03+2] mariadb: Disable notifications on a few s6 hosts [puppet] - 10https://gerrit.wikimedia.org/r/755256 (https://phabricator.wikimedia.org/T299479) (owner: 10Marostegui) [06:42:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2095.codfw.wmnet with OS bullseye [06:42:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P18795 and previous config saved to /var/cache/conftool/dbconfig/20220119-065244-marostegui.json [06:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust s3 weights T263127', diff saved to https://phabricator.wikimedia.org/P18796 and previous config saved to /var/cache/conftool/dbconfig/20220119-065318-marostegui.json [06:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:22] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [07:07:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P18797 and previous config saved to /var/cache/conftool/dbconfig/20220119-070749-marostegui.json [07:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2089.codfw.wmnet with OS bullseye [07:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2117.codfw.wmnet with OS bullseye [07:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2124.codfw.wmnet with OS bullseye [07:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T285149)', diff saved to https://phabricator.wikimedia.org/P18799 and previous config saved to 
/var/cache/conftool/dbconfig/20220119-072253-marostegui.json [07:22:54] (PS1) Elukey: role::pki::multirootca: add expiry for k8s_mlserve [puppet] - https://gerrit.wikimedia.org/r/755259 (https://phabricator.wikimedia.org/T298976) [07:22:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [07:22:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [07:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:58] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [07:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T285149)', diff saved to https://phabricator.wikimedia.org/P18800 and previous config saved to /var/cache/conftool/dbconfig/20220119-072301-marostegui.json [07:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:20] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33309/console" [puppet] - https://gerrit.wikimedia.org/r/755259 (https://phabricator.wikimedia.org/T298976) (owner: Elukey) [07:26:26] (CR) Elukey: [V: +1 C: +2] role::pki::multirootca: add expiry for k8s_mlserve [puppet] - https://gerrit.wikimedia.org/r/755259 (https://phabricator.wikimedia.org/T298976) (owner: Elukey) [07:31:27] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:31:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T285149)', diff saved to
https://phabricator.wikimedia.org/P18801 and previous config saved to /var/cache/conftool/dbconfig/20220119-073129-marostegui.json [07:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:33] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [07:37:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:38:36] (CR) Giuseppe Lavagetto: [C: +2] changeprop/api-gateway: use the common_images data structure [deployment-charts] - https://gerrit.wikimedia.org/r/730559 (https://phabricator.wikimedia.org/T291530) (owner: Giuseppe Lavagetto) [07:38:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2095.codfw.wmnet with OS bullseye [07:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:49] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:41:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2089.codfw.wmnet with OS bullseye [07:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:13] (Merged) jenkins-bot: changeprop/api-gateway: use the common_images data structure [deployment-charts] - https://gerrit.wikimedia.org/r/730559 (https://phabricator.wikimedia.org/T291530) (owner: Giuseppe Lavagetto) [07:46:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P18802 and previous config saved to /var/cache/conftool/dbconfig/20220119-074633-marostegui.json [07:46:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage
(exit_code=0) for host db2117.codfw.wmnet with OS bullseye [07:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:17] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005423 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [07:49:30] (CR) Elukey: [C: +2] helmfile.d: add 'cert-manager' namespace to ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/754981 (https://phabricator.wikimedia.org/T298976) (owner: Elukey) [07:50:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2124.codfw.wmnet with OS bullseye [07:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:20] (PS1) Marostegui: db2114,db2076: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755309 (https://phabricator.wikimedia.org/T299479) [07:52:10] (CR) Marostegui: [C: +2] db2114,db2076: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755309 (https://phabricator.wikimedia.org/T299479) (owner: Marostegui) [07:52:16] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply on staging [07:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:19] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply on production [07:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2076.codfw.wmnet with OS bullseye [07:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2114.codfw.wmnet with OS bullseye [07:54:56] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:42] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [07:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:58] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [07:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:16] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:03] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [07:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:01] (PS1) Giuseppe Lavagetto: changeprop/api gateway: correct the nutcracker image label usage [deployment-charts] - https://gerrit.wikimedia.org/r/755310 [07:59:09] (CR) jerkins-bot: [V: -1] changeprop/api gateway: correct the nutcracker image label usage [deployment-charts] - https://gerrit.wikimedia.org/r/755310 (owner: Giuseppe Lavagetto) [08:01:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P18803 and previous config saved to /var/cache/conftool/dbconfig/20220119-080138-marostegui.json [08:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:17] (PS2) Giuseppe Lavagetto: changeprop/api gateway: correct the nutcracker image label usage [deployment-charts] - https://gerrit.wikimedia.org/r/755310 [08:07:52] (CR) Arturo Borrero Gonzalez: [C: -1] Add cinder-backup role/profile for eqiad1, use on cloudbackup2002 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/755057 (https://phabricator.wikimedia.org/T292546) (owner: Andrew Bogott) [08:09:07] (PS2) Jelto: admin: Shell account and analytics-privatedata-users for nray [puppet] -
https://gerrit.wikimedia.org/r/754954 (https://phabricator.wikimedia.org/T299186) [08:09:52] (CR) Majavah: "recheck" [puppet] - https://gerrit.wikimedia.org/r/755057 (https://phabricator.wikimedia.org/T292546) (owner: Andrew Bogott) [08:09:57] (CR) jerkins-bot: [V: -1] admin: Shell account and analytics-privatedata-users for nray [puppet] - https://gerrit.wikimedia.org/r/754954 (https://phabricator.wikimedia.org/T299186) (owner: Jelto) [08:10:08] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:53] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [08:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:34] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2010.codfw.wmnet with OS buster [08:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:11] (PS3) Jelto: admin: Shell account and analytics-privatedata-users for nray [puppet] - https://gerrit.wikimedia.org/r/754954 (https://phabricator.wikimedia.org/T299186) [08:14:59] (CR) Jelto: admin: Shell account and analytics-privatedata-users for nray (1 comment) [puppet] - https://gerrit.wikimedia.org/r/754954 (https://phabricator.wikimedia.org/T299186) (owner: Jelto) [08:16:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T285149)', diff saved to https://phabricator.wikimedia.org/P18804 and previous config saved to /var/cache/conftool/dbconfig/20220119-081643-marostegui.json [08:16:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [08:16:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [08:16:46] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:47] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [08:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T285149)', diff saved to https://phabricator.wikimedia.org/P18805 and previous config saved to /var/cache/conftool/dbconfig/20220119-081650-marostegui.json [08:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T285149)', diff saved to https://phabricator.wikimedia.org/P18806 and previous config saved to /var/cache/conftool/dbconfig/20220119-082318-marostegui.json [08:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:22] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [08:23:48] (PS1) Marostegui: db2129: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755311 (https://phabricator.wikimedia.org/T299479) [08:23:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2114.codfw.wmnet with OS bullseye [08:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:44] (CR) Marostegui: [C: +2] db2129: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755311 (https://phabricator.wikimedia.org/T299479) (owner: Marostegui) [08:26:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2129.codfw.wmnet with OS bullseye [08:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host
db2076.codfw.wmnet with OS bullseye [08:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:44] (CR) Giuseppe Lavagetto: [C: +2] changeprop/api gateway: correct the nutcracker image label usage [deployment-charts] - https://gerrit.wikimedia.org/r/755310 (owner: Giuseppe Lavagetto) [08:33:14] (Merged) jenkins-bot: changeprop/api gateway: correct the nutcracker image label usage [deployment-charts] - https://gerrit.wikimedia.org/r/755310 (owner: Giuseppe Lavagetto) [08:33:35] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply on staging [08:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:37] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply on production [08:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:21] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply on staging [08:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:24] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply on production [08:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:40] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply on staging [08:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:43] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply on production [08:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:33] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply on staging [08:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:36] !log oblivian@deploy1002 helmfile [staging] DONE
helmfile.d/services/api-gateway: apply on production [08:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:12] (CR) Filippo Giunchedi: [C: +2] hieradata: use / as miscweb health check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/754881 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [08:37:45] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [08:38:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P18807 and previous config saved to /var/cache/conftool/dbconfig/20220119-083822-marostegui.json [08:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:14] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply on staging [08:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:16] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply on production [08:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:04] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync on staging [08:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:55] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply on production [08:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:59] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply on staging [08:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:45] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync on production [08:42:47]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:03] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply on production [08:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:05] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply on staging [08:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:07] (PS1) Ayounsi: Force paramiko to 2.8.1 [software/homer] - https://gerrit.wikimedia.org/r/755312 (https://phabricator.wikimedia.org/T299482) [08:46:18] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync on production [08:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:39] !log disable v6 BGP to HE in eqiad for testing [08:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P18808 and previous config saved to /var/cache/conftool/dbconfig/20220119-085327-marostegui.json [08:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:01] (PS1) Majavah: hieradata: pcc: add tools and toolsbeta [puppet] - https://gerrit.wikimedia.org/r/755313 [08:56:40] (PS1) Filippo Giunchedi: pontoon: use a valid string for cert expiry [puppet] - https://gerrit.wikimedia.org/r/755314 [08:57:30] (CR) Filippo Giunchedi: [C: +2] pontoon: use a valid string for cert expiry [puppet] - https://gerrit.wikimedia.org/r/755314 (owner: Filippo Giunchedi) [08:59:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098 (s6,s7) for Bullseye reimage T299479', diff saved to https://phabricator.wikimedia.org/P18809 and previous config saved to /var/cache/conftool/dbconfig/20220119-085927-marostegui.json [08:59:31] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:32] T299479: Upgrade s6 to Bullseye - https://phabricator.wikimedia.org/T299479 [09:00:07] (PS1) Marostegui: db1098: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755315 (https://phabricator.wikimedia.org/T299479) [09:00:19] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:01:18] (CR) Marostegui: [C: +2] db1098: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/755315 (https://phabricator.wikimedia.org/T299479) (owner: Marostegui) [09:01:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2129.codfw.wmnet with OS bullseye [09:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1098.eqiad.wmnet with OS bullseye [09:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T285149)', diff saved to https://phabricator.wikimedia.org/P18811 and previous config saved to /var/cache/conftool/dbconfig/20220119-090832-marostegui.json [09:08:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [09:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [09:08:36] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [09:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:39] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T285149)', diff saved to https://phabricator.wikimedia.org/P18812 and previous config saved to /var/cache/conftool/dbconfig/20220119-090839-marostegui.json [09:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T285149)', diff saved to https://phabricator.wikimedia.org/P18813 and previous config saved to /var/cache/conftool/dbconfig/20220119-090905-marostegui.json [09:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:47] (CR) Volans: [C: +1] "LGTM as a temporary workaround" [software/homer] - https://gerrit.wikimedia.org/r/755312 (https://phabricator.wikimedia.org/T299482) (owner: Ayounsi) [09:13:03] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:16:00] (PS1) Arturo Borrero Gonzalez: cmd-checklist-runner: separate into its own module [puppet] - https://gerrit.wikimedia.org/r/755320 [09:16:02] (PS1) Arturo Borrero Gonzalez: toolforge: deploy test suite [puppet] - https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) [09:17:07] (CR) jerkins-bot: [V: -1] toolforge: deploy test suite [puppet] - https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) (owner: Arturo Borrero Gonzalez) [09:17:29] (CR) jerkins-bot: [V: -1] cmd-checklist-runner: separate into its own module [puppet] - https://gerrit.wikimedia.org/r/755320 (owner: Arturo Borrero Gonzalez) [09:17:32] (PS1) Marostegui: Revert "db2114,db2076: Disable notifications"
[puppet] - https://gerrit.wikimedia.org/r/754924 [09:20:56] !log migrate primary/secondary instances off ganeti1018 [09:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:37] (CR) Hashar: [C: +2] Merge tag 'v3.3.9' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - https://gerrit.wikimedia.org/r/755024 (https://phabricator.wikimedia.org/T240264) (owner: Hashar) [09:22:52] (CR) Marostegui: [C: +2] Revert "db2114,db2076: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/754924 (owner: Marostegui) [09:24:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P18816 and previous config saved to /var/cache/conftool/dbconfig/20220119-092410-marostegui.json [09:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:01] (CR) Hashar: "recheck should trigger git fat pull" [software/gerrit] (deploy/wmf/stable-3.3) - https://gerrit.wikimedia.org/r/755028 (https://phabricator.wikimedia.org/T240264) (owner: Hashar) [09:27:13] (CR) Hashar: [C: +2] Update Gerrit to 3.3.9 + plugins [software/gerrit] (deploy/wmf/stable-3.3) - https://gerrit.wikimedia.org/r/755028 (https://phabricator.wikimedia.org/T240264) (owner: Hashar) [09:28:30] (Merged) jenkins-bot: Merge tag 'v3.3.9' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - https://gerrit.wikimedia.org/r/755024 (https://phabricator.wikimedia.org/T240264) (owner: Hashar) [09:28:32] (Merged) jenkins-bot: Update Gerrit to 3.3.9 + plugins [software/gerrit] (deploy/wmf/stable-3.3) - https://gerrit.wikimedia.org/r/755028 (https://phabricator.wikimedia.org/T240264) (owner: Hashar) [09:28:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1098.eqiad.wmnet with OS bullseye [09:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:22] !log enable v6 BGP to HE in eqiad for testing [09:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:29] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:33:10] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/754954 (https://phabricator.wikimedia.org/T299186) (owner: Jelto) [09:33:54] (PS1) Marostegui: Revert "db1098: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/754925 [09:34:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18817 and previous config saved to /var/cache/conftool/dbconfig/20220119-093411-root.json [09:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18818 and previous config saved to /var/cache/conftool/dbconfig/20220119-093416-root.json [09:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:39] (CR) Marostegui: [C: +2] Revert "db1098: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/754925 (owner: Marostegui) [09:35:33] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:36:36] (CR) Ayounsi: [C: +2] Force paramiko to 2.8.1 [software/homer] - https://gerrit.wikimedia.org/r/755312 (https://phabricator.wikimedia.org/T299482) (owner: Ayounsi) [09:39:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P18819 and previous config saved to
/var/cache/conftool/dbconfig/20220119-093915-marostegui.json [09:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:11] (CR) Jelto: [C: +2] admin: Shell account and analytics-privatedata-users for nray [puppet] - https://gerrit.wikimedia.org/r/754954 (https://phabricator.wikimedia.org/T299186) (owner: Jelto) [09:40:57] (PS4) Muehlenhoff: sre.ganeti.addnode: Pass the Ganeti group to gnt-node add [cookbooks] - https://gerrit.wikimedia.org/r/743356 [09:42:43] (PS3) Jelto: admin: add slopes to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/754529 (https://phabricator.wikimedia.org/T299353) [09:43:47] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply on staging [09:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:50] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply on production [09:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:15] (PS2) Arturo Borrero Gonzalez: cmd-checklist-runner: separate into its own module [puppet] - https://gerrit.wikimedia.org/r/755320 [09:44:17] (PS2) Arturo Borrero Gonzalez: toolforge: deploy test suite [puppet] - https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) [09:44:43] (PS3) Jelto: admin: Shell account and analytics-privatedata-users for mfossati [puppet] - https://gerrit.wikimedia.org/r/754955 (https://phabricator.wikimedia.org/T299343) [09:44:46] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync on staging [09:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:13] (CR) jerkins-bot: [V: -1] toolforge: deploy test suite [puppet] - https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) (owner: Arturo Borrero Gonzalez) [09:45:56] (CR) jerkins-bot: [V:
-1] cmd-checklist-runner: separate into its own module [puppet] - https://gerrit.wikimedia.org/r/755320 (owner: Arturo Borrero Gonzalez) [09:46:08] (CR) Noa wmde: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/754937 (https://phabricator.wikimedia.org/T296384) (owner: Noa wmde) [09:46:22] (CR) Noa wmde: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/754933 (https://phabricator.wikimedia.org/T296384) (owner: Noa wmde) [09:47:10] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply on production [09:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:13] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply on staging [09:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:41] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync on production [09:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:51] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply on production [09:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:53] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply on staging [09:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18820 and previous config saved to /var/cache/conftool/dbconfig/20220119-094914-root.json [09:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 5%: repooling after reimage', diff saved to
https://phabricator.wikimedia.org/P18821 and previous config saved to /var/cache/conftool/dbconfig/20220119-094920-root.json [09:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:24] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [09:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:29] (PS3) Arturo Borrero Gonzalez: toolforge: deploy test suite [puppet] - https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) [09:49:33] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s) [09:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:06] I am going to upgrade the Gerrit replica for a patchset release [09:51:10] it is long overdue [09:51:28] (CR) jerkins-bot: [V: -1] toolforge: deploy test suite [puppet] - https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) (owner: Arturo Borrero Gonzalez) [09:52:01] (PS8) Thiemo Kreuz (WMDE): Make use of the ??
operator in some more situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740305 [09:54:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T285149)', diff saved to https://phabricator.wikimedia.org/P18822 and previous config saved to /var/cache/conftool/dbconfig/20220119-095421-marostegui.json [09:54:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [09:54:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [09:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:25] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [09:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T285149)', diff saved to https://phabricator.wikimedia.org/P18823 and previous config saved to /var/cache/conftool/dbconfig/20220119-095428-marostegui.json [09:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, see https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a_principal_for_a_real_user for instructions how to " [puppet] - 10https://gerrit.wikimedia.org/r/754955 (https://phabricator.wikimedia.org/T299343) (owner: 10Jelto) [09:54:43] !log hashar@deploy1002 Started deploy [gerrit/gerrit@a340940]: Gerrit to 3.3.9 on gerrit 2001 # T299451 [09:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:46] T299451: Upgrade Gerrit from 3.3.6 to 3.3.9 - https://phabricator.wikimedia.org/T299451 [09:54:52] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@a340940]: 
Gerrit to 3.3.9 on gerrit 2001 # T299451 (duration: 00m 09s) [09:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:56] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Fix TopicMenuSelectWidget after OOUI change [extensions/Flow] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754921 (https://phabricator.wikimedia.org/T299473) (owner: 10Kosta Harlan) [09:55:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/754529 (https://phabricator.wikimedia.org/T299353) (owner: 10Jelto) [09:55:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T285149)', diff saved to https://phabricator.wikimedia.org/P18824 and previous config saved to /var/cache/conftool/dbconfig/20220119-095555-marostegui.json [09:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:55] (03PS1) 10Ayounsi: Update changelog for v0.3.0 [software/homer] - 10https://gerrit.wikimedia.org/r/755323 [10:03:03] !log Upgraded gerrit-replica.wikimedia.org from 3.3.6 to 3.3.9 [10:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:11] (03PS6) 10Muehlenhoff: Make ganeti2026 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/747822 [10:04:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18825 and previous config saved to /var/cache/conftool/dbconfig/20220119-100418-root.json [10:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18826 and previous config saved to /var/cache/conftool/dbconfig/20220119-100424-root.json [10:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:46] (03PS3) 
10Arturo Borrero Gonzalez: cmd-checklist-runner: separate into its own module [puppet] - 10https://gerrit.wikimedia.org/r/755320 [10:06:18] hashar: are the beta cluster sync jobs stuck? https://integration.wikimedia.org/ci/ [10:06:23] (03CR) 10jerkins-bot: [V: 04-1] cmd-checklist-runner: separate into its own module [puppet] - 10https://gerrit.wikimedia.org/r/755320 (owner: 10Arturo Borrero Gonzalez) [10:06:25] (03CR) 10jerkins-bot: [V: 04-1] Update changelog for v0.3.0 [software/homer] - 10https://gerrit.wikimedia.org/r/755323 (owner: 10Ayounsi) [10:06:26] ^ cc sergi0 [10:06:47] (03PS4) 10Arturo Borrero Gonzalez: cmd-checklist-runner: separate into its own module [puppet] - 10https://gerrit.wikimedia.org/r/755320 [10:07:11] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti2026 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/747822 (owner: 10Muehlenhoff) [10:08:27] (03CR) 10jerkins-bot: [V: 04-1] cmd-checklist-runner: separate into its own module [puppet] - 10https://gerrit.wikimedia.org/r/755320 (owner: 10Arturo Borrero Gonzalez) [10:08:59] (03PS5) 10Arturo Borrero Gonzalez: cmd-checklist-runner: separate into its own module [puppet] - 10https://gerrit.wikimedia.org/r/755320 [10:10:38] (03CR) 10jerkins-bot: [V: 04-1] cmd-checklist-runner: separate into its own module [puppet] - 10https://gerrit.wikimedia.org/r/755320 (owner: 10Arturo Borrero Gonzalez) [10:11:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P18827 and previous config saved to /var/cache/conftool/dbconfig/20220119-101100-marostegui.json [10:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:47] (03PS2) 10Ayounsi: Update changelog for v0.3.0 [software/homer] - 10https://gerrit.wikimedia.org/r/755323 [10:11:55] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "overriding jenkins-bot because the failure is unrelated to this patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/755320 (owner: 10Arturo Borrero Gonzalez) [10:14:32] (03CR) 10Jelto: [C: 03+2] admin: add slopes to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/754529 (https://phabricator.wikimedia.org/T299353) (owner: 10Jelto) [10:15:04] (03PS1) 10Arturo Borrero Gonzalez: openstack: monitor: network tests: disable systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/755325 [10:15:13] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync on production [10:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:54] (03CR) 10jerkins-bot: [V: 04-1] openstack: monitor: network tests: disable systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/755325 (owner: 10Arturo Borrero Gonzalez) [10:17:30] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply on staging [10:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:33] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply on production [10:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:44] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "overriding jenkins-bot because the CI failure is unrelated to this patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/755325 (owner: 10Arturo Borrero Gonzalez) [10:17:49] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync on staging [10:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:59] (03PS1) 10Filippo Giunchedi: WIP [puppet] - 10https://gerrit.wikimedia.org/r/755326 [10:18:03] (03PS1) 10Filippo Giunchedi: prometheus: handle non-LVS service::catalog entries [puppet] - 10https://gerrit.wikimedia.org/r/755327 (https://phabricator.wikimedia.org/T291946) [10:18:06] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply on production [10:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:10] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply on staging [10:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:13] (03CR) 10jerkins-bot: [V: 04-1] prometheus: handle non-LVS service::catalog entries [puppet] - 10https://gerrit.wikimedia.org/r/755327 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:19:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18828 and previous config saved to /var/cache/conftool/dbconfig/20220119-101922-root.json [10:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18829 and previous config saved to /var/cache/conftool/dbconfig/20220119-101927-root.json [10:19:29] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync on production [10:19:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:59] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/755323 (owner: 10Ayounsi) [10:20:04] (03PS2) 10Filippo Giunchedi: prometheus: handle non-LVS service::catalog entries [puppet] - 10https://gerrit.wikimedia.org/r/755327 (https://phabricator.wikimedia.org/T291946) [10:20:10] (03CR) 10Ayounsi: [C: 03+2] Update changelog for v0.3.0 [software/homer] - 10https://gerrit.wikimedia.org/r/755323 (owner: 10Ayounsi) [10:20:37] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply on production [10:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:39] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply on staging [10:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:42] (03CR) 10jerkins-bot: [V: 04-1] prometheus: handle non-LVS service::catalog entries [puppet] - 10https://gerrit.wikimedia.org/r/755327 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:21:06] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync on production [10:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:21] (03PS3) 10Filippo Giunchedi: prometheus: handle non-LVS service::catalog entries [puppet] - 10https://gerrit.wikimedia.org/r/755327 (https://phabricator.wikimedia.org/T291946) [10:21:26] (03Abandoned) 10Filippo Giunchedi: WIP [puppet] - 10https://gerrit.wikimedia.org/r/755326 (owner: 10Filippo Giunchedi) [10:23:17] (03PS1) 10Hashar: gerrit: remove default index.autoReindexIfStale [puppet] - 10https://gerrit.wikimedia.org/r/755328 [10:23:18] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nick Ray - 
https://phabricator.wikimedia.org/T299186 (10Jelto) 05Open→03Resolved a:03Jelto @nray You should have access now. I'm closing this task. In case you have any problem, feel free to re-open the task. Please note:... [10:23:19] (03PS1) 10Hashar: gerrit: use default for index.batchThreads [puppet] - 10https://gerrit.wikimedia.org/r/755329 [10:23:21] (03Merged) 10jenkins-bot: Update changelog for v0.3.0 [software/homer] - 10https://gerrit.wikimedia.org/r/755323 (owner: 10Ayounsi) [10:26:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P18830 and previous config saved to /var/cache/conftool/dbconfig/20220119-102604-marostegui.json [10:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:44] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Sérgio Lopes - https://phabricator.wikimedia.org/T299353 (10Jelto) 05In progress→03Resolved a:03Jelto @SLopes-WMF You should have access now. I'm closing this task. In case you have any problem, feel free to re-open the ta... [10:27:29] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33314/console" [puppet] - 10https://gerrit.wikimedia.org/r/755327 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:28:44] (03CR) 10Noa wmde: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755322 (https://phabricator.wikimedia.org/T296383) (owner: 10Noa wmde) [10:29:42] (03CR) 10Noa wmde: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/755330 (https://phabricator.wikimedia.org/T296382) (owner: 10Noa wmde) [10:30:46] (03CR) 10Filippo Giunchedi: [V: 03+1] "See PCC for helm-charts change, which has discovery.wmnet entries but no svc.SITE.wmnet entries" [puppet] - 10https://gerrit.wikimedia.org/r/755327 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:32:41] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:34:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18831 and previous config saved to /var/cache/conftool/dbconfig/20220119-103425-root.json [10:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18832 and previous config saved to /var/cache/conftool/dbconfig/20220119-103431-root.json [10:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:24] (03PS1) 10Ayounsi: Release v2.3.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/755331 [10:36:44] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/755331 (owner: 10Ayounsi) [10:37:19] (03PS1) 10Arturo Borrero Gonzalez: cmd_checklist_runner: fix file name [puppet] - 10https://gerrit.wikimedia.org/r/755332 [10:37:33] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Release v2.3.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/755331 (owner: 10Ayounsi) [10:38:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cmd_checklist_runner: fix file name [puppet] - 10https://gerrit.wikimedia.org/r/755332 (owner: 10Arturo Borrero Gonzalez) [10:38:27] (03CR) 10Jelto: [C: 03+2] admin: 
Shell account and analytics-privatedata-users for mfossati [puppet] - 10https://gerrit.wikimedia.org/r/754955 (https://phabricator.wikimedia.org/T299343) (owner: 10Jelto) [10:38:48] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2010.codfw.wmnet [10:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:01] (03PS1) 10Elukey: Expand helmfile_namespace_certs to support the ml-serve use case [deployment-charts] - 10https://gerrit.wikimedia.org/r/755333 (https://phabricator.wikimedia.org/T298976) [10:39:14] !log ayounsi@deploy1002 Started deploy [homer/deploy@d1fbc5c]: Homer release v0.3.0 [10:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:20] (03PS4) 10Jelto: admin: Shell account and analytics-privatedata-users for mfossati [puppet] - 10https://gerrit.wikimedia.org/r/754955 (https://phabricator.wikimedia.org/T299343) [10:40:41] !log ayounsi@deploy1002 Finished deploy [homer/deploy@d1fbc5c]: Homer release v0.3.0 (duration: 01m 26s) [10:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:52] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2011.codfw.wmnet with OS buster [10:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T285149)', diff saved to https://phabricator.wikimedia.org/P18833 and previous config saved to /var/cache/conftool/dbconfig/20220119-104109-marostegui.json [10:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:13] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [10:41:35] (03CR) 10Jelto: [C: 03+2] admin: Shell account and analytics-privatedata-users for mfossati [puppet] - 10https://gerrit.wikimedia.org/r/754955 (https://phabricator.wikimedia.org/T299343) 
(owner: 10Jelto) [10:42:34] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet with reason: Release v0.3.0 - ayounsi@cumin1001 [10:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet with reason: Release v0.3.0 - ayounsi@cumin1001 [10:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:24] (03PS1) 10Elukey: Add cert-manager settings for the ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/755334 (https://phabricator.wikimedia.org/T298976) [10:46:47] jouncebot: next [10:46:47] In 1 hour(s) and 13 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220119T1200) [10:46:50] 10SRE, 10Infrastructure-Foundations, 10netops: Paramiko > 2.8.1 incompatibility with some Juniper devices - https://phabricator.wikimedia.org/T299482 (10ayounsi) 05Open→03Resolved a:03ayounsi Workaround pushed. [10:48:49] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics clients for mfossati - https://phabricator.wikimedia.org/T299343 (10Jelto) 05In progress→03Resolved a:03Jelto @mfossati you should have access now. I'm closing this task. In case you have any problem, feel free to re-open... 
[10:49:15] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "graphite: check graphite2003 metrics" [puppet] - 10https://gerrit.wikimedia.org/r/754876 (https://phabricator.wikimedia.org/T299383) (owner: 10Filippo Giunchedi) [10:49:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18834 and previous config saved to /var/cache/conftool/dbconfig/20220119-104929-root.json [10:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18835 and previous config saved to /var/cache/conftool/dbconfig/20220119-104934-root.json [10:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:38] (03CR) 10Filippo Giunchedi: [C: 03+2] wmnet: move reads to graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/754874 (https://phabricator.wikimedia.org/T299383) (owner: 10Filippo Giunchedi) [10:50:28] (03PS2) 10Elukey: Add cert-manager settings for the ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/755334 (https://phabricator.wikimedia.org/T298976) [10:50:51] 10SRE, 10Infrastructure-Foundations, 10netops: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 - https://phabricator.wikimedia.org/T254013 (10ayounsi) Juniper bumped their recommended version to at least Junos 20 on a lot of platforms. 
* pfw: T295691 * cr: T295690 * mr: T278289 [10:50:53] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM, and I think we want to start with this before hywiki and cebwiki, since it’s the wiki with the smallest number of wbc_entity_usage r" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755322 (https://phabricator.wikimedia.org/T296383) (owner: 10Noa wmde) [10:54:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [10:54:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [10:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:54:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:55:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [10:55:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet 
with reason: Maintenance [10:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T285149)', diff saved to https://phabricator.wikimedia.org/P18836 and previous config saved to /var/cache/conftool/dbconfig/20220119-105523-marostegui.json [10:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:31] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [10:56:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [10:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T285149)', diff saved to https://phabricator.wikimedia.org/P18837 and previous config saved to /var/cache/conftool/dbconfig/20220119-105640-marostegui.json [10:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:49] (03CR) 10Filippo Giunchedi: [C: 03+2] wmnet: move writes to graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/754875 (https://phabricator.wikimedia.org/T299383) (owner: 10Filippo Giunchedi) [10:56:55] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "profile: move statsd writes to graphite2003" [puppet] - 10https://gerrit.wikimedia.org/r/754877 (https://phabricator.wikimedia.org/T299383) (owner: 10Filippo Giunchedi) [10:58:51] !log flip graphite back to eqiad - T299383 [10:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:55] T299383: Move graphite back to eqiad - https://phabricator.wikimedia.org/T299383 [10:59:13] * elukey imagines Filippo flipping graphite on a table [11:00:18] haha! 
[11:00:29] like a tiktok meme [11:01:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [11:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:38] (03PS1) 10Vgutierrez: envoy: Allow configuring delayed_closed_timeout [puppet] - 10https://gerrit.wikimedia.org/r/755338 (https://phabricator.wikimedia.org/T271421) [11:04:26] (03CR) 10jerkins-bot: [V: 04-1] envoy: Allow configuring delayed_closed_timeout [puppet] - 10https://gerrit.wikimedia.org/r/755338 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:04:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18838 and previous config saved to /var/cache/conftool/dbconfig/20220119-110433-root.json [11:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18839 and previous config saved to /var/cache/conftool/dbconfig/20220119-110438-root.json [11:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:21] (03CR) 10Ayounsi: remove references to centrallog2001 (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/754028 (https://phabricator.wikimedia.org/T298994) (owner: 10Herron) [11:05:48] hmm [11:05:50] error during compilation: Could not find resource 'Labstore::Nfs_mount[project-on-labstore-secondary]' in parameter 'require' (file: /srv/workspace/puppet/modules/puppet_compiler/manifests/init.pp, line: 35) [11:06:04] alright things seem to be working as expected, I'll proceed with mw config [11:06:37] I'm getting this error on a totally unrelated CR.. temporary issue? 
[11:07:03] (03PS1) 10Vgutierrez: cache::envoy: Set the delayed_close_timeout to 30s [puppet] - 10https://gerrit.wikimedia.org/r/755340 (https://phabricator.wikimedia.org/T271421) [11:07:15] (03CR) 10Vgutierrez: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/755338 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:07:22] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "ProductionServices: use graphite2003 for statsd" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754879 (https://phabricator.wikimedia.org/T299383) (owner: 10Filippo Giunchedi) [11:07:51] vgutierrez: FWIW I used the compiler like an hour ago and it worked fine [11:08:01] gotcha [11:08:03] thx [11:08:23] (03Merged) 10jenkins-bot: Revert "ProductionServices: use graphite2003 for statsd" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754879 (https://phabricator.wikimedia.org/T299383) (owner: 10Filippo Giunchedi) [11:09:23] (03CR) 10jerkins-bot: [V: 04-1] cache::envoy: Set the delayed_close_timeout to 30s [puppet] - 10https://gerrit.wikimedia.org/r/755340 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:09:57] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add bullseye build [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/754902 (owner: 10Giuseppe Lavagetto) [11:09:59] same issue :/ [11:09:59] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33315/console" [puppet] - 10https://gerrit.wikimedia.org/r/755338 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:10:05] but pcc is happy [11:10:24] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/749152 (owner: 10Jbond) [11:10:24] so I assume that something is wrong with that NFS mount in our integration environment [11:11:35] jouncebot: next [11:11:35] In 0 hour(s) and 48 minute(s): UTC morning backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220119T1200) [11:11:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P18840 and previous config saved to /var/cache/conftool/dbconfig/20220119-111144-marostegui.json [11:11:47] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33316/console" [puppet] - 10https://gerrit.wikimedia.org/r/755340 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:53] !log oblivian@deploy1002 Started deploy [docker-pkg/deploy@536f77a]: redeploy of 3.0.2, in preparation for deployment on build2001 [11:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:01] !log filippo@deploy1002 Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:754879|Revert "ProductionServices: use graphite2003 for statsd" (T299383)]] (duration: 02m 09s) [11:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:05] T299383: Move graphite back to eqiad - https://phabricator.wikimedia.org/T299383 [11:12:35] Amir1: thank you so much for deploy-commands, so helpful in reducing friction [11:12:54] !log oblivian@deploy1002 Finished deploy [docker-pkg/deploy@536f77a]: redeploy of 3.0.2, in preparation for deployment on build2001 (duration: 01m 00s) [11:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:11] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add build2001 as a target [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/754903 (owner: 10Giuseppe Lavagetto) [11:13:52] ^^ [11:14:03] !log oblivian@deploy1002 Started deploy [docker-pkg/deploy@62a5e87]: redeploy of 3.0.2, including build2001 [11:14:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:08] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:15:11] (03PS1) 10Ayounsi: Update automatic Icinga LLDP hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/755342 [11:15:43] !log add ganeti2026 to Ganeti codfw cluster T282603 [11:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:47] T282603: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 [11:15:54] (03CR) 10Ayounsi: [C: 03+1] C:monitoring: uyse ['lldp']['parent'] instead of lldp_parent [puppet] - 10https://gerrit.wikimedia.org/r/749152 (owner: 10Jbond) [11:16:39] (03CR) 10DCausse: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/754523 (owner: 10DCausse) [11:16:46] (03PS2) 10Ayounsi: C:monitoring: use ['lldp']['parent'] instead of lldp_parent [puppet] - 10https://gerrit.wikimedia.org/r/749152 (owner: 10Jbond) [11:17:05] (03CR) 10jerkins-bot: [V: 04-1] Update automatic Icinga LLDP hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/755342 (owner: 10Ayounsi) [11:17:33] (03CR) 10Ayounsi: [C: 03+1] "Might be worth running PCC on all hosts once the parent change is merged." 
[puppet] - 10https://gerrit.wikimedia.org/r/749153 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [11:18:41] (03CR) 10Ayounsi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/755342 (owner: 10Ayounsi) [11:19:00] (03CR) 10jerkins-bot: [V: 04-1] C:monitoring: use ['lldp']['parent'] instead of lldp_parent [puppet] - 10https://gerrit.wikimedia.org/r/749152 (owner: 10Jbond) [11:19:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18842 and previous config saved to /var/cache/conftool/dbconfig/20220119-111937-root.json [11:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18843 and previous config saved to /var/cache/conftool/dbconfig/20220119-111942-root.json [11:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:24] (03CR) 10jerkins-bot: [V: 04-1] Update automatic Icinga LLDP hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/755342 (owner: 10Ayounsi) [11:20:34] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10ayounsi) 05Stalled→03Declined Not needed anymore. 
[11:26:24] !log bounce navtiming on webperf1001 - T299383 [11:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:30] T299383: Move graphite back to eqiad - https://phabricator.wikimedia.org/T299383 [11:26:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P18844 and previous config saved to /var/cache/conftool/dbconfig/20220119-112649-marostegui.json [11:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:16] !log bounce superset on an-tool1010 - T299383 [11:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:34] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2011.codfw.wmnet with OS buster [11:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:44] !log bounce superset on an-tool1005 - T299383 [11:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:51] (03PS1) 10Vgutierrez: cache::envoy: Decrease upstream idle_timeout [puppet] - 10https://gerrit.wikimedia.org/r/755343 (https://phabricator.wikimedia.org/T271421) [11:29:57] godog: if you want to see its known issues, make sure to click on "report problems" at the bottom [11:30:38] Amir1: haha! 
I'm not proud but I've half-recognized the url [11:31:12] ugh, I need to manually bypass checks and put it in w.wiki [11:31:58] hehehe [11:32:00] gotta go to lunch [11:32:06] totally-not-rickroll.toolforge.org [11:32:17] taavi: smart [11:32:30] !log oblivian@deploy1002 Finished deploy [docker-pkg/deploy@62a5e87]: redeploy of 3.0.2, including build2001 (duration: 18m 27s) [11:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18845 and previous config saved to /var/cache/conftool/dbconfig/20220119-113440-root.json [11:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18846 and previous config saved to /var/cache/conftool/dbconfig/20220119-113445-root.json [11:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:07] !log rebalance ganeti group D in codfw after adding ganeti2026 T282603 [11:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:12] T282603: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 [11:36:12] (03CR) 10Vgutierrez: [C: 03+2] cache::envoy: Decrease upstream idle_timeout [puppet] - 10https://gerrit.wikimedia.org/r/755343 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:38:51] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2012.codfw.wmnet with OS buster [11:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:32] taavi: https://gerrit.wikimedia.org/r/c/labs/tools/deploy-commands/+/755344 [11:41:51] lolol [11:41:54] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1105:3312 (T285149)', diff saved to https://phabricator.wikimedia.org/P18847 and previous config saved to /var/cache/conftool/dbconfig/20220119-114154-marostegui.json [11:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:58] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [11:41:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [11:42:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [11:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [11:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [11:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [11:42:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [11:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T285149)', diff saved to https://phabricator.wikimedia.org/P18848 and previous config saved to /var/cache/conftool/dbconfig/20220119-114237-marostegui.json [11:42:40] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:31] deployed [11:45:05] (03PS1) 10ArielGlenn: update wme html dumps downloader to use JWT auth tokens [puppet] - 10https://gerrit.wikimedia.org/r/755345 (https://phabricator.wikimedia.org/T273585) [11:45:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T285149)', diff saved to https://phabricator.wikimedia.org/P18849 and previous config saved to /var/cache/conftool/dbconfig/20220119-114552-marostegui.json [11:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:06] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikip [11:48:06] /v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.192.32.151:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.192.32.151:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/ [11:49:14] 10SRE, 10Scap: scap fails deployments on bullseye/python 3.9 - https://phabricator.wikimedia.org/T299501 (10Joe) [11:49:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18850 and previous config saved to /var/cache/conftool/dbconfig/20220119-114944-root.json [11:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:49] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18851 and previous config saved to /var/cache/conftool/dbconfig/20220119-114949-root.json [11:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:04] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:51:07] (03CR) 10Lucas Werkmeister (WMDE): "LGTM, but would prefer to do this after warwiki." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755330 (https://phabricator.wikimedia.org/T296382) (owner: 10Noa wmde) [11:51:10] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable statement usage tracking for Armenian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755330 (https://phabricator.wikimedia.org/T296382) (owner: 10Noa wmde) [11:53:17] (03CR) 10Lucas Werkmeister (WMDE): Enable usage tracking for statement for cebwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754933 (https://phabricator.wikimedia.org/T296384) (owner: 10Noa wmde) [11:54:21] (03CR) 10Lucas Werkmeister (WMDE): Introduce $wmgEntityUsageModifierLimitsStatement (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754937 (https://phabricator.wikimedia.org/T296384) (owner: 10Noa wmde) [11:57:48] 10SRE, 10Scap: scap fails deployments on bullseye/python 3.9 - https://phabricator.wikimedia.org/T299501 (10Joe) The problem arises because pyyaml version 5.3.1 by default uses the safe loader for python objects, so to make the yaml load we need to change the code from: ` yaml.load(dump) ` to ` yaml.load(du... [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! UTC morning backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220119T1200) [12:00:04] cormacparle and sergi0: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:05] * cormacparle waves [12:00:18] o/ [12:00:25] \o [12:00:35] Is “may I have your attention please” a new jouncebot phrase? I don’t remember it ^^ [12:00:51] never noticed it before [12:00:53] Lucas_WMDE: we'll need sync-world for our patch, it has i18n stuff in it [12:00:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P18852 and previous config saved to /var/cache/conftool/dbconfig/20220119-120057-marostegui.json [12:00:59] I feel i saw it before... [12:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:14] Lucas_WMDE: do you want to deploy or should I? :) [12:01:32] urbanecm: if you know how to do sync-world I’d prefer you do it [12:01:41] Sure [12:01:41] I don’t think I’ve done it before [12:01:52] Or i can walk you through it :) [12:02:30] sure :) [12:02:41] (03CR) 10Urbanecm: [C: 03+2] Fix TopicMenuSelectWidget after OOUI change [extensions/Flow] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754921 (https://phabricator.wikimedia.org/T299473) (owner: 10Kosta Harlan) [12:02:45] what happened to sync-php-all? :P [12:03:05] (03CR) 10Urbanecm: [C: 03+2] Revert "Undo update to the way the search interface is set" [extensions/MediaSearch] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753487 (owner: 10Cparle) [12:04:30] Lucas_WMDE: so, the first steps are the same. Merge, fetch to debug and test if possible. The catch with i18n changes is that they won't take any effect at the debug server, but anything else should continue to work. 
[12:05:26] ok [12:06:07] I +2'ed both changes, so now we need to wait for CI to merge them. [12:06:18] this commit has changes that I would normally sync separately (first extension.json so the hook no longer gets called, then sync the PHP file where the method is removed) [12:06:28] I’m guessing that’s not a thing with sync-world? [12:07:18] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:07:36] No. If you want to guarantee the order of changes, you can first sync individual files and then do sync-world to trigger a i18n cache rebuild. [12:07:50] ok [12:08:23] but then people would probably see preferences with missing i18n messages until the sync-world finishes [12:08:55] so it’s probably better to just do a sync-world here, and accept a possible brief spike of errors if the PHP file is synced before the extension.json? [12:09:15] (03PS1) 10Muehlenhoff: Make ganeti1023 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/755346 (https://phabricator.wikimedia.org/T283036) [12:10:59] I think we won't avoid missing i18n for a few minutes even with sync-world, it's not atomic. [12:11:24] The hook runs on every page display, but missing i18n would just affect those who open preferences [12:11:44] So I'm for avoiding the exceptions by syncing the files individually first [12:12:16] hm, makes sense [12:12:38] * Lucas_WMDE wonders if we could ninja-sync the files partway through the sync-world [12:13:13] Probably not. [12:13:17] (and usually the sync-world not being atomic is not an issue, because it’s mainly run at the beginning of the train, before the new version is actually used by any wiki?) 
[12:13:27] Correct [12:13:31] yay [12:14:02] It is used to get code on the servers and then train deployment is "just" changing wikiversions.json to make wikis use the new code [12:14:23] makes sense [12:15:55] (03PS1) 10Btullis: Add single quotes to the wildcard for rsyncing nginxlogs [puppet] - 10https://gerrit.wikimedia.org/r/755347 (https://phabricator.wikimedia.org/T299358) [12:16:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P18853 and previous config saved to /var/cache/conftool/dbconfig/20220119-121602-marostegui.json [12:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:53] (03PS1) 10Marostegui: db1155: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/755348 (https://phabricator.wikimedia.org/T299479) [12:18:04] (03CR) 10Marostegui: [C: 03+2] db1155: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/755348 (https://phabricator.wikimedia.org/T299479) (owner: 10Marostegui) [12:19:05] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2012.codfw.wmnet with OS buster [12:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1155.eqiad.wmnet with OS bullseye [12:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:30] (03Merged) 10jenkins-bot: Fix TopicMenuSelectWidget after OOUI change [extensions/Flow] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754921 (https://phabricator.wikimedia.org/T299473) (owner: 10Kosta Harlan) [12:20:30] (03Merged) 10jenkins-bot: Revert "Undo update to the way the search interface is set" [extensions/MediaSearch] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753487 (owner: 10Cparle) [12:21:12] Both merged, great :) [12:21:27] Lucas_WMDE: can you deploy the Flow 
change first please? [12:21:45] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33319/console" [puppet] - 10https://gerrit.wikimedia.org/r/755347 (https://phabricator.wikimedia.org/T299358) (owner: 10Btullis) [12:23:55] urbanecm: can do [12:24:07] (I just uploaded https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Flow/+/755349 because I noticed something that bothered me while looking at the change) [12:24:31] Thanks :9 [12:24:37] * :) [12:24:42] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add single quotes to the wildcard for rsyncing nginxlogs [puppet] - 10https://gerrit.wikimedia.org/r/755347 (https://phabricator.wikimedia.org/T299358) (owner: 10Btullis) [12:25:01] sergi0: the Flow change should be on mwdebug1001, can you test it [12:25:04] * Lucas_WMDE peeks at mwversions [12:25:11] should be testable on group0 wikis [12:25:24] (oh and I forgot a question mark up there sorry ^^) [12:26:16] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti1023 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/755346 (https://phabricator.wikimedia.org/T283036) (owner: 10Muehlenhoff) [12:26:16] Yes it should be testable on group0. Testing now. [12:27:28] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase201[12].codfw.wmnet [12:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:50] Everything looks good [12:30:05] alright, thanks [12:30:45] Thank you! 
And thanks for the following patch [12:31:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T285149)', diff saved to https://phabricator.wikimedia.org/P18854 and previous config saved to /var/cache/conftool/dbconfig/20220119-123106-marostegui.json [12:31:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [12:31:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [12:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:11] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [12:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T285149)', diff saved to https://phabricator.wikimedia.org/P18855 and previous config saved to /var/cache/conftool/dbconfig/20220119-123114-marostegui.json [12:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:51] (03CR) 10Vgutierrez: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/754975 (https://phabricator.wikimedia.org/T271421) (owner: 10MMandere) [12:31:56] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.18/extensions/Flow/modules/flow/ui/widgets/mw.flow.ui.TopicMenuSelectWidget.js: Backport: [[gerrit:754921|Fix TopicMenuSelectWidget after OOUI change (T299473)]] (duration: 01m 08s) [12:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:59] T299473: [regression-wmf.18] Cannot read properties of undefined (reading 'length') - Flow pages are unresponsive - https://phabricator.wikimedia.org/T299473 [12:32:02] alright, Flow backport done [12:32:29] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T285149)', diff saved to https://phabricator.wikimedia.org/P18856 and previous config saved to /var/cache/conftool/dbconfig/20220119-123229-marostegui.json [12:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:00] cormacparle: the MediaSearch change should be on mwdebug1001, can you test it there? [12:33:05] sure [12:33:59] note the i18n change will not yet work (but everything else should) [12:34:07] yep looks good [12:34:10] (03CR) 10JMeybohm: "Small nit on the docs, but looks good apart from that!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/755333 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [12:34:19] (03CR) 10JMeybohm: [C: 03+1] Expand helmfile_namespace_certs to support the ml-serve use case [deployment-charts] - 10https://gerrit.wikimedia.org/r/755333 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [12:35:20] ok [12:35:41] so… sync-file the extension.json first, probably? [12:35:46] that should be okay on its own [12:36:01] yes [12:36:35] and then we can just start sync-world (as there's only one other file left) [12:36:45] *nod* [12:36:48] syncing [12:36:51] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33320/console" [puppet] - 10https://gerrit.wikimedia.org/r/754975 (https://phabricator.wikimedia.org/T271421) (owner: 10MMandere) [12:36:52] thanks [12:37:01] the command for sync-world is `scap sync-world 'Message'` [12:37:04] what’s the sync-world syntax? scap sync-world 'reason'? 
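[editor's note] The ordering agreed above (first sync the one file that is safe on its own, then let sync-world carry the rest and rebuild the i18n cache) can be sketched as a small Python helper. It only builds the list of commands and runs nothing. The `scap sync-world 'Message'` form is quoted from the log; the `scap sync-file <path> 'Message'` form, the function name `plan_backport` and its arguments are assumptions made for illustration, not scap's documented interface.

```python
import shlex


def plan_backport(first_files, message):
    """Return scap commands in the order discussed: single files first, sync-world last."""
    commands = []
    for index, path in enumerate(first_files, start=1):
        label = f"{message} (part {index})"
        # Assumed syntax for syncing one file: scap sync-file <path> '<message>'
        commands.append(f"scap sync-file {shlex.quote(path)} {shlex.quote(label)}")
    final_label = f"{message} (part {len(first_files) + 1})"
    # Syntax quoted from the log: scap sync-world 'Message'
    commands.append(f"scap sync-world {shlex.quote(final_label)}")
    return commands


if __name__ == "__main__":
    for command in plan_backport(
        ["php-1.38.0-wmf.17/extensions/MediaSearch/extension.json"],
        "Backport: MediaSearch",
    ):
        print(command)
```

Keeping the plan as plain strings means it can be reviewed in the channel before anyone runs it; as noted in the discussion, sync-world itself is not atomic, so this ordering only removes the hook-related errors, not the short window of missing i18n messages.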
[12:37:08] ah, thanks :) [12:37:28] and then there’s a maintenance script to be run afterwards, I hear [12:37:30] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] cumin: Add cache::upload_envoy to cp aliases [puppet] - 10https://gerrit.wikimedia.org/r/754975 (https://phabricator.wikimedia.org/T271421) (owner: 10MMandere) [12:38:13] there is - I can run that as soon as the php/json files are synced [12:38:15] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.17/extensions/MediaSearch/extension.json: Backport: [[gerrit:753487|Revert "Undo update to the way the search interface is set"]] (part 1) (duration: 01m 34s) [12:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:32] ok [12:38:37] then I think we’re ready for the sync-world [12:38:43] +1 [12:38:53] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport: [[gerrit:753487|Revert "Undo update to the way the search interface is set"]] (part 2) [12:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:17] 🎶 sync the world 🎶 make it a better place 🎶 [12:39:44] for MediaSearch and the settings interface [12:40:23] :D [12:41:15] (03PS2) 10MMandere: cumin: Add cache::upload_envoy to cp aliases [puppet] - 10https://gerrit.wikimedia.org/r/754975 (https://phabricator.wikimedia.org/T271421) [12:43:05] (03CR) 10jerkins-bot: [V: 04-1] cumin: Add cache::upload_envoy to cp aliases [puppet] - 10https://gerrit.wikimedia.org/r/754975 (https://phabricator.wikimedia.org/T271421) (owner: 10MMandere) [12:43:23] (03PS1) 10Lucas Werkmeister (WMDE): nginxlogs: Use shell to expand glob [puppet] - 10https://gerrit.wikimedia.org/r/755352 (https://phabricator.wikimedia.org/T299358) [12:44:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:37] huh, l10n-update is already finished (duration 1m28s) [12:45:28] the 
whole script is remarkably faster than it used to be [12:46:36] sounds good ^^ [12:46:38] I'm still seeing the old preferences interface when I'm not on the debug server, is that expected? [12:46:51] I think so, since the sync is still ongoing [12:46:55] correct [12:47:00] it just finished sync-pull-masters [12:47:06] the !log above was just for when it started [12:47:23] there'll be a second !log when it finishes [12:47:32] 👍 [12:47:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P18857 and previous config saved to /var/cache/conftool/dbconfig/20220119-124733-marostegui.json [12:47:36] when you are done I will upgrade Gerrit [12:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1155.eqiad.wmnet with OS bullseye [12:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:50:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:02] (03PS1) 10Marostegui: Revert "db1155: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/755366 [12:51:42] (03CR) 10Marostegui: [C: 03+2] Revert "db1155: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/755366 (owner: 10Marostegui) [12:53:12] (03PS3) 10Noa wmde: Introduce $wmgEntityUsageModifierLimitsStatement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754937 (https://phabricator.wikimedia.org/T296384) [12:55:39] (03PS1) 10Marostegui: instances.yaml: Remove db1128 from dbctl [puppet] - 
10https://gerrit.wikimedia.org/r/755353 (https://phabricator.wikimedia.org/T299344) [12:56:15] looks like it’s now busy rsyncing several gigabytes to a few hundred apaches [12:56:28] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1128 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/755353 (https://phabricator.wikimedia.org/T299344) (owner: 10Marostegui) [12:56:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1128 from dbctl T299344', diff saved to https://phabricator.wikimedia.org/P18858 and previous config saved to /var/cache/conftool/dbconfig/20220119-125658-marostegui.json [12:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:02] T299344: Upgrade m1 to Bullseye - https://phabricator.wikimedia.org/T299344 [13:00:37] uh oh, one host is out of space [13:00:52] not sure how to tell which one [13:01:24] I think it must be one of mw2300, mw2289, mw1319, w1313, mw2254, mw1366, mw1420, mw1306? 
they were trying to sync from mw1420 [13:01:59] maybe scap will print the hostname when it’s finished syncing the other ones [13:02:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P18859 and previous config saved to /var/cache/conftool/dbconfig/20220119-130238-marostegui.json [13:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:58] it does not >:( [13:03:11] (03CR) 10Noa wmde: Introduce $wmgEntityUsageModifierLimitsStatement (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754937 (https://phabricator.wikimedia.org/T296384) (owner: 10Noa wmde) [13:03:12] oh wait [13:03:15] it’s mwdebug1001 that’s out of space?! [13:03:28] yeah I didn’t see that part in the message [13:03:48] df -h / on mwdebug1001 says Size 49G, Used 46G, Avail 0, Use 100% [13:03:59] oops :) [13:04:02] not sure what that discrepancy between Size and Used means but it seems to be out of space [13:04:13] urbanecm: `w` claims you’re the only other person on that server [13:04:16] what’d you do :P [13:04:23] nothing, i just logged in to do df -h :D [13:04:28] hrm [13:04:30] well then… [13:04:39] * Lucas_WMDE launches ncdu [13:04:57] Lucas_WMDE: the size difference between Size and Used is Linux's protection from an actual exhaustion of disk space [13:05:01] only root can use the remaining space [13:05:26] !log lucaswerkmeister-wmde@mwdebug1001:~$ sudo -u www-data rm /tmp/URL*.urlupload_ # save space [13:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:38] 8.5G Avail now [13:05:55] those dated back to Jan 13, I hope they weren’t needed anymore [13:06:37] (03PS1) 10Jbond: p:puppet_compiler: fix pcc [puppet] - 
10https://gerrit.wikimedia.org/r/755355 [13:06:54] there's also a bunch of old MW versions. But those are also present at deployment, so not removing them [13:07:27] (03CR) 10MMandere: [V: 03+2 C: 03+2] cumin: Add cache::upload_envoy to cp aliases [puppet] - 10https://gerrit.wikimedia.org/r/754975 (https://phabricator.wikimedia.org/T271421) (owner: 10MMandere) [13:07:46] (03PS4) 10Noa wmde: Enable usage tracking for statement for cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754933 (https://phabricator.wikimedia.org/T296384) [13:08:01] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport: [[gerrit:753487|Revert "Undo update to the way the search interface is set"]] (part 2) (duration: 29m 08s) [13:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:17] (03CR) 10Filippo Giunchedi: [C: 03+1] bump patch version to update plugins [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/755033 (owner: 10Cwhite) [13:08:28] (03CR) 10jerkins-bot: [V: 04-1] p:puppet_compiler: fix pcc [puppet] - 10https://gerrit.wikimedia.org/r/755355 (owner: 10Jbond) [13:08:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:08:33] doing another scap pull on mwdebug1001 to ensure it has the latest everything now [13:08:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:50] Lucas_WMDE: urbanecm: maybe some old wmf branches have not been cleaned? [13:08:53] looks like sync-world has finished now Lucas_WMDE ? [13:08:59] (taking longer than usual, I guess all the l10n files had their mtime updated and so they need to be synced again?) 
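[editor's note] The df figures quoted earlier (Size 49G, Used 46G, Avail 0) fit the explanation given in the channel that part of the filesystem is held back for root. A minimal Python sketch of that arithmetic follows. It assumes an ext-style filesystem with a root reserve (the log does not say which filesystem mwdebug1001 uses) and works from df's rounded values, so the result is only approximate; the function name is hypothetical.

```python
def reserved_gib(size, used, avail):
    """Space that is neither used nor available to unprivileged users."""
    return size - used - avail


# Values as reported by `df -h /` in the log (rounded by df).
size, used, avail = 49, 46, 0
reserved = reserved_gib(size, used, avail)
share = reserved / size

if __name__ == "__main__":
    print(f"{reserved}G held back, roughly {share:.0%} of the filesystem")
```

Allowing for rounding, a reserve of this size is in line with the 5% that ext4 sets aside for root by default, which would explain why Avail reached 0 while Used was still below Size.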
[13:09:08] hashar: sounds plausible [13:09:11] cormacparle: yes [13:09:21] as i said, they're at deployment host, so i'm not touching those [13:09:32] (03CR) 10Filippo Giunchedi: [C: 03+1] prepare for logstash 7.16.3 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/755041 (https://phabricator.wikimedia.org/T299168) (owner: 10Cwhite) [13:09:51] (03PS2) 10Jbond: p:puppet_compiler: fix pcc [puppet] - 10https://gerrit.wikimedia.org/r/755355 [13:09:54] awesome, everything seems to be working now anyway [13:10:01] thanks urbanecm and Lucas_WMDE ! [13:10:08] \o/ [13:10:09] yeah it has a bunch of old ones php-1.38.0-wmf.12 php-1.38.0-wmf.13 and php-1.38.0-wmf.16 [13:10:33] yeah, it looks like there’s space for ca. one more train on mwdebug1001 [13:10:34] they are supposed to be cleaned as part of running the train on tuesday [13:10:37] (03CR) 10Filippo Giunchedi: [C: 03+1] builder: add opensearch1 pbuilder hooks for logstash-plugins update [puppet] - 10https://gerrit.wikimedia.org/r/755043 (https://phabricator.wikimedia.org/T299168) (owner: 10Cwhite) [13:10:38] since each wmf dir seems to be a bit over 5G [13:11:07] yup, it's a complete copy [13:11:11] so the cleaning's not working :)) [13:11:36] or they never got run [13:11:44] (scap pull on mwdebug1001 still running btw) [13:12:48] alright, now it’s done [13:12:58] cormacparle: feel free to run that maintenance script if you aren’t already [13:13:09] done [13:13:13] all is well [13:13:16] ok! [13:13:19] and now we have about 12G free at mwdebug [13:13:25] !log UTC morning backport+config window done [13:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:33] what else did you remove? 
(03CR) 10Jbond: [C: 03+2] p:puppet_compiler: fix pcc [puppet] - 10https://gerrit.wikimedia.org/r/755355 (owner: 10Jbond) [13:13:40] nothing [13:13:48] the scap sync must've removed something [13:13:48] o_O [13:13:50] or hashar [13:13:54] ah, could be [13:14:14] I will clean the old branches when you are done [13:14:20] hashar: we're done :) [13:14:28] yup, go aheda [13:14:30] *ahead [13:14:30] (03PS2) 10Jbond: Update automatic Icinga LLDP hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/755342 (owner: 10Ayounsi) [13:14:39] (and/or feel free to restart gerrit) [13:14:46] (03PS3) 10Jbond: C:monitoring: use ['lldp']['parent'] instead of lldp_parent [puppet] - 10https://gerrit.wikimedia.org/r/749152 [13:14:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:23] (03PS7) 10Jbond: blazegraph: prometheus exporter may bypass nginx [puppet] - 10https://gerrit.wikimedia.org/r/754523 (owner: 10DCausse) [13:15:33] (03PS2) 10Jbond: envoy: Allow configuring delayed_closed_timeout [puppet] - 10https://gerrit.wikimedia.org/r/755338 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [13:15:47] (03PS2) 10Jbond: cache::envoy: Set the delayed_close_timeout to 30s [puppet] - 10https://gerrit.wikimedia.org/r/755340 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [13:16:52] !log Cleaning all branch with `scap clean --delete 1.38.0-wmf.13` apparently missed in previous train # T293958 T293959 [13:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:56] T293959: 1.38.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T293959 [13:16:56] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958 [13:17:22] FYI all, I have fixed puppet CI; if you have a change you should be able to rebase on production to get things passing 
again [13:17:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T285149)', diff saved to https://phabricator.wikimedia.org/P18860 and previous config saved to /var/cache/conftool/dbconfig/20220119-131743-marostegui.json [13:17:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [13:17:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [13:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:47] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [13:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T285149)', diff saved to https://phabricator.wikimedia.org/P18861 and previous config saved to /var/cache/conftool/dbconfig/20220119-131750-marostegui.json [13:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:34] !log hashar@deploy1002 Pruned MediaWiki: 1.38.0-wmf.13 (duration: 03m 11s) [13:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T285149)', diff saved to https://phabricator.wikimedia.org/P18862 and previous config saved to /var/cache/conftool/dbconfig/20220119-131905-marostegui.json [13:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:17] !log Cleaning all branch with `scap clean --delete 1.38.0-wmf.12` apparently missed in previous train # T293958 T293959 [13:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:33] cannot delete non-empty 
directory: php-1.38.0-wmf.12/cache/l10n [13:19:33] pff [13:19:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:09] hashar: that message feels like an evergreen to me :)) [13:20:28] i think we have a task about it [13:20:32] must be some sudo issue [13:20:41] !log hashar@deploy1002 Pruned MediaWiki: 1.38.0-wmf.12 (duration: 01m 43s) [13:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:26] hashar: or rm -r vs rm -rf [13:22:32] !log hashar@deploy1002 Pruned MediaWiki: 1.38.0-wmf.16 (duration: 01m 32s) [13:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:59] I am deploying Gerrit [13:23:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:24:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:11] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:24:51] !log hashar@deploy1002 Started deploy [gerrit/gerrit@a340940]: Gerrit upgrade from 3.3.6 to 3.3.9 on gerrit1001 # T299451 [13:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:54] T299451: Upgrade Gerrit from 3.3.6 to 3.3.9 - https://phabricator.wikimedia.org/T299451 [13:24:58] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@a340940]: Gerrit upgrade from 3.3.6 to 3.3.9 on gerrit1001 # T299451 (duration: 00m 08s) [13:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:41] !log Restarting 
Gerrit [13:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:08] it is back up [13:28:14] our human monitoring did not complain [13:29:51] one WMDE colleague noticed, does that count as human monitoring :P [13:29:53] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/754523 (owner: 10DCausse) [13:30:07] (it’s all working again though) [13:30:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:40] it is reasonably fast to boot indeed :] [13:30:50] Lucas_WMDE: my apologize to your colleague [13:34:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P18863 and previous config saved to /var/cache/conftool/dbconfig/20220119-133410-marostegui.json [13:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance [13:35:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance [13:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T239814)', diff saved to https://phabricator.wikimedia.org/P18864 and previous config saved to /var/cache/conftool/dbconfig/20220119-133514-ladsgroup.json [13:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:18] T239814: Automate DB upgrades - https://phabricator.wikimedia.org/T239814 [13:35:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply on pinkunicorn [13:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:34] hashar: I think I’ll interpret their response as “apology accepted” :P [13:36:08] Lucas_WMDE: awesome :-] I get I owe someone a beer/soda or similar :] [13:36:18] !log ladsgroup@cumin1001 START - Cookbook sre.mysql.upgrade for db1100.eqiad.wmnet [13:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:39:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:21] (03CR) 10DCausse: "PCC looks fine, wcqs system unit sees the --nginx-port param removed forcing the use of the blazegraph port wdqs one is left intact" [puppet] - 10https://gerrit.wikimedia.org/r/754523 (owner: 10DCausse) [13:45:01] 10SRE: systemd job "Sync keys for Keystone fernet tokens to ${thishost}" potentially broken - https://phabricator.wikimedia.org/T299519 (10Michael) [13:48:24] 10SRE: microsites systemd job "Sync OS migration reports/overview" might be broken - https://phabricator.wikimedia.org/T299520 (10Michael) [13:49:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P18865 and previous config saved to /var/cache/conftool/dbconfig/20220119-134915-marostegui.json [13:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:33] (03PS1) 10JMeybohm: Add scheduler_token 
to all k8s masters [labs/private] - 10https://gerrit.wikimedia.org/r/755389 (https://phabricator.wikimedia.org/T290967) [14:04:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T285149)', diff saved to https://phabricator.wikimedia.org/P18866 and previous config saved to /var/cache/conftool/dbconfig/20220119-140419-marostegui.json [14:04:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [14:04:23] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add scheduler_token to all k8s masters [labs/private] - 10https://gerrit.wikimedia.org/r/755389 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [14:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [14:04:24] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [14:04:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T285149)', diff saved to https://phabricator.wikimedia.org/P18867 and previous config saved to /var/cache/conftool/dbconfig/20220119-140433-marostegui.json 
[14:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:58] !log ladsgroup@cumin1001 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for db1100.eqiad.wmnet [14:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:22] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33323/console" [puppet] - 10https://gerrit.wikimedia.org/r/754515 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [14:08:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T285149)', diff saved to https://phabricator.wikimedia.org/P18868 and previous config saved to /var/cache/conftool/dbconfig/20220119-140848-marostegui.json [14:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:11] (03PS2) 10Elukey: Expand helmfile_namespace_certs to support the ml-serve use case [deployment-charts] - 10https://gerrit.wikimedia.org/r/755333 (https://phabricator.wikimedia.org/T298976) [14:09:13] (03PS3) 10Elukey: Add cert-manager settings for the ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/755334 (https://phabricator.wikimedia.org/T298976) [14:10:00] (03CR) 10Elukey: "Thanks a lot for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/755333 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [14:10:54] (03PS4) 10Arturo Borrero Gonzalez: toolforge: deploy test suite [puppet] - 10https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) [14:11:32] (03CR) 10jerkins-bot: [V: 04-1] toolforge: deploy test suite [puppet] - 10https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [14:13:32] (03PS5) 10Arturo Borrero Gonzalez: toolforge: deploy test suite [puppet] - 10https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) [14:15:26] (03CR) 10jerkins-bot: [V: 04-1] toolforge: deploy test suite [puppet] - 10https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [14:16:23] (03PS6) 10Arturo Borrero Gonzalez: toolforge: deploy test suite [puppet] - 10https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) [14:17:17] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:17:47] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:17:53] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Make disabled insecure API the default on kubernetes masters [puppet] - 10https://gerrit.wikimedia.org/r/754515 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [14:18:16] (03CR) 10jerkins-bot: [V: 04-1] toolforge: deploy test suite [puppet] - 10https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [14:18:34] jbond: okay to merge "p:puppet_compiler: fix pcc (14b68e619d)" ? 
[14:19:28] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/755043 (https://phabricator.wikimedia.org/T299168) (owner: 10Cwhite) [14:20:37] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755391 (https://phabricator.wikimedia.org/T294339) (owner: 10Awight) [14:20:51] (03CR) 10jerkins-bot: [V: 04-1] [beta] Maps will be provided by the maps-experiments kartotherian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755391 (https://phabricator.wikimedia.org/T294339) (owner: 10Awight) [14:21:05] jbond: merged (looks harmless :)) [14:21:27] (03PS2) 10Awight: [beta] Maps will be provided by the maps-experiments kartotherian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755391 (https://phabricator.wikimedia.org/T294339) [14:23:06] (03CR) 10Awight: [C: 03+2] "Deploying to beta." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755391 (https://phabricator.wikimedia.org/T294339) (owner: 10Awight) [14:23:52] (03Merged) 10jenkins-bot: [beta] Maps will be provided by the maps-experiments kartotherian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755391 (https://phabricator.wikimedia.org/T294339) (owner: 10Awight) [14:23:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P18869 and previous config saved to /var/cache/conftool/dbconfig/20220119-142353-marostegui.json [14:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [14:29:05] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2013.codfw.wmnet with OS buster [14:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:39] !log jmm@cumin2002 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on ganeti1018.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [14:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1018.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [14:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [14:31:14] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mediawiki-httpd: add and configure mod_remoteip [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/754897 (https://phabricator.wikimedia.org/T297613) (owner: 10Giuseppe Lavagetto) [14:33:18] !log esams: upgrade varnish to 6.0.9 T298758 [14:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:22] T298758: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 [14:33:36] !log disabled insecure API on all k8s masters - T290967 [14:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:39] T290967: kube-apiserver need to reach webhooks running inside of the cluster - https://phabricator.wikimedia.org/T290967 [14:34:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:34:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1018.eqiad.wmnet with OS buster [14:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:26] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1018.eqiad.wmnet with OS buster [14:35:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:13] PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:38:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P18870 and previous config saved to /var/cache/conftool/dbconfig/20220119-143858-marostegui.json [14:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:12] (03PS1) 10Jbond: P:microsites::os_reports: update rsync to use -r instead of * [puppet] - 10https://gerrit.wikimedia.org/r/755395 (https://phabricator.wikimedia.org/T299520) [14:42:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33324/console" [puppet] - 10https://gerrit.wikimedia.org/r/755395 (https://phabricator.wikimedia.org/T299520) (owner: 10Jbond) [14:44:56] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) [14:50:19] (03CR) 10Jbond: "see comment" [puppet] - 10https://gerrit.wikimedia.org/r/755352 
(https://phabricator.wikimedia.org/T299358) (owner: 10Lucas Werkmeister (WMDE)) [14:53:00] 10SRE, 10Infrastructure-Foundations, 10Packaging, 10Patch-For-Review: microsites systemd job "Sync OS migration reports/overview" might be broken - https://phabricator.wikimedia.org/T299520 (10jbond) Thanks, this is not such a big issue for the os_reports as the '*' gets passed to the remotes rsync server... [14:53:04] 10SRE, 10Infrastructure-Foundations, 10Packaging, 10Patch-For-Review: microsites systemd job "Sync OS migration reports/overview" might be broken - https://phabricator.wikimedia.org/T299520 (10jbond) [14:53:13] (03CR) 10Lucas Werkmeister (WMDE): nginxlogs: Use shell to expand glob (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755352 (https://phabricator.wikimedia.org/T299358) (owner: 10Lucas Werkmeister (WMDE)) [14:54:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T285149)', diff saved to https://phabricator.wikimedia.org/P18871 and previous config saved to /var/cache/conftool/dbconfig/20220119-145402-marostegui.json [14:54:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:54:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:07] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [14:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T285149)', diff saved to https://phabricator.wikimedia.org/P18872 and previous config saved to /var/cache/conftool/dbconfig/20220119-145410-marostegui.json [14:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:14] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:03] !log robh@cumin1001 START - Cookbook sre.dns.netbox [14:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T285149)', diff saved to https://phabricator.wikimedia.org/P18873 and previous config saved to /var/cache/conftool/dbconfig/20220119-145525-marostegui.json [14:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:11] (03PS1) 10Jbond: P:openstack::base::keystone::fernet_keys: drop use of '*' in rsync [puppet] - 10https://gerrit.wikimedia.org/r/755396 (https://phabricator.wikimedia.org/T299519) [14:56:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33325/console" [puppet] - 10https://gerrit.wikimedia.org/r/755396 (https://phabricator.wikimedia.org/T299519) (owner: 10Jbond) [14:57:38] (03CR) 10Herron: remove references to centrallog2001 (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/754028 (https://phabricator.wikimedia.org/T298994) (owner: 10Herron) [14:57:59] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:17] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/755395 (https://phabricator.wikimedia.org/T299520) (owner: 10Jbond) [15:00:37] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1018.eqiad.wmnet with OS buster [15:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:43] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1018.eqiad.wmnet with OS buster executed with errors: - ganeti1018 (*... [15:01:24] !log migrate primary/secondary instances off ganeti1022 [15:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:35] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:microsites::os_reports: update rsync to use -r instead of * [puppet] - 10https://gerrit.wikimedia.org/r/755395 (https://phabricator.wikimedia.org/T299520) (owner: 10Jbond) [15:04:04] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:05:17] (03PS1) 10Ladsgroup: Avoid double parsing [extensions/FlaggedRevs] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755406 (https://phabricator.wikimedia.org/T292300) [15:06:09] 10SRE, 10Infrastructure-Foundations, 10Packaging, 10Patch-For-Review: microsites systemd job "Sync OS migration reports/overview" might be broken - https://phabricator.wikimedia.org/T299520 (10jbond) 05Open→03Resolved a:03jbond updated [15:06:43] (03CR) 10Lucas Werkmeister (WMDE): nginxlogs: Use shell to expand glob (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755352 (https://phabricator.wikimedia.org/T299358) (owner: 10Lucas Werkmeister (WMDE)) [15:06:52] (03PS2) 10Lucas Werkmeister (WMDE): nginxlogs: Move rsync globs to --include/--exclude [puppet] - 
10https://gerrit.wikimedia.org/r/755352 (https://phabricator.wikimedia.org/T299358) [15:07:10] PROBLEM - SSH on contint1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:07:34] (03CR) 10Jbond: [C: 03+2] C:monitoring: use ['lldp']['parent'] instead of lldp_parent [puppet] - 10https://gerrit.wikimedia.org/r/749152 (owner: 10Jbond) [15:07:50] !log updating lldp parent fact [15:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:20] (03PS2) 10Jbond: lldp: drop legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/749153 (https://phabricator.wikimedia.org/T289679) [15:10:24] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2013.codfw.wmnet with OS buster [15:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P18875 and previous config saved to /var/cache/conftool/dbconfig/20220119-151029-marostegui.json [15:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:20] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/754523 (owner: 10DCausse) [15:15:05] (03PS3) 10Vgutierrez: envoy: Allow configuring delayed_closed_timeout [puppet] - 10https://gerrit.wikimedia.org/r/755338 (https://phabricator.wikimedia.org/T271421) [15:15:07] (03PS3) 10Vgutierrez: cache::envoy: Set the delayed_close_timeout to 20s [puppet] - 10https://gerrit.wikimedia.org/r/755340 (https://phabricator.wikimedia.org/T271421) [15:15:36] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:16:04] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes 
https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:16:41] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2014.codfw.wmnet with OS buster [15:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:32] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/754991 (owner: 10Andrew Bogott) [15:19:08] RECOVERY - SSH on contint1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:19:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet [15:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:30] (03CR) 10Andrew Bogott: [C: 03+2] wmcs nfsclient: remove a long-absented mount [puppet] - 10https://gerrit.wikimedia.org/r/754991 (owner: 10Andrew Bogott) [15:22:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/755352 (https://phabricator.wikimedia.org/T299358) (owner: 10Lucas Werkmeister (WMDE)) [15:23:04] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/che [15:23:04] } (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.192.16.80:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.192.16.80:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:24:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet [15:24:12] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:58] (03CR) 10Andrew Bogott: Add cinder-backup role/profile for eqiad1, use on cloudbackup2002 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/755057 (https://phabricator.wikimedia.org/T292546) (owner: 10Andrew Bogott) [15:25:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P18876 and previous config saved to /var/cache/conftool/dbconfig/20220119-152534-marostegui.json [15:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:52] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/754043 (https://phabricator.wikimedia.org/T291405) (owner: 10Andrew Bogott) [15:29:49] (03PS2) 10Andrew Bogott: Add cinder-backup role/profile for eqiad1, use on cloudbackup2002 [puppet] - 10https://gerrit.wikimedia.org/r/755057 (https://phabricator.wikimedia.org/T292546) [15:34:32] (03PS2) 10Clare Ming: Update config for pilot wikis: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755038 (https://phabricator.wikimedia.org/T298519) [15:36:25] (03CR) 10jerkins-bot: [V: 04-1] Add cinder-backup role/profile for eqiad1, use on cloudbackup2002 [puppet] - 10https://gerrit.wikimedia.org/r/755057 (https://phabricator.wikimedia.org/T292546) (owner: 10Andrew Bogott) [15:36:28] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:37:01] (03CR) 10DCausse: [V: 03+2 C: 03+2] Revert "Add repository-swift plugin" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/741734 (https://phabricator.wikimedia.org/T295705) (owner: 10Ebernhardson) [15:39:37] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:40:27] jouncebot: nowandnext 
[15:40:27] No deployments scheduled for the next 3 hour(s) and 19 minute(s) [15:40:28] In 3 hour(s) and 19 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220119T1900) [15:40:28] In 3 hour(s) and 19 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220119T1900) [15:40:33] ooof, nice [15:40:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T285149)', diff saved to https://phabricator.wikimedia.org/P18877 and previous config saved to /var/cache/conftool/dbconfig/20220119-154039-marostegui.json [15:40:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [15:40:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [15:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:44] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [15:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T285149)', diff saved to https://phabricator.wikimedia.org/P18878 and previous config saved to /var/cache/conftool/dbconfig/20220119-154046-marostegui.json [15:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:49] PROBLEM - SSH on contint1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:59] !log cp5005,cp4025: upgrade varnish to 6.0.9 T298758 [15:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:02] T298758: Package and deploy Varnish 6.0.9 - 
https://phabricator.wikimedia.org/T298758 [15:42:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T285149)', diff saved to https://phabricator.wikimedia.org/P18879 and previous config saved to /var/cache/conftool/dbconfig/20220119-154201-marostegui.json [15:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:31] (03PS1) 10Ppchelko: Add temporary entrypoint for settings benchmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755399 [15:45:29] (03CR) 10CDanis: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/754880 (https://phabricator.wikimedia.org/T251156) (owner: 10Ayounsi) [15:46:37] (03PS2) 10AOkoth: kuberenetes: disable mwautopull timer [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T284628) [15:48:13] !log installing tiff security updates on stretch [15:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:21] (03CR) 10jenkins-bot: [V: 04-1] Add temporary entrypoint for settings benchmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755399 (owner: 10Ppchelko) [15:50:23] (03PS2) 10Ppchelko: Add temporary entrypoint for settings benchmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755399 [15:50:38] (03CR) 10Elukey: [C: 03+2] Expand helmfile_namespace_certs to support the ml-serve use case [deployment-charts] - 10https://gerrit.wikimedia.org/r/755333 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [15:50:42] (03CR) 10Elukey: [C: 03+2] Add cert-manager settings for the ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/755334 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [15:51:22] (03CR) 10Jbond: [C: 03+2] lldp: drop legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/749153 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [15:53:57] PROBLEM - k8s API server requests latencies on kubemaster1001 is 
CRITICAL: instance=10.64.0.117 verb={GET,LIST,PUT} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:53:59] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [15:54:24] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:48] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [15:54:49] (03PS1) 10Jbond: P:docker::reporter: make http_proxy optional [puppet] - 10https://gerrit.wikimedia.org/r/755403 [15:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:09] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:55:09] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [15:55:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33328/console" [puppet] - 10https://gerrit.wikimedia.org/r/755403 (owner: 10Jbond) [15:57:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P18881 and previous config saved to /var/cache/conftool/dbconfig/20220119-155706-marostegui.json [15:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:22] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10wiki_willy) a:03Cmjohnson Hi @Cmjohnson - just a heads up, this one is a bit higher priority. Thanks, Willy [15:58:18] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2014.codfw.wmnet with OS buster [15:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:59] (03PS3) 10Jbond: Update automatic Icinga LLDP hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/755342 (owner: 10Ayounsi) [15:59:23] (03CR) 10Jbond: [C: 03+1] "rebased and lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/755342 (owner: 10Ayounsi) [16:00:52] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase201[134].codfw.wmnet [16:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:11] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2015.codfw.wmnet with OS buster [16:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:48] (03PS1) 10Elukey: admin_ng: update tls secret settings for ml-serve [deployment-charts] - 
10https://gerrit.wikimedia.org/r/755431 (https://phabricator.wikimedia.org/T298976) [16:11:12] (03PS2) 10Elukey: admin_ng: update knative's tls secret settings for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/755431 (https://phabricator.wikimedia.org/T298976) [16:12:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P18882 and previous config saved to /var/cache/conftool/dbconfig/20220119-161212-marostegui.json [16:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:43] (03PS1) 10Phuedx: profile::manifests::cache::kafka::webrequest: Log Sec-CH-UA* headers [puppet] - 10https://gerrit.wikimedia.org/r/755435 (https://phabricator.wikimedia.org/T299401) [16:19:49] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:21:49] (03CR) 10Ppchelko: "See discussion on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/754911/12 for more context" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755399 (owner: 10Ppchelko) [16:22:27] something happened on contint1001 but I am not quite sure what [16:22:41] ssh went unresponsive triggering a notification above [16:23:11] large CPU usage and memory exhaustion apparently ( https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=contint1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ci&from=now-3h&to=now ) [16:26:37] (03CR) 10Elukey: [C: 03+2] admin_ng: update knative's tls secret settings for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/755431 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [16:27:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T285149)', diff saved to https://phabricator.wikimedia.org/P18883 and previous config saved to 
/var/cache/conftool/dbconfig/20220119-162717-marostegui.json [16:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:22] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [16:30:20] (03CR) 10Elukey: [V: 03+2 C: 03+2] admin_ng: update knative's tls secret settings for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/755431 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [16:31:59] (03PS6) 10Majavah: rabbitmq: Add support for listening on TLS [puppet] - 10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) [16:32:55] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:00] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:23] (03PS1) 10Muehlenhoff: Make ganeti1024 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/755440 (https://phabricator.wikimedia.org/T283036) [16:33:42] (03CR) 10Majavah: rabbitmq: Add support for listening on TLS (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [16:35:01] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [16:36:46] !log marking contint1001.wikimedia.org as offline in Jenkins since it is dramatically overloaded T299542 [16:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:50] T299542: contint1001.wikimedia.org is almost unresponsive - https://phabricator.wikimedia.org/T299542 [16:37:19] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps nfsclient: switch to using the VM-hosted scratch NFS server [puppet] - 
10https://gerrit.wikimedia.org/r/754043 (https://phabricator.wikimedia.org/T291405) (owner: 10Andrew Bogott) [16:37:25] (03PS5) 10Andrew Bogott: cloud-vps nfsclient: switch to using the VM-hosted scratch NFS server [puppet] - 10https://gerrit.wikimedia.org/r/754043 (https://phabricator.wikimedia.org/T291405) [16:39:58] (03PS1) 10Elukey: admin_ng: remove the secrets chart from knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/755441 (https://phabricator.wikimedia.org/T298976) [16:42:02] hashar: What's allowed to run directly on the host that runs node?! [16:42:25] wrapped in docker [16:42:31] I think that is for the pipelinelib publish step [16:42:37] Ah, OK, that's slightly less worrying, but still. [16:43:15] (03PS1) 10Muehlenhoff: sre.ganeti.addnode: Also check for the analytics bridge in eqiad [cookbooks] - 10https://gerrit.wikimedia.org/r/755442 [16:44:22] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:06] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:37] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [16:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:40] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [16:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:54] I am guessing the host has to be powercycled [16:47:11] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:02] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[16:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:36] (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/755352 (https://phabricator.wikimedia.org/T299358) (owner: 10Lucas Werkmeister (WMDE)) [16:50:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nick Ray - https://phabricator.wikimedia.org/T299186 (10nray) thank you! [16:53:32] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) @MoritzMuehlenhoff For clarity, I can do this now for ganeti1018? Are we doing these 1 at a time? Thanks [16:54:30] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2015.codfw.wmnet with OS buster [16:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:45] (03CR) 10Bking: [C: 03+1] cirrussearch: Reenable saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/752724 (https://phabricator.wikimedia.org/T295705) (owner: 10Ebernhardson) [16:54:55] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2015.codfw.wmnet [16:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:17] (03CR) 10Eevans: [C: 03+1] "If there are no objections, I could tackle this tomorrow..." [puppet] - 10https://gerrit.wikimedia.org/r/754988 (https://phabricator.wikimedia.org/T298516) (owner: 10Btullis) [16:55:34] 10ops-eqiad, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: contint1001.wikimedia.org is almost unresponsive - https://phabricator.wikimedia.org/T299542 (10hashar) Hello #ops-eqiad contint1001.wikimedia.org is unresponsive. Moritz tried to reach it out through the serial console but it... 
[16:55:35] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb={GET,PATCH,POST,PUT} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:55:40] 10ops-eqiad, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: contint1001.wikimedia.org is almost unresponsive - https://phabricator.wikimedia.org/T299542 (10Lucas_Werkmeister_WMDE) This was probably caused by https://integration.wikimedia.org/ci/job/termbox-pipeline-rehearse/92/console,... [16:55:51] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb={LIST,PUT} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:56:15] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2016.codfw.wmnet with OS buster [16:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:51] (03PS2) 10Ayounsi: Atlas exporter: add probes and traceroute mesurements [puppet] - 10https://gerrit.wikimedia.org/r/754880 (https://phabricator.wikimedia.org/T251156) [16:58:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:58:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:59:29] PROBLEM - DPKG on an-worker1090 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [16:59:39] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:59:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:01:10] (03CR) 10Ayounsi: [C: 03+2] Atlas exporter: add probes and traceroute mesurements [puppet] - 10https://gerrit.wikimedia.org/r/754880 (https://phabricator.wikimedia.org/T251156) (owner: 10Ayounsi) [17:02:09] (03CR) 10JMeybohm: [C: 03+1] kuberenetes: disable mwautopull timer [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T284628) (owner: 10AOkoth) [17:02:35] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation={create,get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [17:02:39] PROBLEM - Etcd cluster health on kubetcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [17:03:03] (03CR) 10Joal: [C: 03+1] "Thanks a lot for the patch Sam :) Do we wish to also add high entropy headers or not now?" [puppet] - 10https://gerrit.wikimedia.org/r/755435 (https://phabricator.wikimedia.org/T299401) (owner: 10Phuedx) [17:03:05] (03CR) 10Btullis: [V: 04-1 C: 04-1] "Unfortunately the PCC output shows that this is a bit more tricky than I had hoped." 
[puppet] - 10https://gerrit.wikimedia.org/r/754988 (https://phabricator.wikimedia.org/T298516) (owner: 10Btullis) [17:03:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:05:51] PROBLEM - Etcd cluster health on kubetcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [17:07:02] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Move multiple kubernetes keys to common with ::site variable [puppet] - 10https://gerrit.wikimedia.org/r/754551 (owner: 10JMeybohm) [17:07:58] (03CR) 10Elukey: [C: 04-1] profile::manifests::cache::kafka::webrequest: Log Sec-CH-UA* headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755435 (https://phabricator.wikimedia.org/T299401) (owner: 10Phuedx) [17:08:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:09:10] (03CR) 10Dzahn: "ok, thanks. Might change it some time in the future but only if it had some actual apache status data. Not just an empty file. 
Well, but i" [puppet] - 10https://gerrit.wikimedia.org/r/754881 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [17:09:39] (03PS7) 10Arturo Borrero Gonzalez: toolforge: deploy test suite [puppet] - 10https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) [17:09:49] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:09:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:09:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:10:19] 10SRE, 10MW-on-K8s, 10serviceops: Make all httpbb tests pass on the mwdebug deployment. - https://phabricator.wikimedia.org/T285298 (10Joe) 05Open→03Resolved Right now we have just 3 tests not passing: * the nonexistent test, which cannot work without an exposed http endpoint, so we don't really care. *... [17:10:29] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10Joe) [17:10:37] (03CR) 10Ottomata: [C: 03+2] Reset druid load jobs for network_flows_internal [puppet] - 10https://gerrit.wikimedia.org/r/754994 (https://phabricator.wikimedia.org/T263277) (owner: 10Joal) [17:10:45] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10Cmjohnson) @Marostegui Can we schedule this for me to power down tomorrow (20 Jan) 1530UTC? 
[17:10:47] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Reset druid load jobs for network_flows_internal [puppet] - 10https://gerrit.wikimedia.org/r/754994 (https://phabricator.wikimedia.org/T263277) (owner: 10Joal) [17:10:53] * jayme looking at kube etcd [17:10:55] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) Yes, ganeti1018 is ready to go now. In general I would add the servers which have been emptied/which are ready to go to the task... [17:11:02] (03PS8) 10Arturo Borrero Gonzalez: toolforge: deploy test suite [puppet] - 10https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) [17:11:15] (03CR) 10Elukey: [C: 04-1] profile::manifests::cache::kafka::webrequest: Log Sec-CH-UA* headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755435 (https://phabricator.wikimedia.org/T299401) (owner: 10Phuedx) [17:11:37] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) Sounds good, thank you for clarifying. I will take care of 1018 now [17:12:07] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10Marostegui) >>! In T299123#7633254, @Cmjohnson wrote: > @Marostegui Can we schedule this for me to power down tomorrow (20 Jan) 1530UTC? Sounds good. I will leave the host powered down for you, thanks! 
[17:13:16] (03CR) 10jenkins-bot: [V: 04-1] toolforge: deploy test suite [puppet] - 10https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [17:14:19] (03PS9) 10Arturo Borrero Gonzalez: toolforge: deploy test suite [puppet] - 10https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) [17:14:49] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:14:49] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:14:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:16:23] RECOVERY - Etcd cluster health on kubetcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [17:16:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T239814)', diff saved to https://phabricator.wikimedia.org/P18885 and previous config saved to /var/cache/conftool/dbconfig/20220119-171640-ladsgroup.json [17:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:45] T239814: Automate DB upgrades - https://phabricator.wikimedia.org/T239814 [17:17:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: deploy test suite [puppet] - 10https://gerrit.wikimedia.org/r/755321 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [17:17:39] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:17:41] (03PS1) 10Jbond: P:pki::multirootca: Make alerts critical [puppet] - 10https://gerrit.wikimedia.org/r/755444 [17:18:08] (03CR) 10Majavah: "pcc: https://puppet-compiler.wmflabs.org/pcc-worker1002/33329/" [puppet] - 10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [17:18:33] !log hnowlan@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase2016.codfw.wmnet with OS buster [17:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:12] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2016.codfw.wmnet with OS buster [17:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:49] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:19:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:20:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10RobH) [17:22:37] 10ops-eqiad, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: contint1001.wikimedia.org is almost unresponsive - https://phabricator.wikimedia.org/T299542 (10hashar) @Cmjohnson acknowledged the issue and will be able to restart the host in a couple hours. The services offered by contint1... 
[17:23:47] (03PS8) 10Herron: assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) [17:24:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:24:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:25:21] !log updating firmware, ganeti1018 T299527 [17:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:25] T299527: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 [17:25:49] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron) [17:26:16] <_joe_> !log powercycling contint1001 via ipmi, T299542 [17:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:19] T299542: contint1001.wikimedia.org is almost unresponsive - https://phabricator.wikimedia.org/T299542 [17:26:34] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:26:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:28:13] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 
operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [17:28:55] RECOVERY - SSH on contint1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:29:17] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:29:49] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:29:49] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:29:57] 10ops-eqiad, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: contint1001.wikimedia.org is almost unresponsive - https://phabricator.wikimedia.org/T299542 (10Joe) No need for further restarts, I was able to powercycle the server using ipmi. @Cmjohnson you don't need to do anything :) [17:30:09] 10ops-eqiad, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: contint1001.wikimedia.org is almost unresponsive - https://phabricator.wikimedia.org/T299542 (10Joe) 05Open→03Resolved [17:30:55] RECOVERY - DPKG on an-worker1090 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:30:57] 10ops-eqiad, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: contint1001.wikimedia.org is almost unresponsive - https://phabricator.wikimedia.org/T299542 (10hashar) Thank you very much for the powercycle. 
[17:31:07] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:31:19] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:31:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P18886 and previous config saved to /var/cache/conftool/dbconfig/20220119-173145-ladsgroup.json [17:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:11] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [17:32:19] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:32:34] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:32:39] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:32:49] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:32:59] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [17:33:22] (03CR) 10Daniel Kinzler: [C: 03+1] "We want this, and it makes sense to me, but I can't vouch for it working right." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755399 (owner: 10Ppchelko) [17:34:34] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:35:26] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2016.codfw.wmnet with OS buster [17:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:39] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2016.codfw.wmnet with OS buster [17:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:19] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:37:20] jouncebot: nowandnext [17:37:20] No deployments scheduled for the next 1 hour(s) and 22 minute(s) [17:37:20] In 1 hour(s) and 22 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220119T1900) [17:37:20] In 1 
hour(s) and 22 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220119T1900) [17:37:49] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:38:49] (03PS3) 10Majavah: Drop CentralAuthUserMerge log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754998 (https://phabricator.wikimedia.org/T216089) [17:38:52] (03CR) 10Majavah: [C: 03+2] Drop CentralAuthUserMerge log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754998 (https://phabricator.wikimedia.org/T216089) (owner: 10Majavah) [17:40:26] (03Merged) 10jenkins-bot: Drop CentralAuthUserMerge log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754998 (https://phabricator.wikimedia.org/T216089) (owner: 10Majavah) [17:41:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:42:08] (03PS2) 10Majavah: Disable UserMerge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754999 (https://phabricator.wikimedia.org/T216089) [17:42:43] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754998|Drop CentralAuthUserMerge log channel (T216089)]] (duration: 01m 05s) [17:42:44] (03PS1) 10Cmjohnson: Adding new backup1008 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/755448 (https://phabricator.wikimedia.org/T294974) [17:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:47] T216089: Undeploy UserMerge Extension from WMF production - https://phabricator.wikimedia.org/T216089 [17:42:56] (03CR) 10Majavah: [C: 03+2] Disable UserMerge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754999 
(https://phabricator.wikimedia.org/T216089) (owner: 10Majavah) [17:43:51] (03PS9) 10Herron: assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) [17:44:27] (03Merged) 10jenkins-bot: Disable UserMerge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754999 (https://phabricator.wikimedia.org/T216089) (owner: 10Majavah) [17:44:34] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [17:44:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:57] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) I updated the firmware on 1018 but also made the error of updating the idrac, the new idrac version needs to be rolled back. I am no long... 
[17:45:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:45:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:19] (03PS2) 10Jbond: P:pki::multirootca: Make alerts critical [puppet] - 10https://gerrit.wikimedia.org/r/755444 [17:46:49] (03CR) 10Cmjohnson: [C: 03+2] Adding new backup1008 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/755448 (https://phabricator.wikimedia.org/T294974) (owner: 10Cmjohnson) [17:46:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P18887 and previous config saved to /var/cache/conftool/dbconfig/20220119-174650-ladsgroup.json [17:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33339/console" [puppet] - 10https://gerrit.wikimedia.org/r/755444 (owner: 10Jbond) [17:46:59] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754999|Disable UserMerge (T216089)]] (duration: 00m 54s) [17:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:42] (03PS3) 10Majavah: [wikitech] Drop the 'cloudadmin' user group, no longer used and empty [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575390 (https://phabricator.wikimedia.org/T237890) (owner: 10Jforrester) [17:47:48] (03CR) 10Majavah: [C: 03+2] [wikitech] Drop the 'cloudadmin' user 
group, no longer used and empty [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575390 (https://phabricator.wikimedia.org/T237890) (owner: 10Jforrester) [17:49:25] (03Merged) 10jenkins-bot: [wikitech] Drop the 'cloudadmin' user group, no longer used and empty [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575390 (https://phabricator.wikimedia.org/T237890) (owner: 10Jforrester) [17:49:43] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::multirootca: Make alerts critical [puppet] - 10https://gerrit.wikimedia.org/r/755444 (owner: 10Jbond) [17:50:19] (03PS1) 10Jbond: Revert "role::pki::multirootca: add expiry for k8s_mlserve" [puppet] - 10https://gerrit.wikimedia.org/r/755408 [17:50:51] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:575390|[wikitech] Drop the cloudadmin user group, no longer used and empty (T237890)]] (duration: 00m 50s) [17:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:55] T237890: Remove and empty useless user groups - https://phabricator.wikimedia.org/T237890 [17:51:07] (03PS2) 10Jbond: Do NOT MERGE "role::pki::multirootca: add expiry for k8s_mlserve" [puppet] - 10https://gerrit.wikimedia.org/r/755408 [17:51:31] 10SRE, 10Infrastructure-Foundations: Enable drbd collector on ganeti nodes - https://phabricator.wikimedia.org/T299560 (10JMeybohm) [17:51:34] * taavi done [17:52:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:17] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2016.codfw.wmnet with OS buster [17:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33340/console" [puppet] - 
10https://gerrit.wikimedia.org/r/755408 (owner: 10Jbond) [17:54:11] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2016.codfw.wmnet with OS buster [17:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:56:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:07] !log beginning logstash apifeatureusage switchover T297239 [17:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:11] T297239: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 [17:59:04] (03CR) 10Volans: [C: 03+1] "LGTM, optional nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/755442 (owner: 10Muehlenhoff) [17:59:53] (03PS2) 10Eevans: Deploy the dev version of cassandra to aqs1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/754988 (https://phabricator.wikimedia.org/T298516) (owner: 10Btullis) [18:00:01] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T239814)', diff saved to https://phabricator.wikimedia.org/P18888 and previous config saved to /var/cache/conftool/dbconfig/20220119-180154-ladsgroup.json [18:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[18:01:59] T239814: Automate DB upgrades - https://phabricator.wikimedia.org/T239814 [18:02:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:12] (03CR) 10Ppchelko: "recheck" [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754910 (owner: 10Ppchelko) [18:03:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:03:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:32] (03PS10) 10Herron: assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) [18:06:42] 10SRE, 10Data-Engineering, 10Traffic, 10Patch-For-Review: VarnishKafka to propagate user agent client hints headers to webrequest - https://phabricator.wikimedia.org/T299401 (10phuedx) @JAllemandou: @elukey highlighted that we (Data Engineering and other stakeholders) should agree on the names for these he... 
[18:06:52] (03CR) 10Herron: [C: 03+2] assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron) [18:06:59] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1009.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:07:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_network_internal_flows_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:07:45] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1009.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:08:00] (03PS1) 10Lucas Werkmeister (WMDE): Configure `mul` language code on Test Wikidata and its clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755453 (https://phabricator.wikimedia.org/T297393) [18:08:00] PROBLEM - LVS logstash-json-tcp eqiad port 11514/tcp - Logstash ingestion json tcp IPv4 #page on logstash.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.36 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:08:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance [18:08:35] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33341/console" [puppet] - 10https://gerrit.wikimedia.org/r/754988 (https://phabricator.wikimedia.org/T298516) (owner: 10Btullis) [18:08:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db1110.eqiad.wmnet with reason: Maintenance [18:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:37] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "Stalled until I926117a9a7 has been merged and rolled out to at least one train." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755453 (https://phabricator.wikimedia.org/T297393) (owner: 10Lucas Werkmeister (WMDE)) [18:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T239814)', diff saved to https://phabricator.wikimedia.org/P18889 and previous config saved to /var/cache/conftool/dbconfig/20220119-180840-ladsgroup.json [18:08:43] herron: godog: ^^ [18:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:44] T239814: Automate DB upgrades - https://phabricator.wikimedia.org/T239814 [18:08:46] logstash alerts are from me [18:08:54] (03PS2) 10Phuedx: profile::cache::kafka::webrequest: Log Sec-CH-UA* headers [puppet] - 10https://gerrit.wikimedia.org/r/755435 (https://phabricator.wikimedia.org/T299401) [18:08:54] ^^ looks to be a part of the apifeatureusage migration [18:08:57] here [18:09:01] please ignore [18:09:09] ack thanks [18:09:13] ok [18:09:31] ignoring :) [18:09:43] !log ladsgroup@cumin1001 START - Cookbook sre.mysql.upgrade for db1110.eqiad.wmnet [18:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:47] 👍 [18:10:04] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2016.codfw.wmnet with OS buster [18:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:28] (03CR) 10Btullis: [V: 03+1] "Sadly that latest patchset didn't make any changes to the hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/754988 (https://phabricator.wikimedia.org/T298516) (owner: 10Btullis) [18:10:50] (03CR) 10Jbond: [C: 03+2] hieradata: pcc: add tools and toolsbeta [puppet] - 10https://gerrit.wikimedia.org/r/755313 (owner: 10Majavah) [18:11:11] (03CR) 10Jbond: [C: 03+2] "will merge" [puppet] - 10https://gerrit.wikimedia.org/r/755313 (owner: 10Majavah) [18:11:14] (03PS1) 10Ayounsi: Atlas Exporter: add esams/eqsin/ulsfo probes and traceroute [puppet] - 10https://gerrit.wikimedia.org/r/755455 (https://phabricator.wikimedia.org/T251156) [18:12:48] * volans ignoring too [18:12:50] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/pcc-worker1003/33342/netmon1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/755455 (https://phabricator.wikimedia.org/T251156) (owner: 10Ayounsi) [18:14:29] (03PS1) 10Herron: logstash: set logstash-json-tcp monitoring to non-critical [puppet] - 10https://gerrit.wikimedia.org/r/755456 (https://phabricator.wikimedia.org/T297239) [18:15:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1110.eqiad.wmnet [18:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:45] (03PS1) 10Andrew Bogott: More dummy passwords for eqiad1 cinder backups [labs/private] - 10https://gerrit.wikimedia.org/r/755457 [18:16:02] (03CR) 10Ayounsi: [C: 03+2] Atlas Exporter: add esams/eqsin/ulsfo probes and traceroute [puppet] - 10https://gerrit.wikimedia.org/r/755455 (https://phabricator.wikimedia.org/T251156) (owner: 10Ayounsi) [18:16:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T239814)', diff saved to https://phabricator.wikimedia.org/P18890 and previous config saved to /var/cache/conftool/dbconfig/20220119-181623-ladsgroup.json [18:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:28] T239814: Automate DB upgrades - https://phabricator.wikimedia.org/T239814 
[18:16:52] (03CR) 10Cwhite: [C: 03+1] logstash: set logstash-json-tcp monitoring to non-critical [puppet] - 10https://gerrit.wikimedia.org/r/755456 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron) [18:17:57] 10SRE, 10ops-eqiad, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: contint1001.wikimedia.org is almost unresponsive - https://phabricator.wikimedia.org/T299542 (10hashar) I think one of the follow up action is T290608 which is that obsolete intermediate Docker layers and containers a... [18:18:07] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] More dummy passwords for eqiad1 cinder backups [labs/private] - 10https://gerrit.wikimedia.org/r/755457 (owner: 10Andrew Bogott) [18:20:34] (03CR) 10CDanis: [C: 03+1] Atlas Exporter: add esams/eqsin/ulsfo probes and traceroute [puppet] - 10https://gerrit.wikimedia.org/r/755455 (https://phabricator.wikimedia.org/T251156) (owner: 10Ayounsi) [18:30:54] (03PS3) 10Andrew Bogott: Add cinder-backup role/profile for eqiad1, use on cloudbackup2002 [puppet] - 10https://gerrit.wikimedia.org/r/755057 (https://phabricator.wikimedia.org/T292546) [18:31:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P18891 and previous config saved to /var/cache/conftool/dbconfig/20220119-183128-ladsgroup.json [18:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:07] (03CR) 10jerkins-bot: [V: 04-1] Add cinder-backup role/profile for eqiad1, use on cloudbackup2002 [puppet] - 10https://gerrit.wikimedia.org/r/755057 (https://phabricator.wikimedia.org/T292546) (owner: 10Andrew Bogott) [18:36:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [18:39:38] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] 
Add cinder-backup role/profile for eqiad1, use on cloudbackup2002 [puppet] - 10https://gerrit.wikimedia.org/r/755057 (https://phabricator.wikimedia.org/T292546) (owner: 10Andrew Bogott) [18:41:11] (03PS1) 10Ayounsi: Fix RIPE Atlas icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/755460 (https://phabricator.wikimedia.org/T251156) [18:44:32] (03PS7) 10Dzahn: OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) [18:45:56] (03CR) 10AOkoth: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33346" [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [18:46:01] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/33345/" [puppet] - 10https://gerrit.wikimedia.org/r/755460 (https://phabricator.wikimedia.org/T251156) (owner: 10Ayounsi) [18:46:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P18892 and previous config saved to /var/cache/conftool/dbconfig/20220119-184632-ladsgroup.json [18:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:19] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2016.codfw.wmnet with OS buster [18:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:05] (03PS1) 10AOkoth: otrs: create new vrts hieradata [labs/private] - 10https://gerrit.wikimedia.org/r/755465 (https://phabricator.wikimedia.org/T293942) [18:58:13] (03CR) 10Dzahn: [C: 03+1] otrs: create new vrts hieradata [labs/private] - 10https://gerrit.wikimedia.org/r/755465 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [18:58:48] (03CR) 10AOkoth: [C: 03+2] otrs: create new vrts hieradata [labs/private] - 10https://gerrit.wikimedia.org/r/755465 (https://phabricator.wikimedia.org/T293942) (owner: 
10AOkoth) [18:58:56] (03CR) 10AOkoth: [V: 03+2 C: 03+2] otrs: create new vrts hieradata [labs/private] - 10https://gerrit.wikimedia.org/r/755465 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [18:59:06] 10SRE, 10Observability-Logging, 10Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron) apifeatureusage[12]001 are now live, but puppet is currently disabled on these hosts as a couple of small manual fixes had to be put in place... [18:59:59] (03PS1) 10Andrew Bogott: Correct pv names for cloudbackup2001/2 [puppet] - 10https://gerrit.wikimedia.org/r/755466 [19:00:05] jeena and twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220119T1900). [19:00:05] RoanKattouw and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220119T1900). [19:00:05] cjming and Pchelolo: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:19] hello! [19:00:25] hello! [19:00:40] cjming: do you wish to self-service, or should I deploy? 
[19:01:02] happy to self-service [19:01:06] then go ahead :) [19:01:32] (03CR) 10Clare Ming: [C: 03+2] Update config for pilot wikis: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755038 (https://phabricator.wikimedia.org/T298519) (owner: 10Clare Ming) [19:01:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T239814)', diff saved to https://phabricator.wikimedia.org/P18893 and previous config saved to /var/cache/conftool/dbconfig/20220119-190137-ladsgroup.json [19:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:43] T239814: Automate DB upgrades - https://phabricator.wikimedia.org/T239814 [19:01:45] Pchelolo: is it fine if I +2 your backport, and then let you handle it? [19:01:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [19:01:55] (03PS3) 10Clare Ming: Update config for pilot wikis: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755038 (https://phabricator.wikimedia.org/T298519) [19:02:16] (03CR) 10Andrew Bogott: [C: 03+2] Correct pv names for cloudbackup2001/2 [puppet] - 10https://gerrit.wikimedia.org/r/755466 (owner: 10Andrew Bogott) [19:02:27] (03PS2) 10Andrew Bogott: Correct pv names for cloudbackup2001/2 [puppet] - 10https://gerrit.wikimedia.org/r/755466 [19:03:34] (03PS1) 10Herron: logstash: move elk5 collectors to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/755467 (https://phabricator.wikimedia.org/T297239) [19:03:36] (03CR) 10Clare Ming: [C: 03+2] Update config for pilot wikis: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755038 (https://phabricator.wikimedia.org/T298519) (owner: 10Clare Ming) [19:04:00] (03CR) 10Herron: [C: 03+2] logstash: set logstash-json-tcp monitoring to non-critical [puppet] - 
10https://gerrit.wikimedia.org/r/755456 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron) [19:04:12] (03CR) 10Dzahn: [C: 03+1] "we made a copy of the existing secrets under the new name and did the same in labs/private as well. now puppet compiler is happy" [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [19:04:19] urbanecm: would you mind syncing it as well please? [19:04:33] Pchelolo: i have no idea what it does [19:04:33] (03Merged) 10jenkins-bot: Update config for pilot wikis: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755038 (https://phabricator.wikimedia.org/T298519) (owner: 10Clare Ming) [19:04:38] nothing [19:05:11] why are we backporting it then? [19:06:08] (03CR) 10AOkoth: [C: 03+2] OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [19:06:22] (03PS8) 10AOkoth: OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [19:06:40] (03CR) 10AOkoth: [V: 03+2 C: 03+2] OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [19:07:10] it does nothing until I merge a config change [19:07:24] just tested my patch on mwdebug1001 and lgtm so syncing now [19:07:43] Pchelolo: ah! Makes sense [19:07:51] (03CR) 10Urbanecm: [C: 03+2] First pass on creating config-schema.yaml [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754910 (owner: 10Ppchelko) [19:08:03] Pchelolo: in that case, I'll sync it. 
[19:08:09] cjming: let me know once you're done please :) [19:08:22] urbanecm: will do [19:09:37] !log cjming@deploy1002 Synchronized dblists/desktop-improvements.dblist: Config: [[gerrit:755038|Update config for pilot wikis: (T298519)]] (duration: 01m 09s) [19:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:41] T298519: Turn on desktop improvements on new set of pilot wikis - https://phabricator.wikimedia.org/T298519 [19:10:37] !log cjming@deploy1002 Synchronized wmf-config/config/ptwikinews.yaml: Config: [[gerrit:755038|Update config for pilot wikis: (T298519)]] (duration: 00m 50s) [19:10:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:16] (03PS1) 10Herron: profile::apifeatureusage::logstash: update ssl identification and index name [puppet] - 10https://gerrit.wikimedia.org/r/755468 (https://phabricator.wikimedia.org/T297239) [19:11:29] !log cjming@deploy1002 Synchronized wmf-config/config/viwiki.yaml: Config: [[gerrit:755038|Update config for pilot wikis: (T298519)]] (duration: 00m 49s) [19:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:46] (03PS3) 10Eevans: Deploy the dev version of cassandra to aqs1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/754988 (https://phabricator.wikimedia.org/T298516) (owner: 10Btullis) [19:11:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:11:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:01] cjming: you can 
sync the config folder by just doing `scap sync-file wmf-config/config message` [19:12:19] (w/o syncing all the individual changes) [19:12:24] thanks - i was wondering [19:12:26] !log cjming@deploy1002 Synchronized wmf-config/config/foundationwiki.yaml: Config: [[gerrit:755038|Update config for pilot wikis: (T298519)]] (duration: 00m 49s) [19:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:41] !log cjming@deploy1002 Synchronized wmf-config/config: message (duration: 00m 50s) [19:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:04] (03PS1) 10Joal: Fix network_flows_internal druid loading jobs [puppet] - 10https://gerrit.wikimedia.org/r/755470 (https://phabricator.wikimedia.org/T263277) [19:14:35] sorry cjming, i meant a justification message (like `Config: [[gerrit:755038|Update config for pilot wikis: (T298519)]]`), not the word "message". Too late for that 🙂 [19:14:35] (03CR) 10Joal: [C: 03+1] "thanks Andrew" [puppet] - 10https://gerrit.wikimedia.org/r/755470 (https://phabricator.wikimedia.org/T263277) (owner: 10Joal) [19:14:42] my apologies for not making that more clear [19:14:59] (03CR) 10Joal: Fix network_flows_internal druid loading jobs [puppet] - 10https://gerrit.wikimedia.org/r/755470 (https://phabricator.wikimedia.org/T263277) (owner: 10Joal) [19:15:20] urbanecm: sorry so the complete command is `scap sync-file wmf-config/config Config: [[gerrit:755038|Update config for pilot wikis: (T298519)]]` ? [19:15:21] T298519: Turn on desktop improvements on new set of pilot wikis - https://phabricator.wikimedia.org/T298519 [19:15:55] cjming: yeah. sync-file generally accepts directories just as it accepts files. 
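(Editor's note: a minimal sketch of why the log message in this exchange needs quoting. It assumes a POSIX-style shell on the deploy host; the helper name `build_sync_command` is hypothetical and not part of scap, and nothing here actually runs scap. It only shows how the arguments are split. Unquoted, the shell would treat `|` as a pipe and `(` as syntax, so the message has to reach `scap sync-file` as a single argument.)

```python
import shlex

# Strings copied from the conversation; used here purely for illustration.
target = "wmf-config/config"
message = "Config: [[gerrit:755038|Update config for pilot wikis: (T298519)]]"


def build_sync_command(target: str, message: str) -> str:
    """Hypothetical helper: build a shell-safe `scap sync-file` command line.

    shlex.quote leaves simple paths untouched and wraps anything containing
    spaces or shell metacharacters in single quotes.
    """
    return " ".join(["scap", "sync-file", shlex.quote(target), shlex.quote(message)])


command = build_sync_command(target, message)
print(command)

# Round-trip check: splitting the command the way a POSIX shell would
# yields exactly four arguments, with the message intact as the last one.
argv = shlex.split(command)
print(len(argv), argv[-1] == message)
```

The same splitting explains the earlier sync that was logged with the literal word "message": that word was simply the final argument, so it became the Server Admin Log entry.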
[19:16:10] (you'd probably need to wrap the message in '' though) [19:16:10] PROBLEM - LVS logstash-json-tcp eqiad port 11514/tcp - Logstash ingestion json tcp IPv4 on logstash.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.36 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:16:21] so `scap sync-file wmf-config/config 'Config: [[gerrit:755038|Update config for pilot wikis: (T298519)]]'` [19:16:59] 10Puppet, 10Infrastructure-Foundations, 10Project-Admins, 10PM: Clarify Puppet tag - https://phabricator.wikimedia.org/T295221 (10Aklapper) @joanna_borun: ping [19:17:13] urbanecm: thanks - running now - did i screw anything up with running non-sensical cmds? [19:17:32] (03CR) 10Eevans: "Ok, replicating all of settings seems to work. :/" [puppet] - 10https://gerrit.wikimedia.org/r/754988 (https://phabricator.wikimedia.org/T298516) (owner: 10Btullis) [19:17:33] !log cjming@deploy1002 Synchronized wmf-config/config: Config: [[gerrit:755038|Update config for pilot wikis: (T298519)]] (duration: 00m 49s) [19:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:38] not really. 
The message is only important because it appears in SAL [19:17:50] (03PS1) 10Andrew Bogott: Correct pv names for cloudbackup2001/2, again [puppet] - 10https://gerrit.wikimedia.org/r/755471 [19:17:56] (so people know what you did on the servers) [19:18:33] urbanecm: got it - gtk -- ok, all done - changes are live - all yours [19:18:42] thanks [19:18:45] * urbanecm waiting on CI [19:19:11] (03CR) 10Andrew Bogott: [C: 03+2] Correct pv names for cloudbackup2001/2, again [puppet] - 10https://gerrit.wikimedia.org/r/755471 (owner: 10Andrew Bogott) [19:19:41] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/pcc-worker1003/33351/" [puppet] - 10https://gerrit.wikimedia.org/r/755468 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron) [19:19:56] (03CR) 10Herron: [C: 03+2] profile::apifeatureusage::logstash: update ssl identification and index name [puppet] - 10https://gerrit.wikimedia.org/r/755468 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron) [19:22:58] (03PS2) 10Herron: logstash: move elk5 collectors to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/755467 (https://phabricator.wikimedia.org/T297239) [19:23:02] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/pcc-worker1001/33350/" [puppet] - 10https://gerrit.wikimedia.org/r/755467 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron) [19:24:31] (03CR) 10Phuedx: profile::cache::kafka::webrequest: Log Sec-CH-UA* headers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/755435 (https://phabricator.wikimedia.org/T299401) (owner: 10Phuedx) [19:25:19] (03CR) 10Herron: [C: 03+2] logstash: move elk5 collectors to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/755467 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron) [19:27:07] (03Merged) 10jenkins-bot: First pass on creating config-schema.yaml [core] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/754910 (owner: 10Ppchelko) [19:29:35] (03PS1) 10Andrew Bogott: Run 
cinder-backup-manager on cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/755472 [19:31:01] (03PS1) 10Dzahn: cumin: rename OTRS alias to VRTS after role rename [puppet] - 10https://gerrit.wikimedia.org/r/755473 (https://phabricator.wikimedia.org/T293942) [19:31:16] (03CR) 10Andrew Bogott: [C: 03+2] Run cinder-backup-manager on cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/755472 (owner: 10Andrew Bogott) [19:31:56] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2016.codfw.wmnet with OS buster [19:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:06] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2016.codfw.wmnet [19:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:30] (03CR) 10Dzahn: "compiled and deployed by Arnold in a pairing session. 
puppet noop on otrs1001 except the motd and the "role to teams" contacts data now sa" [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [19:34:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:34:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install wtp10[49-72] - https://phabricator.wikimedia.org/T299573 (10RobH) [19:34:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install wtp10[49-72] - https://phabricator.wikimedia.org/T299573 (10RobH) [19:35:15] (03CR) 10Dzahn: "cool! now that this is renamed, please also make another change to remove the old data" [labs/private] - 10https://gerrit.wikimedia.org/r/755465 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [19:35:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:09] (03CR) 10Dzahn: "Arnold also copied secrets in real private repo and fake secrets in labs/private repo that are currently alongside the previous secrets. 
R" [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [19:39:04] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-ssl_443: Servers logstash2005.codfw.wmnet, logstash2006.codfw.wmnet are marked down but pooled: kibana_80: Servers logstash2005.codfw.wmnet, logstash2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:40:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10RobH) [19:42:00] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7376 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:43:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:45:23] !log herron@puppetmaster1001 conftool action : set/pooled=no; selector: name=logstash2004.codfw.wmnet [19:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:36] !log herron@puppetmaster1001 conftool action : set/pooled=no; selector: name=logstash2005.codfw.wmnet [19:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:42] !log herron@puppetmaster1001 conftool action : set/pooled=no; selector: name=logstash2006.codfw.wmnet [19:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:46:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10RobH) [19:46:49] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10bcampbell) Hey @Dzahn can you please remove wikimania as an alias from the mail servers controlled by... [19:46:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10RobH) [19:47:08] urbanecm: sorry for jumping at you with my backport. I've had a meeting where and realized the conflict too late [19:47:21] np [19:47:24] !log herron@puppetmaster1001 conftool action : set/pooled=no; selector: name=logstash1007.eqiad.wmnet [19:47:26] I'm free now so I can handle it all myself [19:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:30] !log herron@puppetmaster1001 conftool action : set/pooled=no; selector: name=logstash1008.eqiad.wmnet [19:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:33] !log herron@puppetmaster1001 conftool action : set/pooled=no; selector: name=logstash1009.eqiad.wmnet [19:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:43] i failed to see it merged :D [19:48:07] it did. [19:48:13] ok, I'll go ahead and sync it [19:48:29] syncing already [19:48:38] oh ok, thank you! 
[19:49:27] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.18/includes/: ed5e634772d2821c6f61903f7341eef4f2fc4337: First pass on creating config-schema.yaml (duration: 01m 02s) [19:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:32] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.33:443, 10.2.1.33:80]) https://wikitech.wikimedia.org/wiki/PyBal [19:50:38] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.33:443, 10.2.1.33:80]) https://wikitech.wikimedia.org/wiki/PyBal [19:50:55] PROBLEM - LVS kibana codfw port 80/tcp - Kibana IPv4 #page on kibana.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.33 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:51:03] hi [19:51:24] Expected again? [19:51:29] sigh, false alarm please ignore [19:51:33] ok [19:51:34] 👍 [19:51:39] <_joe_> ack [19:52:01] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.36:11514, 10.2.2.36:12201, 10.2.2.33:80, 10.2.2.36:8324, 10.2.2.33:443]) https://wikitech.wikimedia.org/wiki/PyBal [19:52:02] PROBLEM - LVS kibana-ssl eqiad port 443/tcp - Kibana - HTTPS IPv4 #page on kibana.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.33 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:52:41] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.18/tests/phpunit/structure/SettingsTest.php: ed5e634772d2821c6f61903f7341eef4f2fc4337: First pass on creating config-schema.yaml (duration: 00m 49s) [19:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:45] Pchelolo: all live [19:53:14] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10RobH) 
[19:53:51] PROBLEM - LVS kibana eqiad port 80/tcp - Kibana IPv4 #page on kibana.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.33 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:54:01] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10RobH) [19:55:10] (03PS16) 10Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) [19:55:21] (03CR) 10Andrew Bogott: [C: 03+2] Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott) [19:56:06] <_joe_> can someone silence the alerts on splunk? [19:56:21] <_joe_> oh just got the resolution now :) [19:56:46] <_joe_> thanks jhathaway [19:56:56] yup [19:57:05] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.36:11514, 10.2.2.36:8324, 10.2.2.33:443, 10.2.2.33:80, 10.2.2.36:12201]) https://wikitech.wikimedia.org/wiki/PyBal [19:58:52] (03Merged) 10jenkins-bot: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott) [19:58:56] thank you urbanecm [19:59:04] np [19:59:06] (03PS1) 10Herron: set kibana and kibana-ssl monitoring to non-critical [puppet] - 10https://gerrit.wikimedia.org/r/755477 (https://phabricator.wikimedia.org/T281266) [20:00:04] jeena and twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220119T2000). [20:00:20] still backporting? [20:00:36] jeena: nope, just finished [20:00:38] all yours [20:00:43] okay thanks! 
[20:00:46] (03CR) 10Herron: [C: 03+2] set kibana and kibana-ssl monitoring to non-critical [puppet] - 10https://gerrit.wikimedia.org/r/755477 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [20:00:47] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7350 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:02:46] (03PS3) 10Aklapper: arywiki NS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747973 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [20:03:12] (03PS4) 10Aklapper: Change / add some namespaces and aliases on arywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747973 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [20:04:03] !log rebooting mx1001 to debug conntrack [20:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:08] (03PS1) 10Majavah: Add tls port for cloud vps rabbitmq [homer/public] - 10https://gerrit.wikimedia.org/r/755478 (https://phabricator.wikimedia.org/T297268) [20:05:03] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7236 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:05:25] (03PS1) 10Jeena Huneidi: group1 wikis to 1.38.0-wmf.18 refs T293959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755479 [20:05:27] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.38.0-wmf.18 refs T293959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755479 (owner: 10Jeena Huneidi) [20:06:29] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.18 refs T293959 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755479 (owner: 10Jeena Huneidi) [20:08:09] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.18 refs T293959 [20:08:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:13] T293959: 1.38.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T293959 [20:08:59] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.18 refs T293959 (duration: 00m 49s) [20:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:12:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:52] (03PS1) 10Herron: remove elk5 related LVS services [puppet] - 10https://gerrit.wikimedia.org/r/755480 (https://phabricator.wikimedia.org/T281266) [20:13:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:35] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7408 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:14:08] (03CR) 10Ottomata: [C: 03+2] Fix network_flows_internal druid loading jobs [puppet] - 10https://gerrit.wikimedia.org/r/755470 (https://phabricator.wikimedia.org/T263277) (owner: 10Joal) [20:16:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:16:39] (03PS2) 10Herron: remove elk5 related LVS services [puppet] - 
10https://gerrit.wikimedia.org/r/755480 (https://phabricator.wikimedia.org/T281266) [20:18:23] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:20:31] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:20:47] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:31:33] (03CR) 10Gehel: [C: 04-1] "We need to remove those servers from regex.yaml as well." [puppet] - 10https://gerrit.wikimedia.org/r/736119 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [20:40:52] PROBLEM - LVS kibana codfw port 80/tcp - Kibana IPv4 on kibana.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.33 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:43:16] PROBLEM - LVS kibana-ssl codfw port 443/tcp - Kibana - HTTPS IPv4 on kibana.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.33 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:45:40] PROBLEM - LVS kibana eqiad port 80/tcp - Kibana IPv4 on kibana.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.33 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:48:08] PROBLEM - LVS kibana-ssl eqiad port 443/tcp - Kibana - HTTPS IPv4 on kibana.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.33 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:50:55] (03CR) 10Andrew Bogott: [C: 03+2] rabbitmq: Add support for listening on TLS [puppet] - 
10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [20:51:34] ACKNOWLEDGEMENT - LVS kibana codfw port 80/tcp - Kibana IPv4 on kibana.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.33 and port 80: Connection refused Herron service being decommed https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:51:34] ACKNOWLEDGEMENT - LVS kibana-ssl codfw port 443/tcp - Kibana - HTTPS IPv4 on kibana.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.33 and port 443: Connection refused Herron service being decommed https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:51:34] ACKNOWLEDGEMENT - LVS kibana eqiad port 80/tcp - Kibana IPv4 on kibana.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.33 and port 80: Connection refused Herron service being decommed https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:51:35] ACKNOWLEDGEMENT - LVS kibana-ssl eqiad port 443/tcp - Kibana - HTTPS IPv4 on kibana.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.33 and port 443: Connection refused Herron service being decommed https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:51:35] ACKNOWLEDGEMENT - LVS logstash-json-tcp eqiad port 11514/tcp - Logstash ingestion json tcp IPv4 on logstash.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.36 and port 11514: Connection refused Herron service being decommed https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:51:35] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.36:11514, 10.2.2.36:8324, 10.2.2.33:443, 10.2.2.33:80, 10.2.2.36:12201]) Herron service being decommed https://wikitech.wikimedia.org/wiki/PyBal [20:51:35] ACKNOWLEDGEMENT - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-ssl_443: Servers logstash1009.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled: 
logstash-json-tcp_11514: Servers logstash1009.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled: kibana_80: Servers logstash1009.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled Herron service being decommed https://wikitech.wikimedia.org/wiki/PyBal [20:51:36] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.36:11514, 10.2.2.36:12201, 10.2.2.33:80, 10.2.2.36:8324, 10.2.2.33:443]) Herron service being decommed https://wikitech.wikimedia.org/wiki/PyBal [20:51:37] ACKNOWLEDGEMENT - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-ssl_443: Servers logstash1009.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled: logstash-json-tcp_11514: Servers logstash1009.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled: kibana_80: Servers logstash1009.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled Herron service being decommed https://wikitech.wikimedia.org/wiki/PyBal [20:51:37] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.33:443, 10.2.1.33:80]) Herron service being decommed https://wikitech.wikimedia.org/wiki/PyBal [20:51:38] ACKNOWLEDGEMENT - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-ssl_443: Servers logstash2005.codfw.wmnet, logstash2006.codfw.wmnet are marked down but pooled: kibana_80: Servers logstash2005.codfw.wmnet, logstash2006.codfw.wmnet are marked down but pooled Herron service being decommed https://wikitech.wikimedia.org/wiki/PyBal [20:51:39] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.33:443, 10.2.1.33:80]) Herron service being decommed https://wikitech.wikimedia.org/wiki/PyBal [20:52:15] !log depool mw1340 (api_appserver) for performance and php-apcu testing 
[20:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:54] 10SRE, 10Observability-Logging, 10Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron) [20:58:03] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10herron) [20:58:17] 10SRE, 10Observability-Logging, 10Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron) 05Open→03Resolved a:03herron >>! In T297239#7633975, @herron wrote: > apifeatureusage[12]001 are now live, but puppet is currently disab... [20:58:23] (03PS1) 10Majavah: rabbitmq: fix config syntax error [puppet] - 10https://gerrit.wikimedia.org/r/755488 [21:00:04] jeena and twentyafterfour: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220119T2000). [21:00:04] chrisalbon and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220119T2100). 
[21:00:26] (03CR) 10Andrew Bogott: [C: 03+2] rabbitmq: fix config syntax error [puppet] - 10https://gerrit.wikimedia.org/r/755488 (owner: 10Majavah) [21:01:30] (03PS1) 10Andrew Bogott: toolforge grid engine: install fdm [puppet] - 10https://gerrit.wikimedia.org/r/755489 (https://phabricator.wikimedia.org/T297683) [21:05:40] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:07:43] 10SRE-Access-Requests: Requesting LDAP-only access to analytics-private-data for Madalina Ana - https://phabricator.wikimedia.org/T299587 (10nshahquinn-wmf) [21:09:27] 10SRE-Access-Requests: Requesting LDAP-only access to analytics-private-data for Madalina Ana - https://phabricator.wikimedia.org/T299587 (10nshahquinn-wmf) @DannyH can you approve this access for Madalina? [21:10:12] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10herron) >>! In T281266#7634280, @gerritbot wrote: > Change 755480 had a related patch set uploaded (by Herron; author: Herron): > %%%[operations/puppet@produc... [21:20:12] (03CR) 10Krinkle: [C: 03+1] apache: Replace zero.wikipedia.org vhost alias with redirect [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: 10Jforrester) [21:20:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:21:58] (03PS1) 10Majavah: P:openstack::rabbitmq: add tls ports to firewall [puppet] - 10https://gerrit.wikimedia.org/r/755492 (https://phabricator.wikimedia.org/T297268) [21:22:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:27:28] thanks for the thanks James_F, and happy new year by the way :-) [21:29:46] hauskatze: Happy New Year too. :-) [21:30:38] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7369 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [21:42:28] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7398 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [21:42:47] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [21:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:13] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 26s) [21:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [21:48:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [21:53:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [21:58:55] (LogstashIngestSpike) resolved: (2) Logstash rate 
of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [22:06:56] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:07:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10DAbad) >>! In T298124#7586079, @Dzahn wrote: > @DAbad Thank you. Done! Is it ok if we also add you as the approver for the 2 groups "analytics-platform-eng-admins" and... [22:20:22] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7094 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [22:25:54] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:51:07] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10RobH) [22:51:27] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10RobH) [22:52:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10RobH) [22:52:55] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10Dzahn) 05Resolved→03Open Ok, thank you. we'll add you! 
[22:55:52] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7259 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [22:57:06] (03PS1) 10Dzahn: admin: add Desiree Abad as approver for platform-engineering groups [puppet] - 10https://gerrit.wikimedia.org/r/755500 (https://phabricator.wikimedia.org/T298124) [22:57:38] (03CR) 10Dzahn: [C: 03+1] "Who's approving the approver" [puppet] - 10https://gerrit.wikimedia.org/r/755500 (https://phabricator.wikimedia.org/T298124) (owner: 10Dzahn) [22:59:27] 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T299609 (10RobH) [23:01:06] 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T299609 (10RobH) @RKemper, >>! In T297645#7628400, @faidon wrote: > Quote LGTM. > > On the racking proposal, I'd say that this could potent... 
[23:01:18] 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T299609 (10RobH) [23:01:42] 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10RobH) [23:03:55] (03CR) 10Dzahn: [C: 03+2] add miscweb to disc_desired_state.py [puppet] - 10https://gerrit.wikimedia.org/r/753846 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [23:11:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install labstore100[89] - https://phabricator.wikimedia.org/T299610 (10RobH) [23:24:40] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:30:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:32:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:35:15] !log puppetmaster1001 - revoked puppet cert miscweb.discovery.wmnet; updated kube_services.crts.yaml to include static-bugzilla.wikimedia.org, removed miscweb.discovery.wmnet.crt and .csr.pem, used cergen to check and regenerate cert, committed in private repo, ran puppet on deploy1001 - checked cert in /etc/helmfile-defaults/private/main_services/miscweb/eqiad.yaml with 'openssl x509 [23:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:21] -noout -text -in .. | grep DNS'. now has static-bz on it. 
(T281538) [23:35:22] T281538: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 [23:35:59] !log deploy1002 - checked freshly generated cert in /etc/helmfile-defaults/private/main_services/miscweb/eqiad.yaml with 'openssl x509 -noout -text -in .. | grep DNS'. now has static-bz on it. (T281538) [23:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:07] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jhathaway) >>! In T236954#7629565, @colewhite wrote: > Deviations are sometimes necessary to maintain human-readability. When deviations are neces... [23:52:38] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7163 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [23:59:42] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7304 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data