[00:07:39] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [00:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:57] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:49] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: discard_held_messages.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:22] (03PS1) 10BryanDavis: toolhub: Bump container version to 2022-03-15-002555-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/770638 (https://phabricator.wikimedia.org/T268774) [00:36:18] (03CR) 10BryanDavis: toolhub: Bump container version to 2022-03-15-002555-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/770638 (https://phabricator.wikimedia.org/T268774) (owner: 10BryanDavis) [00:44:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T300775)', diff saved to https://phabricator.wikimedia.org/P22461 and previous config saved to /var/cache/conftool/dbconfig/20220315-004445-marostegui.json [00:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:50] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [00:48:31] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 76 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:48:39] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 74 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:59:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P22462 and previous config saved to /var/cache/conftool/dbconfig/20220315-005950-marostegui.json [00:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220315T0100) [01:02:05] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 62 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:09:05] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 56 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:13:03] 10SRE, 10Security-Team, 10Stewards-and-global-tools: Investigate the practice of making thousands of global blocks per day on Meta-Wiki - https://phabricator.wikimedia.org/T303774 (10GeneralNotability) >>! 
In T303774#7775964, @MZMcBride wrote: > Likely related: !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P22463 and previous config saved to /var/cache/conftool/dbconfig/20220315-011455-marostegui.json [01:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:04] 10SRE, 10ops-eqsin, 10DC-Ops: Q2(Need By: TBD) rack/setup/install new mr1-eqsin - https://phabricator.wikimedia.org/T294872 (10RobH) [01:26:59] !log tstarling@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/CentralAuth/maintenance/populateGlobalEditCount.php: fix script bug gerrit 770058 (duration: 00m 50s) [01:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:06] 10SRE, 10ops-eqsin, 10DC-Ops: Q2(Need By: TBD) rack/setup/install new mr1-eqsin - https://phabricator.wikimedia.org/T294872 (10RobH) The new mr1 is racked up and in netbox, as well as on scs port 10. /var/tmp/usb is the mount for the usb stick, and the image file is already copied over to /var/tmp just in c... 
[01:30:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T300775)', diff saved to https://phabricator.wikimedia.org/P22464 and previous config saved to /var/cache/conftool/dbconfig/20220315-013000-marostegui.json [01:30:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance [01:30:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance [01:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:30:05] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [01:30:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [01:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:30:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [01:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:30:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T300775)', diff saved to https://phabricator.wikimedia.org/P22465 and previous config saved to /var/cache/conftool/dbconfig/20220315-013013-marostegui.json [01:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:53] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 75%, RTA = 3216.45 ms [01:47:27] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 349 probes of 745 (alerts on 35) - 
https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:48:41] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 125 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:00:04] James_F: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Manual branching of MediaWiki, extensions, skins, and vendor for REL1_38 – see T302909 deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220315T0200). [02:00:05] T302909: Branch REL1_38 for MediaWiki and all extensions and skins - https://phabricator.wikimedia.org/T302909 [02:00:09] Whee. [02:00:16] Waiting for the bot first. [02:00:23] (CC Reedy in case you're around and care. ;-)) [02:00:34] ohai [02:02:37] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 73.83 ms [02:05:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:37] Reedy: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/770658 if you want to be FRIST PSOT in REL1_39. 
;-) [02:05:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:05:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [02:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:43] PROBLEM - ensure kvm processes are running on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:07:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.38.0-wmf.26 [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770661 [02:07:19] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.38.0-wmf.26 [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770661 (owner: 10TrainBranchBot) [02:07:21] Whee. [02:07:25] OK, time for me to get started. 
[02:07:55] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 8 probes of 745 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:09:07] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 58 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:26:53] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 andrew bogott ongoing re-image woes https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:26:58] (03Merged) 10jenkins-bot: Branch commit for wmf/1.38.0-wmf.26 [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770661 (owner: 10TrainBranchBot) [02:27:01] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:27:25] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:31:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:32:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:53:33] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service,rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:11:46] 10SRE, 10Security-Team, 10Stewards-and-global-tools: Investigate the practice of making thousands of global blocks per day on Meta-Wiki - https://phabricator.wikimedia.org/T303774 (10MZMcBride) I'm a volunteer and my IRC bot was working its way through a very large queue due to these blocks. Some user many y... [03:45:48] 10SRE, 10SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10TThoabala) [03:47:15] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:10:55] 10SRE, 10SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10TThoabala) [05:24:37] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:29:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T300775)', diff saved to https://phabricator.wikimedia.org/P22466 and previous config saved to /var/cache/conftool/dbconfig/20220315-052935-marostegui.json [05:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:41] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [05:44:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P22467 and previous config saved to /var/cache/conftool/dbconfig/20220315-054440-marostegui.json [05:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P22468 and previous config saved to /var/cache/conftool/dbconfig/20220315-055945-marostegui.json [05:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [06:08:29] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:11:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [06:11:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [06:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T300775)', diff saved to https://phabricator.wikimedia.org/P22469 and previous config saved to /var/cache/conftool/dbconfig/20220315-061450-marostegui.json [06:14:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance [06:14:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 
db1162.eqiad.wmnet with reason: Maintenance [06:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:54] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [06:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T300775)', diff saved to https://phabricator.wikimedia.org/P22470 and previous config saved to /var/cache/conftool/dbconfig/20220315-061458-marostegui.json [06:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 1%: After reboot', diff saved to https://phabricator.wikimedia.org/P22471 and previous config saved to /var/cache/conftool/dbconfig/20220315-061626-root.json [06:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:35] (03PS1) 10Marostegui: db1166: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/770831 (https://phabricator.wikimedia.org/T300600) [06:25:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166', diff saved to https://phabricator.wikimedia.org/P22472 and previous config saved to /var/cache/conftool/dbconfig/20220315-062543-marostegui.json [06:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:12] !log dbmaint on s3@eqiad T300600 [06:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:15] T300600: Upgrade s3 to Bullseye - https://phabricator.wikimedia.org/T300600 [06:26:53] (03CR) 10Marostegui: [C: 03+2] db1166: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/770831 (https://phabricator.wikimedia.org/T300600) (owner: 10Marostegui) [06:28:22] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.reimage for host db1166.eqiad.wmnet with OS bullseye [06:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:53] (03PS1) 10Marostegui: change_page_touched_T298557.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/770832 (https://phabricator.wikimedia.org/T298557) [06:31:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 5%: After reboot', diff saved to https://phabricator.wikimedia.org/P22473 and previous config saved to /var/cache/conftool/dbconfig/20220315-063130-root.json [06:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:50] (03PS2) 10Marostegui: change_page_touched_T298557.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/770832 (https://phabricator.wikimedia.org/T298557) [06:38:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki-history-drop-snapshot.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1166.eqiad.wmnet with reason: host reimage [06:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1166.eqiad.wmnet with reason: host reimage [06:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 10%: After reboot', diff saved to https://phabricator.wikimedia.org/P22474 and previous config saved to /var/cache/conftool/dbconfig/20220315-064634-root.json [06:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:30] (03PS1) 10Marostegui: Revert "db1166: Disable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/770063 [06:53:20] (03CR) 10Marostegui: [C: 03+2] Revert "db1166: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/770063 (owner: 10Marostegui) [06:53:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22475 and previous config saved to /var/cache/conftool/dbconfig/20220315-065337-root.json [06:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1166.eqiad.wmnet with OS bullseye [06:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] Amir1, awight, Urbanecm, and taavi: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220315T0700). [07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:01:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 25%: After reboot', diff saved to https://phabricator.wikimedia.org/P22476 and previous config saved to /var/cache/conftool/dbconfig/20220315-070138-root.json [07:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:57] i can deploy today [07:03:09] kart_: unless you want to self-service [07:06:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [07:06:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [07:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298563)', diff saved to https://phabricator.wikimedia.org/P22477 and previous config saved to /var/cache/conftool/dbconfig/20220315-070635-marostegui.json [07:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:41] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [07:08:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22478 and previous config saved to /var/cache/conftool/dbconfig/20220315-070841-root.json [07:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 50%: After reboot', diff saved to https://phabricator.wikimedia.org/P22479 and previous config saved to /var/cache/conftool/dbconfig/20220315-071642-root.json [07:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[07:23:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22480 and previous config saved to /var/cache/conftool/dbconfig/20220315-072345-root.json [07:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:13] urbanecm: it seems I scheduled it wrong. It should be tomorrow. [07:29:31] kart_: no problem with me :) [07:29:38] let's wait for tomorrow then [07:30:44] Fixed deployment page. Sorry! [07:31:21] Because the patch requires wmf.26 to be available [07:31:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 75%: After reboot', diff saved to https://phabricator.wikimedia.org/P22481 and previous config saved to /var/cache/conftool/dbconfig/20220315-073146-root.json [07:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:49] i.e. group0 needs to be on wmf.26 for that. [07:38:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22482 and previous config saved to /var/cache/conftool/dbconfig/20220315-073849-root.json [07:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:25] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:43:54] !log restart kube-api server on ml-serve-ctrl2002 - 504 responses registered, corresponding to high custom resource definition requests [07:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 100%: After reboot', diff saved to https://phabricator.wikimedia.org/P22483 and previous config saved to /var/cache/conftool/dbconfig/20220315-074650-root.json [07:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 1%: After reboot', diff saved to https://phabricator.wikimedia.org/P22484 and previous config saved to /var/cache/conftool/dbconfig/20220315-074825-root.json [07:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22485 and previous config saved to /var/cache/conftool/dbconfig/20220315-075353-root.json [07:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T300775)', diff saved to https://phabricator.wikimedia.org/P22486 and previous config saved to /var/cache/conftool/dbconfig/20220315-075402-marostegui.json [07:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:06] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [08:00:20] (03PS1) 10Marostegui: db1166: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/770875 (https://phabricator.wikimedia.org/T300473) [08:01:25] (03PS2) 10Marostegui: db1161: 
Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/770875 (https://phabricator.wikimedia.org/T300473) [08:01:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1161', diff saved to https://phabricator.wikimedia.org/P22487 and previous config saved to /var/cache/conftool/dbconfig/20220315-080128-marostegui.json [08:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 5%: After reboot', diff saved to https://phabricator.wikimedia.org/P22488 and previous config saved to /var/cache/conftool/dbconfig/20220315-080329-root.json [08:03:32] (03PS1) 10Majavah: hieradata: add codesearch 'wmcs' instance [puppet] - 10https://gerrit.wikimedia.org/r/770877 [08:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:51] (03CR) 10Marostegui: [C: 03+2] db1161: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/770875 (https://phabricator.wikimedia.org/T300473) (owner: 10Marostegui) [08:05:01] (03PS1) 10MMandere: varnish: enable docker build from cache [puppet] - 10https://gerrit.wikimedia.org/r/770878 (https://phabricator.wikimedia.org/T303794) [08:05:05] !log dbmaint on s5@eqiad T300473 [08:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:10] T300473: Upgrade s5 to Bullseye - https://phabricator.wikimedia.org/T300473 [08:06:15] (03PS1) 10Elukey: Set bullseye + overlayfs for kubernetes2006 [puppet] - 10https://gerrit.wikimedia.org/r/770879 (https://phabricator.wikimedia.org/T300744) [08:06:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298563)', diff saved to https://phabricator.wikimedia.org/P22489 and previous config saved to /var/cache/conftool/dbconfig/20220315-080651-marostegui.json [08:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:56] T298563: Fix mismatching field 
type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [08:07:28] 10SRE, 10Security-Team, 10Stewards-and-global-tools: Investigate the practice of making thousands of global blocks per day on Meta-Wiki - https://phabricator.wikimedia.org/T303774 (10Bugreporter) I really want to ask: Why we does not have a bot to globally block known proxies? It may be meaningful to bring S... [08:08:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22490 and previous config saved to /var/cache/conftool/dbconfig/20220315-080857-root.json [08:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P22491 and previous config saved to /var/cache/conftool/dbconfig/20220315-080907-marostegui.json [08:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:04] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:13:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1161.eqiad.wmnet with OS bullseye [08:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:00] (03CR) 10Muehlenhoff: [C: 03+2] Stop pinning the TGC cookie to the user agent and IP address [puppet] - 10https://gerrit.wikimedia.org/r/769753 (https://phabricator.wikimedia.org/T273858) (owner: 10Muehlenhoff) [08:18:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 10%: After reboot', diff saved to https://phabricator.wikimedia.org/P22492 and previous config saved to /var/cache/conftool/dbconfig/20220315-081835-root.json [08:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:20:55] (03CR) 10JMeybohm: [C: 03+1] Set bullseye + overlayfs for kubernetes2006 [puppet] - 10https://gerrit.wikimedia.org/r/770879 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [08:21:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P22493 and previous config saved to /var/cache/conftool/dbconfig/20220315-082157-marostegui.json [08:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:26] (03PS1) 10Majavah: hieradata: remove unused deployment-prep scap targets [puppet] - 10https://gerrit.wikimedia.org/r/770880 [08:24:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22495 and previous config saved to /var/cache/conftool/dbconfig/20220315-082401-root.json [08:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P22496 and previous config saved to /var/cache/conftool/dbconfig/20220315-082412-marostegui.json [08:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1161.eqiad.wmnet with reason: host reimage [08:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1161.eqiad.wmnet with reason: host reimage [08:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:49] (03CR) 10Elukey: [C: 03+2] Set bullseye + overlayfs for kubernetes2006 [puppet] - 10https://gerrit.wikimedia.org/r/770879 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [08:33:39] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 25%: After reboot', diff saved to https://phabricator.wikimedia.org/P22497 and previous config saved to /var/cache/conftool/dbconfig/20220315-083338-root.json [08:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:45] (03PS1) 10Muehlenhoff: Failover IDP to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/770881 [08:37:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P22498 and previous config saved to /var/cache/conftool/dbconfig/20220315-083701-marostegui.json [08:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:30] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:38:19] (03PS1) 10Marostegui: Revert "db1161: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/770064 [08:39:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T300775)', diff saved to https://phabricator.wikimedia.org/P22499 and previous config saved to /var/cache/conftool/dbconfig/20220315-083917-marostegui.json [08:39:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [08:39:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [08:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:22] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [08:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:39:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T300775)', diff saved to https://phabricator.wikimedia.org/P22500 and previous config saved to /var/cache/conftool/dbconfig/20220315-083925-marostegui.json [08:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:05] (03CR) 10Giuseppe Lavagetto: [C: 03+1] varnish: enable docker build from cache [puppet] - 10https://gerrit.wikimedia.org/r/770878 (https://phabricator.wikimedia.org/T303794) (owner: 10MMandere) [08:40:26] (KubernetesCalicoDown) firing: kubernetes2006.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [08:42:11] 10SRE, 10Security-Team, 10Stewards-and-global-tools: Investigate the practice of making thousands of global blocks per day on Meta-Wiki - https://phabricator.wikimedia.org/T303774 (10MZMcBride) >>! In T303774#7776704, @Bugreporter wrote: > I really want to ask: Why we does not have a bot to globally block kn... 
[08:42:41] (03CR) 10Marostegui: [C: 03+2] Revert "db1161: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/770064 (owner: 10Marostegui) [08:42:50] (03CR) 10MMandere: [C: 03+2] varnish: enable docker build from cache [puppet] - 10https://gerrit.wikimedia.org/r/770878 (https://phabricator.wikimedia.org/T303794) (owner: 10MMandere) [08:42:52] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:44:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1161.eqiad.wmnet with OS bullseye [08:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:17] (03CR) 10Muehlenhoff: [C: 03+2] Failover IDP to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/770881 (owner: 10Muehlenhoff) [08:44:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22501 and previous config saved to /var/cache/conftool/dbconfig/20220315-084425-root.json [08:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 50%: After reboot', diff saved to https://phabricator.wikimedia.org/P22502 and previous config saved to /var/cache/conftool/dbconfig/20220315-084842-root.json [08:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:40] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@a85cf25] (codfw): Switchover to eqiad tegola on eqiad env [08:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:23] !log dbmaint on s5@eqiad T297189 [08:50:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1095:3315', diff saved to https://phabricator.wikimedia.org/P22503 and 
previous config saved to /var/cache/conftool/dbconfig/20220315-085026-marostegui.json [08:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:27] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [08:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298563)', diff saved to https://phabricator.wikimedia.org/P22504 and previous config saved to /var/cache/conftool/dbconfig/20220315-085206-marostegui.json [08:52:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [08:52:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [08:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:11] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [08:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298563)', diff saved to https://phabricator.wikimedia.org/P22505 and previous config saved to /var/cache/conftool/dbconfig/20220315-085214-marostegui.json [08:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:03] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@a85cf25] (codfw): Switchover to eqiad tegola on eqiad env (duration: 03m 22s) [08:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:57] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@a85cf25] (eqiad): Switchover to eqiad tegola on 
eqiad env [08:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:52] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@a85cf25] (eqiad): Switchover to eqiad tegola on eqiad env (duration: 01m 55s) [08:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22506 and previous config saved to /var/cache/conftool/dbconfig/20220315-085929-root.json [08:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:59:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:36] (03PS1) 10KartikMistry: Disable ContentTranslation for non-extended confirmed users on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770882 (https://phabricator.wikimedia.org/T299636) [09:03:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 75%: After reboot', diff saved to https://phabricator.wikimedia.org/P22507 and previous config saved to /var/cache/conftool/dbconfig/20220315-090346-root.json [09:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:15] PROBLEM - IPv6 ping to ulsfo on 
ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 70 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:07:42] PROBLEM - MariaDB Replica Lag: s5 #page on db1096 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 974.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:07:48] checking [09:07:57] * volans here [09:08:04] thx, let us know if you need help [09:08:05] * Emperor here [09:08:06] related to the account propagation? [09:08:28] nope [09:08:33] related to a schema change that failed [09:08:44] should recover in a bit [09:08:49] ack [09:09:32] the host is depooled anyways [09:09:35] so no user impact [09:09:36] Speaking of ack, I acked the page [09:09:41] thanks sobanski [09:09:48] RECOVERY - MariaDB Replica Lag: s5 #page on db1096 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:09:52] \o/ [09:10:06] I'm assuming it's ok to resolve the incident as well [09:10:10] yep [09:10:12] here [09:10:19] nvm, auto-resolved [09:10:20] Amir1: too slow old man ;p [09:10:26] (KubernetesCalicoDown) resolved: kubernetes2006.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [09:10:38] Emperor: :D [09:10:39] 🐢 [09:10:43] that was intentional [09:11:02] the schema change should have downtimed the host [09:11:21] Amir1: i didn't use the script, as it was the testing for the flaggedrevs script [09:11:29] aah [09:11:30] ok [09:11:57] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:13:33] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 60 probes of 662 (alerts on 65) - 
https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:13:53] checking user impact, seems it was very low- only errors on timer [09:14:06] jynus: as I said above, the host was depooled, no impact [09:14:16] ah, I didn't get that, sorry [09:14:29] no worries! [09:14:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22509 and previous config saved to /var/cache/conftool/dbconfig/20220315-091433-root.json [09:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:38] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [09:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 100%: After reboot', diff saved to https://phabricator.wikimedia.org/P22510 and previous config saved to /var/cache/conftool/dbconfig/20220315-091850-root.json [09:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P22511 and previous config saved to /var/cache/conftool/dbconfig/20220315-091906-root.json [09:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:46] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:29:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22512 and previous config saved to /var/cache/conftool/dbconfig/20220315-092937-root.json 
[09:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P22513 and previous config saved to /var/cache/conftool/dbconfig/20220315-093410-root.json [09:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:44] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye [09:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22514 and previous config saved to /var/cache/conftool/dbconfig/20220315-094441-root.json [09:44:42] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:20] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:49:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P22515 and previous config saved to /var/cache/conftool/dbconfig/20220315-094914-root.json [09:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:27] (03CR) 10David Caro: "The error seems not to be related to the patch, probably caused by the rebase, looking, reviews are still welcome though" 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770519 (https://phabricator.wikimedia.org/T297090) (owner: 10David Caro) [09:58:07] (03PS16) 10Jelto: gitlab_runner: restrict docker traffic with additional ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) [09:59:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22516 and previous config saved to /var/cache/conftool/dbconfig/20220315-095945-root.json [09:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:22] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34289/console" [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [10:04:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P22517 and previous config saved to /var/cache/conftool/dbconfig/20220315-100418-root.json [10:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [10:07:14] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: restrict docker traffic with additional ferm rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [10:07:34] PROBLEM - puppet last run on gitlab-runner2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:07:40] (03PS17) 10Jelto: gitlab_runner: restrict docker traffic 
with additional ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) [10:08:22] (03CR) 10Btullis: [C: 03+1] Fix Cumin alias for an-tool* [puppet] - 10https://gerrit.wikimedia.org/r/767711 (owner: 10Muehlenhoff) [10:08:44] (03CR) 10Btullis: [C: 03+1] Also include staging server in analytics-tools Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/766567 (owner: 10Muehlenhoff) [10:13:56] !log start of foreachwikiindblist all maintenance/refreshImageMetadata.php --force --verbose --mediatype=AUDIO --sleep 2 --oldimage (T226311) [10:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:00] T226311: Some WebM video files are misdetected as audio files due to the MIME detector not scanning enough bytes - https://phabricator.wikimedia.org/T226311 [10:14:26] RECOVERY - puppet last run on gitlab-runner2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:14:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22518 and previous config saved to /var/cache/conftool/dbconfig/20220315-101449-root.json [10:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:37] 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1024.eqiad.wmnet DHCP problems - https://phabricator.wikimedia.org/T303773 (10Aklapper) [Please add project tags under project tags instead of subscribers - thanks!] [10:17:40] 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 fails to pxe boot - https://phabricator.wikimedia.org/T303776 (10Aklapper) [Please add project tags under project tags instead of subscribers - thanks!] 
[10:19:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298563)', diff saved to https://phabricator.wikimedia.org/P22519 and previous config saved to /var/cache/conftool/dbconfig/20220315-101900-marostegui.json [10:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:05] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [10:19:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P22520 and previous config saved to /var/cache/conftool/dbconfig/20220315-101922-root.json [10:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:26] (03PS1) 10AikoChou: ml-services: update arwiki editquality predictor image [deployment-charts] - 10https://gerrit.wikimedia.org/r/770886 (https://phabricator.wikimedia.org/T301766) [10:27:40] (03PS1) 10Jbond: O:installserver::light: add logstash logging to all install servers [puppet] - 10https://gerrit.wikimedia.org/r/770887 [10:28:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34290/console" [puppet] - 10https://gerrit.wikimedia.org/r/770887 (owner: 10Jbond) [10:29:25] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:installserver::light: add logstash logging to all install servers [puppet] - 10https://gerrit.wikimedia.org/r/770887 (owner: 10Jbond) [10:30:23] (03CR) 10Ayounsi: [C: 03+1] "lgtm if PCC is heppy!" 
[puppet] - 10https://gerrit.wikimedia.org/r/770887 (owner: 10Jbond) [10:34:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P22521 and previous config saved to /var/cache/conftool/dbconfig/20220315-103405-marostegui.json [10:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:11] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (10jbond) [10:37:28] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (10jbond) [10:42:55] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) [10:43:00] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (10ayounsi) [10:43:08] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) [10:44:31] (03PS1) 10Arturo Borrero Gonzalez: wmcs: openstack: reduce verbosity and clean up set-maintenance cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770888 [10:45:06] urbanecm: hi, I am trying to find out what kind of changes are normally added to `operations/mediawiki-config` during a backport and someone mentioned you'd be the right person to ask. I know developers add feature toggles there, but I was wondering if there's anything else [10:46:56] jnuche: hi, basically whatever MW needs to know :). It can have configuration for MW/extensions (like, usergroups that exist) or configuration we need (like DB name or DB username). 
It has quite a lot of things – is there anything in particular you're interested in? [10:47:14] (03CR) 10jerkins-bot: [V: 04-1] wmcs: openstack: reduce verbosity and clean up set-maintenance cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770888 (owner: 10Arturo Borrero Gonzalez) [10:47:25] https://meta.wikimedia.org/wiki/Requesting_wiki_configuration_changes lists a few typical changes, but that's written from the community's POV (what changes they need to request via a phab ticket, basically) [10:49:11] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [10:49:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P22522 and previous config saved to /var/cache/conftool/dbconfig/20220315-104910-marostegui.json [10:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:29] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [10:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:18] well, we also seem to update the config during our regular RelEng train deployments and I was wondering about how it is decided what goes where. 
But if the backports are used for so many different things, I get the feeling I'm missing something about our own train deployments, I'll keep looking there [10:51:28] thanks for that wiki link though, that's helpful :) [10:55:11] jnuche: train deployments don't handle the config repo [10:55:12] (03PS1) 10Ladsgroup: mariadb: DRY username of wikiuser to hiera [puppet] - 10https://gerrit.wikimedia.org/r/770890 [10:55:18] RhinosF1: they do touch it though [10:55:20] wikiversions.json [10:55:40] urbanecm: yeah promote is in there, but no other changes would [10:56:05] (03PS2) 10Ladsgroup: mariadb: DRY username of wikiuser to hiera [puppet] - 10https://gerrit.wikimedia.org/r/770890 [10:56:53] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/770890 (owner: 10Ladsgroup) [10:57:25] (03CR) 10jerkins-bot: [V: 04-1] mariadb: DRY username of wikiuser to hiera [puppet] - 10https://gerrit.wikimedia.org/r/770890 (owner: 10Ladsgroup) [10:57:45] jnuche: backport and config windows aren't the only windows where config changes can happen. Only "regular" changes happen there, but notable changes often have their own window (either regular, such as train, or scheduled on as-needed basis) [10:58:21] for instance, when we create a new wiki, a lot of the repo needs to be changed to include it, but it's done in an extra window [10:58:35] https://wikitech.wikimedia.org/wiki/Deployments/Inclusion_criteria has some documentation about what should have its own window and what can go through B&C [11:00:20] (it's not rigidly followed though, i might message Tyler and discuss a possible update to follow reality better) [11:02:35] urbanecm: I see, thank you! 
I think I need to dig into `scap` a bit more to see exactly how the config repo is handled, but all that background helps [11:02:55] RhinosF1: thank you too [11:03:05] jnuche: there are deployment trainings every Thursday in case you want to show up there :) [11:04:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298563)', diff saved to https://phabricator.wikimedia.org/P22523 and previous config saved to /var/cache/conftool/dbconfig/20220315-110416-marostegui.json [11:04:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [11:04:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [11:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:20] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [11:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298563)', diff saved to https://phabricator.wikimedia.org/P22524 and previous config saved to /var/cache/conftool/dbconfig/20220315-110423-marostegui.json [11:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:28] jnuche: and in case you have questions outside of Thursdays, happy to answer 'em too :D [11:05:24] urbanecm: nice, I wasn't aware of those, I'll join one of the sessions at some point (they are way out of my working hours unfortunately) [11:05:26] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:05:30] appreciated :) [11:05:30] (03PS1) 
10Jelto: gitlab_runner: add missing hiera entry for WMCS [puppet] - 10https://gerrit.wikimedia.org/r/770891 (https://phabricator.wikimedia.org/T295481) [11:06:32] disadvantage of being international :/ [11:06:33] (03PS3) 10Ladsgroup: mariadb: DRY username of wikiuser to hiera [puppet] - 10https://gerrit.wikimedia.org/r/770890 [11:08:12] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/770890 (owner: 10Ladsgroup) [11:09:22] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34291/console" [puppet] - 10https://gerrit.wikimedia.org/r/770891 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [11:10:26] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: add missing hiera entry for WMCS [puppet] - 10https://gerrit.wikimedia.org/r/770891 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [11:11:12] (03PS4) 10Ladsgroup: mariadb: DRY username of wikiuser to hiera [puppet] - 10https://gerrit.wikimedia.org/r/770890 [11:11:49] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/770890 (owner: 10Ladsgroup) [11:15:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1150.eqiad.wmnet with reason: Maintenance [11:15:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1150.eqiad.wmnet with reason: Maintenance [11:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:22] (03PS1) 10Jbond: O:kafka::logging: ensure that all base classes are initiated first [puppet] - 10https://gerrit.wikimedia.org/r/770892 [11:16:40] (03PS5) 10Ladsgroup: mariadb: DRY username of wikiuser to hiera [puppet] - 10https://gerrit.wikimedia.org/r/770890 [11:16:55] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/770890 
(owner: 10Ladsgroup) [11:17:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34292/console" [puppet] - 10https://gerrit.wikimedia.org/r/770892 (owner: 10Jbond) [11:17:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2110.codfw.wmnet with reason: Maintenance [11:17:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2110.codfw.wmnet with reason: Maintenance [11:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 12 hosts with reason: Maintenance [11:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 12 hosts with reason: Maintenance [11:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:20:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:30] (03CR) 10Ladsgroup: "PCC seems happy https://puppet-compiler.wmflabs.org/pcc-worker1003/1233/" [puppet] - 10https://gerrit.wikimedia.org/r/770890 (owner: 10Ladsgroup) [11:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:51] (03PS6) 10Phuedx: Request high-entropy Sec-CH-UA* client hints [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) [11:22:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime 
for 2:00:00 on db1121.eqiad.wmnet with reason: Maintenance [11:22:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1121.eqiad.wmnet with reason: Maintenance [11:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T298743)', diff saved to https://phabricator.wikimedia.org/P22525 and previous config saved to /var/cache/conftool/dbconfig/20220315-112308-ladsgroup.json [11:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:11] T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743 [11:25:15] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (10ayounsi) Thanks for looking into that @jbond! As we now have [[ https://logstash.wikimedia.org/app/dashboards#/view/58c908a0-a394-11ec-bf8e-43f1807d5bc2? | better auditabili... 
[11:27:36] (03PS1) 10Jelto: Revert "gitlab_runner: add missing hiera entry for WMCS" [puppet] - 10https://gerrit.wikimedia.org/r/770065 [11:27:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298743)', diff saved to https://phabricator.wikimedia.org/P22526 and previous config saved to /var/cache/conftool/dbconfig/20220315-112754-ladsgroup.json [11:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:22] (03CR) 10Jelto: [C: 03+2] Revert "gitlab_runner: add missing hiera entry for WMCS" [puppet] - 10https://gerrit.wikimedia.org/r/770065 (owner: 10Jelto) [11:29:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] [buildservice] Add a cookbook to update the needed images (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770519 (https://phabricator.wikimedia.org/T297090) (owner: 10David Caro) [11:29:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Refactor dologmsg [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770547 (https://phabricator.wikimedia.org/T297090) (owner: 10David Caro) [11:29:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "decide on a style :-)" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770548 (https://phabricator.wikimedia.org/T297090) (owner: 10David Caro) [11:31:46] (03PS1) 10Jelto: gitlab_runner: add missing hiera entry for WMCS [puppet] - 10https://gerrit.wikimedia.org/r/770893 (https://phabricator.wikimedia.org/T295481) [11:36:07] (03CR) 10Jelto: [C: 03+2] gitlab_runner: add missing hiera entry for WMCS [puppet] - 10https://gerrit.wikimedia.org/r/770893 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [11:38:16] (03PS2) 10Jbond: O:kafka::logging: ensure that all base classes are initiated first [puppet] - 10https://gerrit.wikimedia.org/r/770892 [11:38:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34293/console" [puppet] - 
https://gerrit.wikimedia.org/r/770892 (owner: Jbond) [11:41:50] (PS3) Jbond: O:kafka::logging: ensure that all base classes are initiated first [puppet] - https://gerrit.wikimedia.org/r/770892 [11:41:52] (PS1) Jbond: P:java: add explicit dependency for java class [puppet] - https://gerrit.wikimedia.org/r/770894 [11:42:30] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34294/console" [puppet] - https://gerrit.wikimedia.org/r/770892 (owner: Jbond) [11:43:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P22527 and previous config saved to /var/cache/conftool/dbconfig/20220315-114259-ladsgroup.json [11:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:25] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 23): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34295/console" [puppet] - https://gerrit.wikimedia.org/r/770894 (owner: Jbond) [11:48:53] (PS1) Elukey: kserve-inference: fix custom image template variable [deployment-charts] - https://gerrit.wikimedia.org/r/770895 [11:49:22] (PS1) Muehlenhoff: Various Stretch tracking updates [puppet] - https://gerrit.wikimedia.org/r/770896 [11:51:27] (PS1) David Caro: sre.hosts.provision: remove double space [cookbooks] - https://gerrit.wikimedia.org/r/770898 [11:52:17] (CR) David Caro: "See https://integration.wikimedia.org/ci/job/tox-docker/24364/console" [cookbooks] - https://gerrit.wikimedia.org/r/770898 (owner: David Caro) [11:52:31] (CR) Muehlenhoff: [C: +2] Various Stretch tracking updates [puppet] - https://gerrit.wikimedia.org/r/770896 (owner: Muehlenhoff) [11:54:10] (CR) Elukey: [C: +1] P:java: add explicit dependency for java class [puppet] - https://gerrit.wikimedia.org/r/770894 (owner: Jbond) [11:54:15] (CR) Volans: [C: 
+2] "Thanks for the fix." [cookbooks] - https://gerrit.wikimedia.org/r/770898 (owner: David Caro) [11:56:44] (Merged) jenkins-bot: sre.hosts.provision: remove double space [cookbooks] - https://gerrit.wikimedia.org/r/770898 (owner: David Caro) [11:58:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P22528 and previous config saved to /var/cache/conftool/dbconfig/20220315-115804-ladsgroup.json [11:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:19] (CR) Jbond: [V: +1 C: +2] P:java: add explicit dependency for java class [puppet] - https://gerrit.wikimedia.org/r/770894 (owner: Jbond) [12:07:16] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:13:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298743)', diff saved to https://phabricator.wikimedia.org/P22529 and previous config saved to /var/cache/conftool/dbconfig/20220315-121309-ladsgroup.json [12:13:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1141.eqiad.wmnet with reason: Maintenance [12:13:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1141.eqiad.wmnet with reason: Maintenance [12:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:15] T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743 [12:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T298743)', diff saved to https://phabricator.wikimedia.org/P22530 and previous config saved to /var/cache/conftool/dbconfig/20220315-121317-ladsgroup.json [12:13:19] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:25] (PS3) Muehlenhoff: Fix Cumin alias for an-tool* [puppet] - https://gerrit.wikimedia.org/r/767711 [12:15:38] (CR) Muehlenhoff: [C: +2] Fix Cumin alias for an-tool* [puppet] - https://gerrit.wikimedia.org/r/767711 (owner: Muehlenhoff) [12:17:57] (PS2) Muehlenhoff: Require Python 3.7/buster for logout scripts [puppet] - https://gerrit.wikimedia.org/r/767064 [12:24:05] (CR) David Caro: buildservice: Add some sal logs when updating the base images (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770548 (https://phabricator.wikimedia.org/T297090) (owner: David Caro) [12:24:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298563)', diff saved to https://phabricator.wikimedia.org/P22531 and previous config saved to /var/cache/conftool/dbconfig/20220315-122421-marostegui.json [12:24:24] !log updating Exim on mx2001 T303738 [12:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:26] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [12:24:27] (CR) David Caro: [buildservice] Add a cookbook to update the needed images (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770519 (https://phabricator.wikimedia.org/T297090) (owner: David Caro) [12:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298743)', diff saved to https://phabricator.wikimedia.org/P22532 and previous config saved to /var/cache/conftool/dbconfig/20220315-122748-ladsgroup.json [12:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:52] T298743: Apply alter for transcode_time_* 
columns on wmf wikis - https://phabricator.wikimedia.org/T298743 [12:28:23] (PS2) David Caro: [buildservice] Add a cookbook to update the needed images [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770519 (https://phabricator.wikimedia.org/T297090) [12:28:25] (PS2) David Caro: Refactor dologmsg [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770547 (https://phabricator.wikimedia.org/T297090) [12:28:27] (PS2) David Caro: buildservice: Add some sal logs when updating the base images [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770548 (https://phabricator.wikimedia.org/T297090) [12:29:13] (PS2) Ladsgroup: Add guw to langlist helper [dns] - https://gerrit.wikimedia.org/r/770565 (https://phabricator.wikimedia.org/T303727) (owner: Gerrit maintenance bot) [12:29:21] (CR) Ladsgroup: [V: +2 C: +2] Add guw to langlist helper [dns] - https://gerrit.wikimedia.org/r/770565 (https://phabricator.wikimedia.org/T303727) (owner: Gerrit maintenance bot) [12:31:16] (PS1) Giuseppe Lavagetto: varnish: get blocked-nets from etcd [puppet] - https://gerrit.wikimedia.org/r/770905 [12:31:23] (CR) David Caro: [C: +2] buildservice: Add some sal logs when updating the base images [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770548 (https://phabricator.wikimedia.org/T297090) (owner: David Caro) [12:31:27] (CR) David Caro: [C: +2] Refactor dologmsg [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770547 (https://phabricator.wikimedia.org/T297090) (owner: David Caro) [12:31:30] (CR) David Caro: [C: +2] [buildservice] Add a cookbook to update the needed images [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770519 (https://phabricator.wikimedia.org/T297090) (owner: David Caro) [12:32:38] (CR) jerkins-bot: [V: -1] varnish: get blocked-nets from etcd [puppet] - https://gerrit.wikimedia.org/r/770905 (owner: Giuseppe Lavagetto) [12:34:32] (Merged) jenkins-bot: [buildservice] Add a 
cookbook to update the needed images [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770519 (https://phabricator.wikimedia.org/T297090) (owner: David Caro) [12:34:34] (Merged) jenkins-bot: Refactor dologmsg [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770547 (https://phabricator.wikimedia.org/T297090) (owner: David Caro) [12:34:36] (Merged) jenkins-bot: buildservice: Add some sal logs when updating the base images [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770548 (https://phabricator.wikimedia.org/T297090) (owner: David Caro) [12:35:02] (PS2) Giuseppe Lavagetto: varnish: get blocked-nets from etcd [puppet] - https://gerrit.wikimedia.org/r/770905 [12:36:20] (CR) Ladsgroup: [C: -1] change_page_touched_T298557.py: New schema change (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/770832 (https://phabricator.wikimedia.org/T298557) (owner: Marostegui) [12:36:23] (CR) jerkins-bot: [V: -1] varnish: get blocked-nets from etcd [puppet] - https://gerrit.wikimedia.org/r/770905 (owner: Giuseppe Lavagetto) [12:38:24] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/770907 (https://phabricator.wikimedia.org/T286991) (owner: Awight) [12:38:31] (CR) Awight: "This change is ready for review." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/770906 (owner: Awight) [12:39:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P22533 and previous config saved to /var/cache/conftool/dbconfig/20220315-123926-marostegui.json [12:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:51] (PS3) Marostegui: change_page_touched_T298557.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/770832 (https://phabricator.wikimedia.org/T298557) [12:41:58] (CR) Marostegui: change_page_touched_T298557.py: New schema change (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/770832 (https://phabricator.wikimedia.org/T298557) (owner: Marostegui) [12:42:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P22534 and previous config saved to /var/cache/conftool/dbconfig/20220315-124253-ladsgroup.json [12:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T300775)', diff saved to https://phabricator.wikimedia.org/P22535 and previous config saved to /var/cache/conftool/dbconfig/20220315-124342-marostegui.json [12:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:46] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [12:45:38] (CR) Ladsgroup: [C: +1] change_page_touched_T298557.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/770832 (https://phabricator.wikimedia.org/T298557) (owner: Marostegui) [12:47:35] (CR) Marostegui: [C: +2] change_page_touched_T298557.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/770832 
(https://phabricator.wikimedia.org/T298557) (owner: Marostegui) [12:47:58] (Merged) jenkins-bot: change_page_touched_T298557.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/770832 (https://phabricator.wikimedia.org/T298557) (owner: Marostegui) [12:48:43] !log removed 170 corrupt rows in flaggedtemplates in dewiki (T297189) [12:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:47] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [12:52:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [12:52:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [12:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:28] (CR) Jbond: "LGTM however it would be nice to split this into two patches 1 to add the new path and a second one to move the vendor modules to that pat" [puppet] - https://gerrit.wikimedia.org/r/770099 (https://phabricator.wikimedia.org/T302423) (owner: JHathaway) [12:52:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298557)', diff saved to https://phabricator.wikimedia.org/P22536 and previous config saved to /var/cache/conftool/dbconfig/20220315-125228-marostegui.json [12:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:33] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [12:54:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P22537 and previous config saved to 
/var/cache/conftool/dbconfig/20220315-125431-marostegui.json [12:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P22538 and previous config saved to /var/cache/conftool/dbconfig/20220315-125758-ladsgroup.json [12:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:03] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: Ottomata) [12:58:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P22539 and previous config saved to /var/cache/conftool/dbconfig/20220315-125847-marostegui.json [12:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:53] (PS3) Giuseppe Lavagetto: varnish: get blocked-nets from etcd [puppet] - https://gerrit.wikimedia.org/r/770905 [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: Dear deployers, time to do the UTC afternoon backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220315T1300). [13:00:05] WMDE-Fisch: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:48] I can deploy. [13:00:56] I'm here too [13:00:59] but awight go ahead :) [13:01:04] awight: Go for it [13:01:13] :-) I was also going to do some beta configs afterwards. [13:01:16] (CR) jerkins-bot: [V: -1] varnish: get blocked-nets from etcd [puppet] - https://gerrit.wikimedia.org/r/770905 (owner: Giuseppe Lavagetto) [13:02:02] (CR) Awight: [C: +2] "Deploying." 
[extensions/TemplateWizard] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/770057 (https://phabricator.wikimedia.org/T303524) (owner: WMDE-Fisch) [13:04:10] (Merged) jenkins-bot: Fix copy-paste mistake in template search widget [extensions/TemplateWizard] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/770057 (https://phabricator.wikimedia.org/T303524) (owner: WMDE-Fisch) [13:07:02] !log removed 440 more corrupt rows in flaggedtemplates in dewiki (T297189) [13:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:06] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [13:08:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:58] WMDE-Fisch: ready to test on mwdebug1001 [13:09:35] awight: currently still looking for some good test case -.- [13:09:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298563)', diff saved to https://phabricator.wikimedia.org/P22540 and previous config saved to /var/cache/conftool/dbconfig/20220315-130936-marostegui.json [13:09:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:09:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:41] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [13:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:59] WMDE-Fisch: dewiki "Legende 
Kulturdenkmal ..." maybe [13:11:27] (PS4) Giuseppe Lavagetto: varnish: get blocked-nets from etcd [puppet] - https://gerrit.wikimedia.org/r/770905 [13:11:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:11:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:17] (PS2) Arturo Borrero Gonzalez: wmcs: openstack: reduce verbosity and clean up set-maintenance cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770888 [13:12:21] awight: I see a difference [13:12:24] so it works [13:12:37] thanks, continuing now [13:12:47] (CR) jerkins-bot: [V: -1] varnish: get blocked-nets from etcd [puppet] - https://gerrit.wikimedia.org/r/770905 (owner: Giuseppe Lavagetto) [13:13:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298743)', diff saved to https://phabricator.wikimedia.org/P22541 and previous config saved to /var/cache/conftool/dbconfig/20220315-131303-ladsgroup.json [13:13:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1142.eqiad.wmnet with reason: Maintenance [13:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1142.eqiad.wmnet with reason: Maintenance [13:13:07] T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743 [13:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T298743)', diff saved to https://phabricator.wikimedia.org/P22542 and previous config saved to /var/cache/conftool/dbconfig/20220315-131311-ladsgroup.json [13:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:31] (CR) Elukey: [C: +1] "Cole: I think that this change is safe to rollout, I am trying to generate the ca bundle jks in deployment prep but so race condition seem" [puppet] - https://gerrit.wikimedia.org/r/770892 (owner: Jbond) [13:13:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P22543 and previous config saved to /var/cache/conftool/dbconfig/20220315-131352-marostegui.json [13:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:23] (CR) WMDE-Fisch: [C: +1] [beta] Remove unused config overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/770906 (owner: Awight) [13:14:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1095:3315', diff saved to https://phabricator.wikimedia.org/P22544 and previous config saved to /var/cache/conftool/dbconfig/20220315-131436-marostegui.json [13:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:45] (CR) WMDE-Fisch: [C: +1] [beta] Disable new sidebar and improved template search [mediawiki-config] - https://gerrit.wikimedia.org/r/770907 (https://phabricator.wikimedia.org/T286991) (owner: Awight) [13:15:07] !log awight@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/TemplateWizard/resources/ext.TemplateWizard.SearchField.js: Backport: [[gerrit:770057|Fix copy-paste mistake in template search widget (T303524)]] (duration: 00m 49s) [13:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:10] T303524: 
TemplateWizard: Template not found on Wikimedia Commons - https://phabricator.wikimedia.org/T303524 [13:15:38] awight: beta patches are reviewed ;-) [13:15:46] (PS5) Giuseppe Lavagetto: varnish: get blocked-nets from etcd [puppet] - https://gerrit.wikimedia.org/r/770905 [13:16:18] WMDE-Fisch: ty--after discussing with lilients, I'm making a small change there... [13:17:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298743)', diff saved to https://phabricator.wikimedia.org/P22545 and previous config saved to /var/cache/conftool/dbconfig/20220315-131736-ladsgroup.json [13:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298557)', diff saved to https://phabricator.wikimedia.org/P22546 and previous config saved to /var/cache/conftool/dbconfig/20220315-132005-marostegui.json [13:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:10] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [13:20:14] (PS1) Arturo Borrero Gonzalez: wmcs: openstack: reduce verbosity and clean up unset-maintenance cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770909 [13:20:47] (PS2) Awight: [beta] Remove unused config overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/770906 [13:20:49] (PS2) Awight: [beta] Disable new sidebar and improved template search [mediawiki-config] - https://gerrit.wikimedia.org/r/770907 (https://phabricator.wikimedia.org/T286991) [13:20:56] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1023.eqiad.wmnet with OS bullseye [13:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:45] (PS1) Muehlenhoff: Update point of contact for Thumbor [puppet] - 
https://gerrit.wikimedia.org/r/770910 (https://phabricator.wikimedia.org/T294484) [13:22:07] (PS3) Awight: [beta] Remove unused config overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/770906 [13:22:09] (PS3) Awight: [beta] Disable new sidebar and improved template search [mediawiki-config] - https://gerrit.wikimedia.org/r/770907 (https://phabricator.wikimedia.org/T286991) [13:22:29] WMDE-Fisch: Feel free to re-review [13:22:40] (PS3) Arturo Borrero Gonzalez: wmcs: openstack: reduce verbosity and clean up set-maintenance cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770888 [13:22:42] (PS2) Arturo Borrero Gonzalez: wmcs: openstack: reduce verbosity and clean up unset-maintenance cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/770909 [13:24:30] (CR) WMDE-Fisch: [C: +1] [beta] Remove unused config overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/770906 (owner: Awight) [13:24:40] (CR) WMDE-Fisch: [C: +1] "nice :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/770907 (https://phabricator.wikimedia.org/T286991) (owner: Awight) [13:24:47] awight: even better [13:26:41] (PS6) Giuseppe Lavagetto: varnish: get blocked-nets from etcd [puppet] - https://gerrit.wikimedia.org/r/770905 [13:26:59] (PS1) Elukey: Set simpler partman recipe for kubernetes200[5,6] [puppet] - https://gerrit.wikimedia.org/r/770912 (https://phabricator.wikimedia.org/T300744) [13:28:41] (CR) Awight: [C: +2] "deploying" [mediawiki-config] - https://gerrit.wikimedia.org/r/770906 (owner: Awight) [13:28:43] (CR) Svantje Lilienthal: [C: +1] [beta] Remove unused config overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/770906 (owner: Awight) [13:28:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T300775)', diff saved to https://phabricator.wikimedia.org/P22547 and previous config saved to 
/var/cache/conftool/dbconfig/20220315-132857-marostegui.json [13:28:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:29:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:03] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [13:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:17] (PS4) Awight: [beta] Disable improved template search [mediawiki-config] - https://gerrit.wikimedia.org/r/770907 (https://phabricator.wikimedia.org/T286991) [13:29:19] (Merged) jenkins-bot: [beta] Remove unused config overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/770906 (owner: Awight) [13:29:51] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34303/console" [puppet] - https://gerrit.wikimedia.org/r/770905 (owner: Giuseppe Lavagetto) [13:29:53] (CR) Giuseppe Lavagetto: varnish: get blocked-nets from etcd [puppet] - https://gerrit.wikimedia.org/r/770905 (owner: Giuseppe Lavagetto) [13:31:26] (CR) Awight: [C: +2] "Deploying." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/770907 (https://phabricator.wikimedia.org/T286991) (owner: Awight) [13:31:35] !log awight@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:770906|[beta] Remove unused config overrides]] (duration: 00m 49s) [13:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:00] (Merged) jenkins-bot: [beta] Disable improved template search [mediawiki-config] - https://gerrit.wikimedia.org/r/770907 (https://phabricator.wikimedia.org/T286991) (owner: Awight) [13:32:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P22548 and previous config saved to /var/cache/conftool/dbconfig/20220315-133241-ladsgroup.json [13:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:57] !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:770907|[beta] Disable improved template search (T286991, T302857)]] (duration: 00m 48s) [13:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:01] T286991: Deploy inline descriptions, extended sidebar and bigger dialog to all wikis (except enwiki) - https://phabricator.wikimedia.org/T286991 [13:33:02] T302857: Deploy first template focus-area improvements to enwiki - https://phabricator.wikimedia.org/T302857 [13:34:17] !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:770907|[beta] Disable improved template search (T286991, T302857)]] (take 2) (duration: 00m 50s) [13:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:11] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P22549 and previous config saved to /var/cache/conftool/dbconfig/20220315-133510-marostegui.json [13:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:35:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:46] !log EU deployment complete [13:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:38] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:39:01] (PS1) Muehlenhoff: More stretch tracking updates [puppet] - https://gerrit.wikimedia.org/r/770914 [13:41:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:35] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [13:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:13] (CR) Alexandros Kosiaris: [C: +1] Update point of contact for Thumbor [puppet] - https://gerrit.wikimedia.org/r/770910 (https://phabricator.wikimedia.org/T294484) (owner: Muehlenhoff) [13:42:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:42:38] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:38] (PS2) Elukey: Set simpler partman recipe for kubernetes200[5,6] [puppet] - https://gerrit.wikimedia.org/r/770912 (https://phabricator.wikimedia.org/T300744) [13:46:23] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 fails to pxe boot - https://phabricator.wikimedia.org/T303776 (Andrew) Hey @Cmjohnson and or @RobH can we get a firmware update on this host and also on cloudvirt1023 which is exhibiting a different out-of-date-firmware issue?... [13:47:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P22550 and previous config saved to /var/cache/conftool/dbconfig/20220315-134747-ladsgroup.json [13:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:14] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1024.eqiad.wmnet DHCP problems - https://phabricator.wikimedia.org/T303773 (ayounsi) >>! In T303773#7777034, @Aklapper wrote: > [Please add project tags under project tags instead of subscribers - thanks!] Off topic, but should we... 
[13:50:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P22551 and previous config saved to /var/cache/conftool/dbconfig/20220315-135015-marostegui.json [13:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:12] (03CR) 10Elukey: [C: 03+2] kserve-inference: fix custom image template variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/770895 (owner: 10Elukey) [13:53:19] (03CR) 10Alexandros Kosiaris: [C: 04-1] Set simpler partman recipe for kubernetes200[5,6] (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/770912 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [13:54:44] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1023.eqiad.wmnet with reason: host reimage [13:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:45] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Increase max.incremental.fetch.session.cache.slots on kafka jumbo to 2000 [puppet] - 10https://gerrit.wikimedia.org/r/770505 (https://phabricator.wikimedia.org/T303324) (owner: 10Ottomata) [13:57:13] (03CR) 10Elukey: Set simpler partman recipe for kubernetes200[5,6] (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/770912 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [13:57:40] (03CR) 10JMeybohm: [C: 04-1] "I mainly checked the datahub-fontend but I would assume that some of the comments apply to gms (maybe even consumer charts) as well. 
Pleas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [13:58:06] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1023.eqiad.wmnet with reason: host reimage [13:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:25] !log roll restarting kafka jumbo brokers to set max.incremental.fetch.session.cache.slots=2000 - T303324 [13:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:29] T303324: Increase max.incremental.fetch.session.cache.slots on Kafka jumbo eqiad - https://phabricator.wikimedia.org/T303324 [13:59:39] (03CR) 10Ladsgroup: [C: 03+1] Update point of contact for Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/770910 (https://phabricator.wikimedia.org/T294484) (owner: 10Muehlenhoff) [14:00:16] !log otto@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [14:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:38] openjdk upgrade? that's not whey :/ [14:00:39] why [14:00:40] oh well [14:00:56] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:38] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:55] 10SRE, 10Security-Team, 10Stewards-and-global-tools: Investigate the practice of making thousands of global blocks per day on Meta-Wiki - https://phabricator.wikimedia.org/T303774 (10Blablubbs) > Of course you could also take a slightly wider view and say that a list of global blocks on Wikimedia wikis provi... 
[14:02:09] ottomata: yes it is hardcoded in the cookbook, needs to be more general :) [14:02:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298743)', diff saved to https://phabricator.wikimedia.org/P22552 and previous config saved to /var/cache/conftool/dbconfig/20220315-140252-ladsgroup.json [14:02:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1143.eqiad.wmnet with reason: Maintenance [14:02:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1143.eqiad.wmnet with reason: Maintenance [14:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:56] T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743 [14:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T298743)', diff saved to https://phabricator.wikimedia.org/P22553 and previous config saved to /var/cache/conftool/dbconfig/20220315-140259-ladsgroup.json [14:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:06] (03PS2) 10Elukey: ml-services: update arwiki editquality predictor image [deployment-charts] - 10https://gerrit.wikimedia.org/r/770886 (https://phabricator.wikimedia.org/T301766) (owner: 10AikoChou) [14:05:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298557)', diff saved to https://phabricator.wikimedia.org/P22554 and previous config saved to /var/cache/conftool/dbconfig/20220315-140520-marostegui.json [14:05:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [14:05:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [14:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:25] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [14:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [14:06:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 10%: After schema change ', diff saved to https://phabricator.wikimedia.org/P22555 and previous config saved to /var/cache/conftool/dbconfig/20220315-140634-root.json [14:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298743)', diff saved to https://phabricator.wikimedia.org/P22556 and previous config saved to /var/cache/conftool/dbconfig/20220315-140723-ladsgroup.json [14:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [14:10:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [14:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 10 hosts with reason: Maintenance [14:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:50] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 10 hosts with reason: Maintenance [14:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:53] (03CR) 10Kevin Bazira: "Thank you for working on this, Aiko." [deployment-charts] - 10https://gerrit.wikimedia.org/r/770886 (https://phabricator.wikimedia.org/T301766) (owner: 10AikoChou) [14:20:15] (03CR) 10Elukey: [C: 03+2] ml-services: update arwiki editquality predictor image [deployment-charts] - 10https://gerrit.wikimedia.org/r/770886 (https://phabricator.wikimedia.org/T301766) (owner: 10AikoChou) [14:20:39] (03CR) 10Vgutierrez: [C: 04-1] "tests aren't working. confd fails to run:" [puppet] - 10https://gerrit.wikimedia.org/r/770905 (owner: 10Giuseppe Lavagetto) [14:21:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: After schema change ', diff saved to https://phabricator.wikimedia.org/P22557 and previous config saved to /var/cache/conftool/dbconfig/20220315-142138-root.json [14:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:01] (03CR) 10Vgutierrez: [C: 04-1] "conf-reload-vcl is currently being dropped by varnish::common::director_scripts, so we need to add that file to our Dockerfile provisionin" [puppet] - 10https://gerrit.wikimedia.org/r/770905 (owner: 10Giuseppe Lavagetto) [14:22:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P22558 and previous config saved to /var/cache/conftool/dbconfig/20220315-142228-ladsgroup.json [14:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:47] !log T303256 bking@cumin1001 restarting wdqs services `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-blazegraph` [14:22:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:50] T303256: WDQS servers should use skolem for wikibaseSomeValueMode - https://phabricator.wikimedia.org/T303256 [14:23:40] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1023.eqiad.wmnet with OS bullseye [14:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:03] (03Merged) 10jenkins-bot: ml-services: update arwiki editquality predictor image [deployment-charts] - 10https://gerrit.wikimedia.org/r/770886 (https://phabricator.wikimedia.org/T301766) (owner: 10AikoChou) [14:24:27] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1068.eqiad.wmnet with OS stretch [14:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:32] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-be1068.eqiad.wmnet with OS stretch [14:31:03] (03PS1) 10Andrew Bogott: Rename nics for cloudvirt1023 [puppet] - 10https://gerrit.wikimedia.org/r/770922 (https://phabricator.wikimedia.org/T281276) [14:31:58] (03PS8) 10Ottomata: Standardize the stats system user uid [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) [14:32:20] !log ntsako@deploy1002 Started deploy [airflow-dags/analytics@2924232]: (no justification provided) [14:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:28] !log ntsako@deploy1002 Finished deploy [airflow-dags/analytics@2924232]: (no justification provided) (duration: 00m 08s) [14:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:25] (03CR) 10Ottomata: [C: 03+2] Standardize the stats system user uid [puppet] - 10https://gerrit.wikimedia.org/r/725098 
(https://phabricator.wikimedia.org/T291384) (owner: 10Ottomata) [14:36:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: After schema change ', diff saved to https://phabricator.wikimedia.org/P22559 and previous config saved to /var/cache/conftool/dbconfig/20220315-143642-root.json [14:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:02] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1068.eqiad.wmnet with reason: host reimage [14:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:07] !log otto@cumin1001 END (ERROR) - Cookbook sre.kafka.roll-restart-brokers (exit_code=97) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [14:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P22560 and previous config saved to /var/cache/conftool/dbconfig/20220315-143733-ladsgroup.json [14:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:40] !log accidental cancel of roll restart brokers, re-doing - T303324 [14:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:43] T303324: Increase max.incremental.fetch.session.cache.slots on Kafka jumbo eqiad - https://phabricator.wikimedia.org/T303324 [14:38:11] (03CR) 10Andrew Bogott: [C: 03+2] Rename nics for cloudvirt1023 [puppet] - 10https://gerrit.wikimedia.org/r/770922 (https://phabricator.wikimedia.org/T281276) (owner: 10Andrew Bogott) [14:38:51] !log all brokers except kafka-jumbo1001 were successfully roll restarted, doing kafka-jumbo1001 manually - T303324 [14:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:32] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran
'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:34] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1068.eqiad.wmnet with reason: host reimage [14:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:21] !log I read the cumin output wrong, kafka-jumbo1001 and 1002 restarted successfully before accidental ctrl-c on cumin command. Restarting the full jumbo roll-restart to thoroughly do them all - T303324 [14:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:14] !log otto@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [14:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:15] (03PS1) 10Jbond: P:java: Use lower case titles [puppet] - 10https://gerrit.wikimedia.org/r/770923 [14:48:48] (03CR) 10Jbond: [C: 03+2] P:java: Use lower case titles [puppet] - 10https://gerrit.wikimedia.org/r/770923 (owner: 10Jbond) [14:49:22] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 23): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34305/console" [puppet] - 10https://gerrit.wikimedia.org/r/770923 (owner: 10Jbond) [14:49:45] !log ntsako@deploy1002 Started deploy [airflow-dags/analytics@88d5618]: (no justification provided) [14:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:53] !log ntsako@deploy1002 Finished deploy [airflow-dags/analytics@88d5618]: (no justification provided) (duration: 00m 07s) [14:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:27] !log installing postgresql-11 security updates [14:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:51:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: After schema change ', diff saved to https://phabricator.wikimedia.org/P22561 and previous config saved to /var/cache/conftool/dbconfig/20220315-145146-root.json [14:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298743)', diff saved to https://phabricator.wikimedia.org/P22562 and previous config saved to /var/cache/conftool/dbconfig/20220315-145238-ladsgroup.json [14:52:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:52:42] T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743 [14:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T298743)', diff saved to https://phabricator.wikimedia.org/P22563 and previous config saved to /var/cache/conftool/dbconfig/20220315-145246-ladsgroup.json [14:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298743)', diff saved to https://phabricator.wikimedia.org/P22564 and previous config saved to /var/cache/conftool/dbconfig/20220315-150116-ladsgroup.json [15:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:21] T298743: Apply alter for transcode_time_* columns on wmf wikis - 
https://phabricator.wikimedia.org/T298743 [15:04:35] RECOVERY - ats-tls HTTPS wikiworkshop.org ECDSA on cp6011 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 179725 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-06-03 16:37:41 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:04:53] RECOVERY - ats-tls HTTPS wikiworkshop.org RSA on cp6011 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 179706 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-06-03 16:37:41 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:06:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: After schema change ', diff saved to https://phabricator.wikimedia.org/P22565 and previous config saved to /var/cache/conftool/dbconfig/20220315-150649-root.json [15:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:25] RECOVERY - puppet last run on cp6011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:09:35] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@f01214c]: (no justification provided) [15:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:42] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@f01214c]: (no justification provided) (duration: 00m 07s) [15:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [15:12:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [15:12:02] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T298563)', diff saved to https://phabricator.wikimedia.org/P22566 and previous config saved to /var/cache/conftool/dbconfig/20220315-151206-marostegui.json [15:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:10] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [15:14:17] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: build2001, cloudcontrol1003, cloudcontrol1004, cloudcontrol1005, cp6011 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [15:14:19] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: build2001, cloudcontrol1003, cloudcontrol1004, cloudcontrol1005, cp6011 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [15:15:10] (03PS3) 10JHathaway: Prepare to move vendored modules to vendor_modules [puppet] - 10https://gerrit.wikimedia.org/r/770099 (https://phabricator.wikimedia.org/T302423) [15:16:04] (03CR) 10JHathaway: Prepare to move vendored modules to vendor_modules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/770099 (https://phabricator.wikimedia.org/T302423) (owner: 10JHathaway) [15:16:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P22567 and previous config saved to /var/cache/conftool/dbconfig/20220315-151621-ladsgroup.json [15:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:58] (03PS4) 10JHathaway: Prepare to move 
vendored modules to vendor_modules [puppet] - 10https://gerrit.wikimedia.org/r/770099 (https://phabricator.wikimedia.org/T302423) [15:18:05] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:18:07] !log installing Java updates on wcqs*/wdqs* hosts [15:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:03] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:22:07] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/770892 (owner: 10Jbond) [15:24:58] (03PS1) 10DCausse: [wdqs] cleanup the udpater setup logic [puppet] - 10https://gerrit.wikimedia.org/r/770951 (https://phabricator.wikimedia.org/T301108) [15:25:32] (03CR) 10jerkins-bot: [V: 04-1] [wdqs] cleanup the udpater setup logic [puppet] - 10https://gerrit.wikimedia.org/r/770951 (https://phabricator.wikimedia.org/T301108) (owner: 10DCausse) [15:26:44] (03CR) 10Muehlenhoff: [C: 03+2] More stretch tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/770914 (owner: 10Muehlenhoff) [15:27:08] (03PS2) 10DCausse: [wdqs] cleanup the udpater setup logic [puppet] - 10https://gerrit.wikimedia.org/r/770951 (https://phabricator.wikimedia.org/T301108) [15:27:10] (03PS1) 10Jbond: C:java: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/770952 [15:28:09] (03CR) 10jerkins-bot: [V: 04-1] C:java: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/770952 (owner: 10Jbond) [15:29:01] (03CR) 10David Caro: [C: 03+1] wmcs: openstack: reduce verbosity and clean up unset-maintenance cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770909 (owner: 10Arturo Borrero Gonzalez) [15:29:10] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [15:29:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [15:29:12] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/770951 (https://phabricator.wikimedia.org/T301108) (owner: 10DCausse) [15:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T298557)', diff saved to https://phabricator.wikimedia.org/P22568 and previous config saved to /var/cache/conftool/dbconfig/20220315-152916-marostegui.json [15:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:20] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [15:29:21] (03CR) 10David Caro: [C: 03+1] wmcs: openstack: reduce verbosity and clean up set-maintenance cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770888 (owner: 10Arturo Borrero Gonzalez) [15:29:42] (03PS3) 10DCausse: [wdqs] cleanup the udpater setup logic [puppet] - 10https://gerrit.wikimedia.org/r/770951 (https://phabricator.wikimedia.org/T301108) [15:31:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P22569 and previous config saved to /var/cache/conftool/dbconfig/20220315-153126-ladsgroup.json [15:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:55] (03PS2) 10Jbond: C:java: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/770952 [15:32:37] (03CR) 10jerkins-bot: [V: 04-1] C:java: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/770952 (owner: 
10Jbond) [15:35:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 23): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34308/console" [puppet] - 10https://gerrit.wikimedia.org/r/770952 (owner: 10Jbond) [15:35:25] (03PS4) 10Elukey: O:kafka::logging: ensure that all base classes are initiated first [puppet] - 10https://gerrit.wikimedia.org/r/770892 (owner: 10Jbond) [15:36:13] (03PS3) 10Jbond: C:java: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/770952 [15:38:09] (03CR) 10Jbond: [C: 03+2] C:java: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/770952 (owner: 10Jbond) [15:38:34] (03CR) 10Elukey: [C: 03+1] C:java: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/770952 (owner: 10Jbond) [15:39:40] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (DIFF 23): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34309/console" [puppet] - 10https://gerrit.wikimedia.org/r/770952 (owner: 10Jbond) [15:39:51] (03PS1) 10Jbond: Revert "C:java: update dependencies" [puppet] - 10https://gerrit.wikimedia.org/r/770927 [15:39:59] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "C:java: update dependencies" [puppet] - 10https://gerrit.wikimedia.org/r/770927 (owner: 10Jbond) [15:40:20] jbond: :( [15:40:54] (03PS1) 10Jbond: C:java: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/770928 [15:46:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298563)', diff saved to https://phabricator.wikimedia.org/P22570 and previous config saved to /var/cache/conftool/dbconfig/20220315-154610-marostegui.json [15:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:15] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [15:46:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298743)', diff saved to 
https://phabricator.wikimedia.org/P22571 and previous config saved to /var/cache/conftool/dbconfig/20220315-154631-ladsgroup.json [15:46:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1147.eqiad.wmnet with reason: Maintenance [15:46:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1147.eqiad.wmnet with reason: Maintenance [15:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:35] T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743 [15:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T298743)', diff saved to https://phabricator.wikimedia.org/P22572 and previous config saved to /var/cache/conftool/dbconfig/20220315-154639-ladsgroup.json [15:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298743)', diff saved to https://phabricator.wikimedia.org/P22573 and previous config saved to /var/cache/conftool/dbconfig/20220315-155102-ladsgroup.json [15:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:36] !log updating Exim on mx1001 T303738 [15:53:44] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01171 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:22] (03CR) 10Elukey: "What is the difference with https://gerrit.wikimedia.org/r/c/operations/puppet/+/770927 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/770928 (owner: 10Jbond) [15:55:47] (03PS2) 10Jbond: C:java: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/770928 [15:57:06] 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): Is it possible to put more RAM in cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet? - https://phabricator.wikimedia.org/T303840 (10Majavah) [15:57:26] jbond: apparently there are dependency cycles with the last puppet patch [15:57:37] for java__cacert_wmf [15:57:44] see https://puppetboard.wikimedia.org/nodes?status=failed [15:58:45] (03CR) 10Jbond: [C: 03+2] C:java: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/770928 (owner: 10Jbond) [16:00:04] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220315T1600). [16:00:04] zabe: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:53] o/ [16:01:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P22574 and previous config saved to /var/cache/conftool/dbconfig/20220315-160116-marostegui.json [16:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T298557)', diff saved to https://phabricator.wikimedia.org/P22575 and previous config saved to /var/cache/conftool/dbconfig/20220315-160226-marostegui.json [16:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:35] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [16:03:32] zabe: hey, sorry, I have a meeting conflict today -- I hope jbond can take care of you, otherwise I'll have a look later on [16:03:53] no problem, I'm around :) [16:06:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P22576 and previous config saved to /var/cache/conftool/dbconfig/20220315-160607-ladsgroup.json [16:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:11] (03CR) 10Muehlenhoff: [C: 03+2] Require Python 3.7/buster for logout scripts [puppet] - 10https://gerrit.wikimedia.org/r/767064 (owner: 10Muehlenhoff) [16:11:18] 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): Is it possible to put more RAM in cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet? - https://phabricator.wikimedia.org/T303840 (10Papaul) 1. Yes there is physical room fir RAM expansion. A1,A2 and B1,B2 each has 32GB so you the option to use... 
[16:13:23] (03PS2) 10AOkoth: vrts: rename mail module class variables [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) [16:13:46] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001597 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:14:28] (03PS1) 10JHathaway: Move vendored modules to vendor_modules [puppet] - 10https://gerrit.wikimedia.org/r/770960 (https://phabricator.wikimedia.org/T302423) [16:14:37] (03CR) 10AOkoth: vrts: rename mail module class variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769998 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [16:15:15] (03PS2) 10JHathaway: Move vendored modules to vendor_modules [puppet] - 10https://gerrit.wikimedia.org/r/770960 (https://phabricator.wikimedia.org/T302423) [16:16:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P22577 and previous config saved to /var/cache/conftool/dbconfig/20220315-161621-marostegui.json [16:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:32] (03CR) 10jerkins-bot: [V: 04-1] Move vendored modules to vendor_modules [puppet] - 10https://gerrit.wikimedia.org/r/770960 (https://phabricator.wikimedia.org/T302423) (owner: 10JHathaway) [16:16:58] (03CR) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [16:17:06] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:17:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to 
https://phabricator.wikimedia.org/P22578 and previous config saved to /var/cache/conftool/dbconfig/20220315-161732-marostegui.json [16:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:03] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.38.0-wmf.26 refs T300202 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770961 [16:20:05] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.38.0-wmf.26 refs T300202 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770961 (owner: 10Jeena Huneidi) [16:20:48] (03Merged) 10jenkins-bot: testwikis wikis to 1.38.0-wmf.26 refs T300202 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770961 (owner: 10Jeena Huneidi) [16:20:51] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.26 refs T300202 [16:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:55] T300202: 1.38.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T300202 [16:21:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P22579 and previous config saved to /var/cache/conftool/dbconfig/20220315-162113-ladsgroup.json [16:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:17] zabe: available after all! 
looking [16:22:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: openstack: reduce verbosity and clean up set-maintenance cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770888 (owner: 10Arturo Borrero Gonzalez) [16:22:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: openstack: reduce verbosity and clean up unset-maintenance cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770909 (owner: 10Arturo Borrero Gonzalez) [16:23:49] (03CR) 10RLazarus: [C: 03+2] wikitech_private: stop writing to wmf* constants [puppet] - 10https://gerrit.wikimedia.org/r/770102 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [16:24:22] (03CR) 10Alexandros Kosiaris: [C: 03+1] Set simpler partman recipe for kubernetes200[5,6] (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/770912 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [16:24:25] zabe: merging now -- to test this, I'll just run puppet on the cloudweb2001-dev and labweb*, and make sure wikitech is still there -- right? 
:) [16:25:08] yeah [16:25:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:42] (03PS1) 10Jbond: C:java: use correct defined call [puppet] - 10https://gerrit.wikimedia.org/r/770964 [16:26:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:26:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:04] (03CR) 10Jbond: [C: 03+2] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/770880 (owner: 10Majavah) [16:28:15] (03PS1) 10Cathal Mooney: Change _get_underlay_ints() to use fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/770966 (https://phabricator.wikimedia.org/T299758) [16:28:37] zabe: merged and deployed, lgty? [16:28:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 23): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34312/console" [puppet] - 10https://gerrit.wikimedia.org/r/770964 (owner: 10Jbond) [16:29:15] yes, thanks for your help [16:29:26] 👍 thanks for the patch! sorry about the delay [16:29:38] (03CR) 10Volans: [C: 03+1] "lgtm, thanks for the fix" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/770966 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [16:30:53] 10SRE, 10Security-Team, 10Stewards-and-global-tools: Investigate the practice of making thousands of global blocks per day on Meta-Wiki - https://phabricator.wikimedia.org/T303774 (10Tks4Fish) >>! 
In T303774#7776806, @MZMcBride wrote: > That's a good question. @Tks4Fish may know why we aren't using a dedicat... [16:31:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298563)', diff saved to https://phabricator.wikimedia.org/P22580 and previous config saved to /var/cache/conftool/dbconfig/20220315-163126-marostegui.json [16:31:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [16:31:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [16:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:31] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [16:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298563)', diff saved to https://phabricator.wikimedia.org/P22581 and previous config saved to /var/cache/conftool/dbconfig/20220315-163134-marostegui.json [16:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P22582 and previous config saved to /var/cache/conftool/dbconfig/20220315-163238-marostegui.json [16:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:03] (03CR) 10Jbond: gitlab_runner: restrict docker traffic with additional ferm rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:33:10] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:java: use correct defined call 
[puppet] - 10https://gerrit.wikimedia.org/r/770964 (owner: 10Jbond) [16:35:24] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/770951 (https://phabricator.wikimedia.org/T301108) (owner: 10DCausse) [16:35:33] (03PS7) 10Giuseppe Lavagetto: varnish: get blocked-nets from etcd [puppet] - 10https://gerrit.wikimedia.org/r/770905 [16:35:42] (03CR) 10Giuseppe Lavagetto: varnish: get blocked-nets from etcd (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/770905 (owner: 10Giuseppe Lavagetto) [16:36:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298743)', diff saved to https://phabricator.wikimedia.org/P22583 and previous config saved to /var/cache/conftool/dbconfig/20220315-163618-ladsgroup.json [16:36:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1148.eqiad.wmnet with reason: Maintenance [16:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1148.eqiad.wmnet with reason: Maintenance [16:36:23] T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743 [16:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T298743)', diff saved to https://phabricator.wikimedia.org/P22584 and previous config saved to /var/cache/conftool/dbconfig/20220315-163626-ladsgroup.json [16:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:50] (03CR) 10Cathal Mooney: [C: 03+2] Change _get_underlay_ints() to use fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/770966 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) 
[16:38:53] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Change _get_underlay_ints() to use fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/770966 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [16:40:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298743)', diff saved to https://phabricator.wikimedia.org/P22585 and previous config saved to /var/cache/conftool/dbconfig/20220315-164053-ladsgroup.json [16:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:10] PROBLEM - Ensure local MW versions match expected deployment on mw1415 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [16:46:20] PROBLEM - Ensure local MW versions match expected deployment on deploy2002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [16:46:27] 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team, 10serviceops: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (10Joe) a:03Joe [16:46:30] PROBLEM - Ensure local MW versions match expected deployment on mw1418 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [16:46:58] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) Thanks for this task! Now that we have a clear path forward in T298087, it makes sens to focus on this one as wel... 
[16:47:22] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2022-03-15-002555-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/770638 (https://phabricator.wikimedia.org/T268774) (owner: 10BryanDavis) [16:47:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T298557)', diff saved to https://phabricator.wikimedia.org/P22586 and previous config saved to /var/cache/conftool/dbconfig/20220315-164743-marostegui.json [16:47:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [16:47:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [16:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:48] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [16:47:49] (03PS3) 10JHathaway: Move vendored modules to vendor_modules [puppet] - 10https://gerrit.wikimedia.org/r/770960 (https://phabricator.wikimedia.org/T302423) [16:47:50] PROBLEM - Ensure local MW versions match expected deployment on mw1447 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [16:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:51] (03PS1) 10JHathaway: rsync: fix rubocop style violations [puppet] - 10https://gerrit.wikimedia.org/r/770969 (https://phabricator.wikimedia.org/T302423) [16:47:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298557)', diff saved to https://phabricator.wikimedia.org/P22587 and previous config saved to /var/cache/conftool/dbconfig/20220315-164751-marostegui.json [16:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:48:00] PROBLEM - Ensure local MW versions match expected deployment on mw1319 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [16:48:08] PROBLEM - Ensure local MW versions match expected deployment on mw1450 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [16:48:30] PROBLEM - Ensure local MW versions match expected deployment on mw1414 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [16:48:45] (03PS3) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) [16:49:13] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34315/console" [puppet] - 10https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197) (owner: 10Klausman) [16:51:20] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34316/console" [puppet] - 10https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197) (owner: 10Klausman) [16:51:27] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2022-03-15-002555-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/770638 (https://phabricator.wikimedia.org/T268774) (owner: 10BryanDavis) [16:51:53] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34317/console" [puppet] - 10https://gerrit.wikimedia.org/r/770905 (owner: 10Giuseppe Lavagetto) [16:52:02] RECOVERY - Ensure local MW versions match expected deployment on mw1415 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [16:52:39] (03CR) 10Vgutierrez: [V: 03+1] "```" 
[puppet] - 10https://gerrit.wikimedia.org/r/770905 (owner: 10Giuseppe Lavagetto) [16:53:02] (03CR) 10Klausman: Add etcd setup for ML staging cluster in codfw [puppet] - 10https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197) (owner: 10Klausman) [16:53:12] RECOVERY - Ensure local MW versions match expected deployment on deploy2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [16:53:22] RECOVERY - Ensure local MW versions match expected deployment on mw1418 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [16:54:44] RECOVERY - Ensure local MW versions match expected deployment on mw1447 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [16:54:54] RECOVERY - Ensure local MW versions match expected deployment on mw1319 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [16:55:02] RECOVERY - Ensure local MW versions match expected deployment on mw1450 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [16:55:22] RECOVERY - Ensure local MW versions match expected deployment on mw1414 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [16:55:58] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] varnish: get blocked-nets from etcd [puppet] - 10https://gerrit.wikimedia.org/r/770905 (owner: 10Giuseppe Lavagetto) [16:55:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P22588 and previous config saved to /var/cache/conftool/dbconfig/20220315-165558-ladsgroup.json [16:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:19] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: k8s: add cookbook to automate deploying custom components [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770971 
(https://phabricator.wikimedia.org/T291915) [16:57:34] (03CR) 10Klausman: [C: 03+2] Add etcd setup for ML staging cluster in codfw [puppet] - 10https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197) (owner: 10Klausman) [16:57:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:28] (03CR) 10Jbond: [C: 03+1] "LGTM, I think this change will trigger a reload of apache on puppet which can cause a bunch of puppet failures. as such i would do the fo" [puppet] - 10https://gerrit.wikimedia.org/r/770099 (https://phabricator.wikimedia.org/T302423) (owner: 10JHathaway) [16:58:40] (03CR) 10Elukey: Add etcd setup for ML staging cluster in codfw (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/770522 (https://phabricator.wikimedia.org/T302197) (owner: 10Klausman) [16:58:57] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/770969 (https://phabricator.wikimedia.org/T302423) (owner: 10JHathaway) [16:59:45] !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.38.0-wmf.26 refs T300202 (duration: 38m 54s) [16:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:49] T300202: 1.38.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T300202 [17:00:01] (03CR) 10Jbond: "LGTM but lets get a pcc first (i still didn't get to check the pcc code yet)" [puppet] - 10https://gerrit.wikimedia.org/r/770960 (https://phabricator.wikimedia.org/T302423) (owner: 10JHathaway) [17:00:04] bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Toolhub. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220315T1700). 
[17:01:31] !log jhuneidi@deploy1002 Pruned MediaWiki: 1.38.0-wmf.24 (duration: 01m 32s) [17:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance [17:01:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance [17:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T300775)', diff saved to https://phabricator.wikimedia.org/P22589 and previous config saved to /var/cache/conftool/dbconfig/20220315-170201-marostegui.json [17:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:05] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [17:03:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: toolforge: k8s: add cookbook to automate deploying custom components [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770971 (https://phabricator.wikimedia.org/T291915) (owner: 10Arturo Borrero Gonzalez) [17:04:08] 10SRE, 10Security-Team, 10Stewards-and-global-tools: Investigate the practice of making thousands of global blocks per day on Meta-Wiki - https://phabricator.wikimedia.org/T303774 (10Bugreporter) >>! In T303774#7778438, @Tks4Fish wrote: >>>! In T303774#7776806, @MZMcBride wrote: >> That's a good question. @T... 
[17:04:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:04:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:52] (03PS1) 10Klausman: Fix directory structure/role name for ML staging etcd [puppet] - 10https://gerrit.wikimedia.org/r/770973 (https://phabricator.wikimedia.org/T302197) [17:06:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298563)', diff saved to https://phabricator.wikimedia.org/P22590 and previous config saved to /var/cache/conftool/dbconfig/20220315-170614-marostegui.json [17:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:19] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [17:08:46] (03PS2) 10Klausman: Fix directory structure/role name for ML staging etcd [puppet] - 10https://gerrit.wikimedia.org/r/770973 (https://phabricator.wikimedia.org/T302197) [17:09:56] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/pcc-worker1001/34319/ml-staging-etcd2001.codfw.wmnet/change.ml-staging-etcd2001.codfw.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/770973 (https://phabricator.wikimedia.org/T302197) (owner: 10Klausman) [17:10:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P22591 and previous config saved to /var/cache/conftool/dbconfig/20220315-171103-ladsgroup.json [17:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:18] 
10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team, 10serviceops: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (10Joe) I would tentatively grant full docker access to the people listed above, but to be revisited at a later time, maybe creating a... [17:11:52] (03CR) 10Elukey: Fix directory structure/role name for ML staging etcd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/770973 (https://phabricator.wikimedia.org/T302197) (owner: 10Klausman) [17:12:19] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [17:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:30] (03PS1) 10Giuseppe Lavagetto: admin: add releng to docker group on deployment [puppet] - 10https://gerrit.wikimedia.org/r/770976 (https://phabricator.wikimedia.org/T303450) [17:12:32] (03PS3) 10Klausman: Fix directory structure/role name for ML staging etcd [puppet] - 10https://gerrit.wikimedia.org/r/770973 (https://phabricator.wikimedia.org/T302197) [17:13:28] (03PS4) 10Klausman: Fix directory structure/role name for ML staging etcd [puppet] - 10https://gerrit.wikimedia.org/r/770973 (https://phabricator.wikimedia.org/T302197) [17:13:37] (03CR) 10Klausman: Fix directory structure/role name for ML staging etcd (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/770973 (https://phabricator.wikimedia.org/T302197) (owner: 10Klausman) [17:14:10] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34321/console" [puppet] - 10https://gerrit.wikimedia.org/r/770976 (https://phabricator.wikimedia.org/T303450) (owner: 10Giuseppe Lavagetto) [17:15:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:03] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34323/console" [puppet] - 10https://gerrit.wikimedia.org/r/770973 (https://phabricator.wikimedia.org/T302197) (owner: 10Klausman) [17:16:33] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34324/console" [puppet] - 10https://gerrit.wikimedia.org/r/770973 (https://phabricator.wikimedia.org/T302197) (owner: 10Klausman) [17:17:30] (03CR) 10Klausman: [V: 03+1 C: 03+2] Fix directory structure/role name for ML staging etcd [puppet] - 10https://gerrit.wikimedia.org/r/770973 (https://phabricator.wikimedia.org/T302197) (owner: 10Klausman) [17:17:55] (ProbeHttpFailed) firing: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [17:18:38] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:20:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298557)', diff saved to https://phabricator.wikimedia.org/P22592 and previous config saved to /var/cache/conftool/dbconfig/20220315-172027-marostegui.json [17:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:31] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [17:21:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P22593 and previous config saved to /var/cache/conftool/dbconfig/20220315-172119-marostegui.json [17:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [17:22:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:55] (ProbeHttpFailed) resolved: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [17:24:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org [17:25:47] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [17:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298743)', diff saved to https://phabricator.wikimedia.org/P22594 and previous config saved to /var/cache/conftool/dbconfig/20220315-172608-ladsgroup.json [17:26:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1149.eqiad.wmnet with reason: Maintenance [17:26:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1149.eqiad.wmnet with reason: Maintenance [17:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:14] T298743: Apply alter for transcode_time_* columns on wmf wikis - 
https://phabricator.wikimedia.org/T298743 [17:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T298743)', diff saved to https://phabricator.wikimedia.org/P22595 and previous config saved to /var/cache/conftool/dbconfig/20220315-172616-ladsgroup.json [17:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:16] (03PS4) 10DCausse: [wdqs] cleanup the udpater setup logic [puppet] - 10https://gerrit.wikimedia.org/r/770951 (https://phabricator.wikimedia.org/T301108) [17:28:18] (03PS1) 10DCausse: [wdqs] add jvmquake options to wdqs1010 for testing [puppet] - 10https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) [17:28:55] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: ganeti4002 dimm error - https://phabricator.wikimedia.org/T303318 (10RobH) You have successfully submitted request SR1087324652. [17:29:48] (03PS1) 10Klausman: Add missing ml staging etcd cert [puppet] - 10https://gerrit.wikimedia.org/r/770980 [17:32:27] (03CR) 10Elukey: [C: 03+1] Add missing ml staging etcd cert [puppet] - 10https://gerrit.wikimedia.org/r/770980 (owner: 10Klausman) [17:33:28] (03CR) 10Klausman: [C: 03+2] Add missing ml staging etcd cert [puppet] - 10https://gerrit.wikimedia.org/r/770980 (owner: 10Klausman) [17:34:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org [17:35:24] (03PS20) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. 
[puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) [17:35:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P22596 and previous config saved to /var/cache/conftool/dbconfig/20220315-173532-marostegui.json [17:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:14] (03PS9) 10Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) [17:36:17] (03PS1) 10Jbond: PoC: pass parameteres from hiera to idp vhost [puppet] - 10https://gerrit.wikimedia.org/r/770981 [17:36:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P22597 and previous config saved to /var/cache/conftool/dbconfig/20220315-173625-marostegui.json [17:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:48] (03CR) 10jerkins-bot: [V: 04-1] PoC: pass parameteres from hiera to idp vhost [puppet] - 10https://gerrit.wikimedia.org/r/770981 (owner: 10Jbond) [17:38:44] (03CR) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [17:39:06] (03PS1) 10DCausse: team-search-platform: relax RdfStreamingUpdaterFlinkProcessingLatencyIsHigh [alerts] - 10https://gerrit.wikimedia.org/r/770982 [17:41:54] (03CR) 10jerkins-bot: [V: 04-1] team-search-platform: relax RdfStreamingUpdaterFlinkProcessingLatencyIsHigh [alerts] - 10https://gerrit.wikimedia.org/r/770982 (owner: 10DCausse) [17:43:03] (03PS1) 10Hnowlan: install_server: add reimage role for sessionstore [puppet] - 10https://gerrit.wikimedia.org/r/770984 (https://phabricator.wikimedia.org/T303833) [17:43:24] (03PS1) 10Jbond: Revert "C:java: use correct defined call" [puppet] - 10https://gerrit.wikimedia.org/r/770930 [17:44:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34327/console" [puppet] - 10https://gerrit.wikimedia.org/r/770930 (owner: 10Jbond) [17:44:19] (03PS2) 10DCausse: team-search-platform: relax RdfStreamingUpdaterFlinkProcessingLatencyIsHigh [alerts] - 10https://gerrit.wikimedia.org/r/770982 [17:45:19] (03CR) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [17:46:11] (03PS21) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. 
[puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) [17:46:13] (03PS10) 10Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) [17:46:20] (03CR) 10DCausse: [C: 03+1] wdqs: fix data-transfer usage comment [cookbooks] - 10https://gerrit.wikimedia.org/r/770614 (owner: 10Ryan Kemper) [17:46:43] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) (owner: 10DCausse) [17:46:50] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/770951 (https://phabricator.wikimedia.org/T301108) (owner: 10DCausse) [17:48:12] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: k8s: add cookbook to automate deploying components [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770986 (https://phabricator.wikimedia.org/T291915) [17:48:17] (03CR) 10Jbond: [C: 03+2] O:kafka::logging: ensure that all base classes are initiated first [puppet] - 10https://gerrit.wikimedia.org/r/770892 (owner: 10Jbond) [17:49:07] (03PS8) 10Giuseppe Lavagetto: varnish: get blocked-nets from etcd [puppet] - 10https://gerrit.wikimedia.org/r/770905 [17:49:09] (03PS22) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. 
[puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) [17:49:11] (03PS11) 10Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) [17:50:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P22598 and previous config saved to /var/cache/conftool/dbconfig/20220315-175037-marostegui.json [17:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298563)', diff saved to https://phabricator.wikimedia.org/P22599 and previous config saved to /var/cache/conftool/dbconfig/20220315-175130-marostegui.json [17:51:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [17:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [17:51:34] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [17:51:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:51:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298563)', diff saved to https://phabricator.wikimedia.org/P22600 and previous config saved to /var/cache/conftool/dbconfig/20220315-175143-marostegui.json [17:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:26] !log otto@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [17:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:17] (03PS23) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) [17:53:19] (03PS12) 10Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) [17:55:36] For anyone keeping score at home, the Toolhub deploy stopped at the staging cluster due to a dependent library making a breaking change against semver. The fix for that is working through CI now. I will resume my deploy work after that merges and I eat some lunch. 
:) [17:57:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: toolforge: k8s: add cookbook to automate deploying components [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770986 (https://phabricator.wikimedia.org/T291915) (owner: 10Arturo Borrero Gonzalez) [17:57:36] !log power down mr1-ulsfo for replacement [17:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:15] 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team, 10serviceops, 10Patch-For-Review: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (10Majavah) Is there a reason this is limited to the releng team? T297673 says that all scap syncs will be build... [18:00:04] jeena and dancy: May I have your attention please! MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220315T1800) [18:00:14] (03PS2) 10Jbond: Revert "C:java: use correct defined call" [puppet] - 10https://gerrit.wikimedia.org/r/770930 [18:00:36] (03Merged) 10jenkins-bot: wmcs: toolforge: k8s: add cookbook to automate deploying components [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/770986 (https://phabricator.wikimedia.org/T291915) (owner: 10Arturo Borrero Gonzalez) [18:00:55] PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [18:02:25] (03PS1) 10Jeena Huneidi: group0 wikis to 1.38.0-wmf.26 refs T300202 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770990 [18:02:27] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.38.0-wmf.26 refs T300202 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770990 (owner: 10Jeena Huneidi) [18:02:55] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:03:14] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.26 refs T300202 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/770990 (owner: 10Jeena Huneidi) [18:03:37] 
(03PS3) 10Jbond: Revert "C:java: use correct defined call" [puppet] - 10https://gerrit.wikimedia.org/r/770930 [18:04:34] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.26 refs T300202 [18:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:38] T300202: 1.38.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T300202 [18:05:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298557)', diff saved to https://phabricator.wikimedia.org/P22601 and previous config saved to /var/cache/conftool/dbconfig/20220315-180542-marostegui.json [18:05:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:05:46] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [18:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [18:06:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:10] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1069.eqiad.wmnet with OS buster [18:08:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:16] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-be1069.eqiad.wmnet with OS buster [18:08:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298743)', diff saved to https://phabricator.wikimedia.org/P22602 and previous config saved to /var/cache/conftool/dbconfig/20220315-180850-ladsgroup.json [18:08:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1070.eqiad.wmnet with OS buster [18:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:54] T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743 [18:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:56] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-be1070.eqiad.wmnet with OS buster [18:09:26] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1071.eqiad.wmnet with OS buster [18:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:31] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-be1071.eqiad.wmnet with OS buster [18:09:43] PROBLEM - Host cp4021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:09:43] PROBLEM - Host cp4022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:09:43] PROBLEM - 
Host cp4024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:09:43] PROBLEM - Host cp4023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:09:43] PROBLEM - Host cp4025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:09:43] PROBLEM - Host cp4026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:09:43] PROBLEM - Host cp4027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:09:49] (03PS5) 10Ryan Kemper: [wdqs] cleanup the updater setup logic [puppet] - 10https://gerrit.wikimedia.org/r/770951 (https://phabricator.wikimedia.org/T301108) (owner: 10DCausse) [18:10:15] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [18:10:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:10:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:53] 10SRE, 10Thumbor, 10serviceops: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10WDoranWMF) [18:12:46] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10WDoranWMF) [18:13:27] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6009.drmrs.wmnet with OS buster [18:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:36] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - 
https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6009.drmrs.wmnet with OS buster [18:14:30] 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team, 10serviceops, 10Patch-For-Review: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (10dancy) @Majavah The container build stuff happens under `sudo -u mwbuilder`, and `mwbuilder` does have permis... [18:14:41] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (10WDoranWMF) [18:15:16] (03PS3) 10Ryan Kemper: wdqs: fix data-transfer usage comment [cookbooks] - 10https://gerrit.wikimedia.org/r/770614 [18:15:25] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (10WDoranWMF) [18:15:32] 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team, 10serviceops, 10Patch-For-Review: Add some users to the docker group on deployment servers - https://phabricator.wikimedia.org/T303450 (10dancy) To clarify further, releng needs direct access to dockerd (without having to sudo) while we're debuggi... 
[18:19:43] RECOVERY - Host cp4029.mgmt is UP: PING WARNING - Packet loss = 33%, RTA = 76.10 ms [18:19:43] RECOVERY - Host cp4033.mgmt is UP: PING WARNING - Packet loss = 33%, RTA = 74.68 ms [18:19:43] RECOVERY - Host cp4023.mgmt is UP: PING WARNING - Packet loss = 71%, RTA = 73.97 ms [18:19:43] RECOVERY - Host cp4026.mgmt is UP: PING WARNING - Packet loss = 71%, RTA = 87.59 ms [18:19:43] RECOVERY - Host cp4022.mgmt is UP: PING WARNING - Packet loss = 71%, RTA = 75.69 ms [18:19:44] RECOVERY - Host cp4024.mgmt is UP: PING WARNING - Packet loss = 71%, RTA = 73.95 ms [18:19:44] RECOVERY - Host cp4021.mgmt is UP: PING WARNING - Packet loss = 71%, RTA = 76.41 ms [18:19:45] RECOVERY - Host cp4027.mgmt is UP: PING WARNING - Packet loss = 71%, RTA = 76.56 ms [18:19:45] RECOVERY - Host cp4030.mgmt is UP: PING WARNING - Packet loss = 71%, RTA = 76.76 ms [18:19:46] RECOVERY - Host cp4025.mgmt is UP: PING WARNING - Packet loss = 71%, RTA = 76.48 ms [18:19:46] RECOVERY - Host ps1-23-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 69.05 ms [18:19:47] RECOVERY - Host ps1-22-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 69.50 ms [18:19:49] RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 68.83 ms [18:19:51] RECOVERY - Host cp4035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 72.29 ms [18:19:51] RECOVERY - Host cp4036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 80.46 ms [18:19:51] RECOVERY - Host cp4032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.08 ms [18:20:09] RECOVERY - Host lvs4005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 82.92 ms [18:20:13] RECOVERY - Host cp4034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 95.49 ms [18:20:13] RECOVERY - Host cr4-ulsfo.mgmt is UP: PING OK - Packet loss = 0%, RTA = 68.73 ms [18:20:13] RECOVERY - Host cr3-ulsfo.mgmt is UP: PING OK - Packet loss = 0%, RTA = 68.87 ms [18:20:13] RECOVERY - Host dns4001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 80.68 ms [18:20:13] RECOVERY - Host ganeti4001.mgmt is UP: PING OK - 
Packet loss = 0%, RTA = 80.86 ms [18:20:14] RECOVERY - Host ganeti4003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.36 ms [18:20:14] RECOVERY - Host dns4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 82.79 ms [18:20:15] RECOVERY - Host ganeti4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 80.66 ms [18:20:15] RECOVERY - Host lvs4006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 76.09 ms [18:20:16] RECOVERY - Host ganeti4004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 83.55 ms [18:20:16] RECOVERY - Host lvs4007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.50 ms [18:20:16] (03PS24) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) [18:20:17] RECOVERY - Host scs-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 68.60 ms [18:20:18] (03PS13) 10Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) [18:20:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1069.eqiad.wmnet with reason: host reimage [18:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:52] (03PS4) 10Jbond: Revert "C:java: use correct defined call" [puppet] - 10https://gerrit.wikimedia.org/r/770930 [18:21:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1070.eqiad.wmnet with reason: host reimage [18:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1071.eqiad.wmnet with reason: host reimage [18:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:53] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be1070.eqiad.wmnet with reason: 
host reimage [18:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P22603 and previous config saved to /var/cache/conftool/dbconfig/20220315-182355-ladsgroup.json [18:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1069.eqiad.wmnet with reason: host reimage [18:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:11] RECOVERY - Host cp4028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 68.70 ms [18:25:15] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [18:26:04] (03CR) 10JHathaway: [C: 03+2] Prepare to move vendored modules to vendor_modules [puppet] - 10https://gerrit.wikimedia.org/r/770099 (https://phabricator.wikimedia.org/T302423) (owner: 10JHathaway) [18:26:22] (03CR) 10JHathaway: [C: 03+2] rsync: fix rubocop style violations [puppet] - 10https://gerrit.wikimedia.org/r/770969 (https://phabricator.wikimedia.org/T302423) (owner: 10JHathaway) [18:27:11] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1071.eqiad.wmnet with reason: host reimage [18:27:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298563)', diff saved to https://phabricator.wikimedia.org/P22604 and previous config saved to /var/cache/conftool/dbconfig/20220315-182711-marostegui.json [18:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[18:27:17] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [18:29:07] PROBLEM - Check the NTP synchronisation status of timesyncd on ms-be1070 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.134.3: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [18:29:07] PROBLEM - puppet last run on ms-be1070 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.134.3: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:29:34] (03PS1) 10Ayounsi: Upgrade mr1-ulsfo to SRX300 [homer/public] - 10https://gerrit.wikimedia.org/r/770998 [18:29:34] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1070.eqiad.wmnet with OS buster [18:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:39] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1070.eqiad.wmnet with OS buster executed with errors: -... [18:30:06] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1069.eqiad.wmnet with OS buster [18:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:11] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1069.eqiad.wmnet with OS buster executed with errors: -... 
[18:30:41] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1070 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.134.3: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:30:41] PROBLEM - very high load average likely xfs on ms-be1070 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.134.3: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:31:06] (03CR) 10Ayounsi: [C: 03+2] Upgrade mr1-ulsfo to SRX300 [homer/public] - 10https://gerrit.wikimedia.org/r/770998 (owner: 10Ayounsi) [18:31:36] (03Merged) 10jenkins-bot: Upgrade mr1-ulsfo to SRX300 [homer/public] - 10https://gerrit.wikimedia.org/r/770998 (owner: 10Ayounsi) [18:32:45] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban: Increase max.incremental.fetch.session.cache.slots on Kafka jumbo eqiad - https://phabricator.wikimedia.org/T303324 (10Ottomata) Deployed and restarted brokers. Let's watch fetch sessions and cache evictions over the next few days. [18:32:46] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1071.eqiad.wmnet with OS buster [18:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:51] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1071.eqiad.wmnet with OS buster executed with errors: -... 
[18:33:01] RECOVERY - very high load average likely xfs on ms-be1070 is OK: OK - load average: 0.48, 0.72, 0.54 https://wikitech.wikimedia.org/wiki/Swift [18:34:12] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: fix data-transfer usage comment [cookbooks] - 10https://gerrit.wikimedia.org/r/770614 (owner: 10Ryan Kemper) [18:35:38] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6009.drmrs.wmnet with reason: host reimage [18:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:51] (03CR) 10Ryan Kemper: [C: 03+1] "All of cookbooks/sre/wdqs and cookbooks/sre/elasticsearch look good. So here's my +1 for the search team cookbooks" [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 (owner: 10Volans) [18:37:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:17] (03PS4) 10JHathaway: Move vendored modules to vendor_modules [puppet] - 10https://gerrit.wikimedia.org/r/770960 (https://phabricator.wikimedia.org/T302423) [18:39:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P22605 and previous config saved to /var/cache/conftool/dbconfig/20220315-183900-ladsgroup.json [18:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:06] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6009.drmrs.wmnet with reason: host reimage [18:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:00] (03PS5) 10Jbond: Revert "C:java: use correct defined call" [puppet] - 10https://gerrit.wikimedia.org/r/770930 [18:42:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P22606 and previous config saved to 
/var/cache/conftool/dbconfig/20220315-184216-marostegui.json [18:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1071.eqiad.wmnet with OS buster [18:43:22] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1070.eqiad.wmnet with OS stretch [18:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:31] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-be1071.eqiad.wmnet with OS buster [18:43:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1069.eqiad.wmnet with OS stretch [18:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:43] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-be1070.eqiad.wmnet with OS stretch [18:43:51] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-be1069.eqiad.wmnet with OS stretch [18:43:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:43:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply [18:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:30] (03PS1) 10Andrew Bogott: Fix invalid ref to last_backup_with_snapshot.valid [puppet] - 10https://gerrit.wikimedia.org/r/770999 (https://phabricator.wikimedia.org/T303870) [18:46:14] (03CR) 10Bking: [C: 03+1] "Search team playbooks look good, thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/770456 (owner: 10Volans) [18:48:20] (03PS2) 10Andrew Bogott: Fix invalid ref to last_backup_with_snapshot.valid [puppet] - 10https://gerrit.wikimedia.org/r/770999 (https://phabricator.wikimedia.org/T303870) [18:50:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:55] (03Abandoned) 10Andrew Bogott: novaproxy: add redirects for wmfcloud.org and www.wmfcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/762074 (https://phabricator.wikimedia.org/T301592) (owner: 10Andrew Bogott) [18:54:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298743)', diff saved to https://phabricator.wikimedia.org/P22607 and previous config saved to /var/cache/conftool/dbconfig/20220315-185405-ladsgroup.json [18:54:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1160.eqiad.wmnet with reason: Maintenance [18:54:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1160.eqiad.wmnet with reason: Maintenance [18:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:10] T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743 [18:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T298743)', diff saved to https://phabricator.wikimedia.org/P22608 and previous config saved to /var/cache/conftool/dbconfig/20220315-185413-ladsgroup.json [18:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1071.eqiad.wmnet with reason: host reimage [18:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:14] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1070.eqiad.wmnet with reason: host reimage [18:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:19] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1069.eqiad.wmnet with reason: host reimage [18:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:13] (03PS1) 10Ahmon Dancy: mwscript: Support --force-version flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771001 [18:57:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P22609 and previous config saved to /var/cache/conftool/dbconfig/20220315-185721-marostegui.json [18:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:59] (03CR) 10jerkins-bot: [V: 04-1] mwscript: Support --force-version flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771001 (owner: 10Ahmon Dancy) [18:59:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1071.eqiad.wmnet with reason: host reimage [18:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:32] (03PS2) 10Ahmon Dancy: mwscript: Support --force-version flag [mediawiki-config] - 
https://gerrit.wikimedia.org/r/771001
[19:00:55] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1070.eqiad.wmnet with reason: host reimage
[19:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:19] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be1069.eqiad.wmnet with reason: host reimage
[19:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:09] (PS1) BryanDavis: toolhub: Bumg container version to 2022-03-15-181014-production [deployment-charts] - https://gerrit.wikimedia.org/r/771002
[19:03:29] PROBLEM - Check size of conntrack table on ms-be1069 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.131.2: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[19:03:41] (PS2) Ryan Kemper: [wdqs] add jvmquake options to wdqs1010 for testing [puppet] - https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) (owner: DCausse)
[19:03:59] (PS2) BryanDavis: toolhub: Bump container version to 2022-03-15-181014-production [deployment-charts] - https://gerrit.wikimedia.org/r/771002
[19:06:03] PROBLEM - Check size of conntrack table on ms-be1069 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.131.2: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[19:06:05] PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.131.2: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:06:07] PROBLEM - configured eth on ms-be1069 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.131.2: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[19:07:15] (PS1) QChris: Add .gitreview [debs/istio] - https://gerrit.wikimedia.org/r/771005
[19:07:17]
(CR) QChris: [V: +2 C: +2] Add .gitreview [debs/istio] - https://gerrit.wikimedia.org/r/771005 (owner: QChris)
[19:10:21] PROBLEM - Host ms-be1069 is DOWN: PING CRITICAL - Packet loss = 100%
[19:10:31] RECOVERY - Host ms-be1069 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[19:11:09] RECOVERY - Check size of conntrack table on ms-be1069 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[19:11:15] RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:11:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298743)', diff saved to https://phabricator.wikimedia.org/P22610 and previous config saved to /var/cache/conftool/dbconfig/20220315-191140-ladsgroup.json
[19:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:49] T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743
[19:11:57] (PS3) Ryan Kemper: [wdqs] add jvmquake options to wdqs1010 for testing [puppet] - https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) (owner: DCausse)
[19:11:59] (CR) BryanDavis: [C: +2] toolhub: Bump container version to 2022-03-15-181014-production [deployment-charts] - https://gerrit.wikimedia.org/r/771002 (owner: BryanDavis)
[19:12:17] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6009.drmrs.wmnet with OS buster
[19:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:25] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6009.drmrs.wmnet with OS buster completed: - cp6009
(**WARN**) -...
[19:12:25] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:12:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298563)', diff saved to https://phabricator.wikimedia.org/P22611 and previous config saved to /var/cache/conftool/dbconfig/20220315-191226-marostegui.json
[19:12:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[19:12:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[19:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:30] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[19:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298563)', diff saved to https://phabricator.wikimedia.org/P22612 and previous config saved to /var/cache/conftool/dbconfig/20220315-191234-marostegui.json
[19:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:37] (PS4) Ryan Kemper: [wdqs] test jvmquake options on wdqs1010 [puppet] - https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) (owner: DCausse)
[19:16:00] (Merged) jenkins-bot: toolhub: Bump container version to 2022-03-15-181014-production [deployment-charts] - https://gerrit.wikimedia.org/r/771002 (owner: BryanDavis)
[19:16:26] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) (owner: DCausse)
[19:18:05] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply
[19:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:23] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply
[19:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:42] that was much nicer than the last attempt :)
[19:22:41] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply
[19:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:49] (PS3) Ahmon Dancy: mwscript: Support --force-version flag [mediawiki-config] - https://gerrit.wikimedia.org/r/771001
[19:24:42] (PS6) Jbond: Revert "C:java: use correct defined call" [puppet] - https://gerrit.wikimedia.org/r/770930
[19:24:56] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply
[19:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:12] (CR) Ahmon Dancy: "Timo, this is something I had discussed with you a while back about needing to run rebuildLocalisationCache.php, targeting a specific vers" [mediawiki-config] - https://gerrit.wikimedia.org/r/771001 (owner: Ahmon Dancy)
[19:25:19] (CR) jerkins-bot: [V: -1] Revert "C:java: use correct defined call" [puppet] - https://gerrit.wikimedia.org/r/770930 (owner: Jbond)
[19:26:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P22614 and previous config saved to /var/cache/conftool/dbconfig/20220315-192647-ladsgroup.json
[19:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:40] (PS7) Jbond: Revert "C:java: use correct defined call" [puppet] - https://gerrit.wikimedia.org/r/770930
[19:28:25] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34339/console" [puppet] - https://gerrit.wikimedia.org/r/770930 (owner: Jbond)
[19:30:03] SRE-tools, DBA, Infrastructure-Foundations, Sustainability (Incident Followup), User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (Ladsgroup) a:Ladsgroup I looked at this a b...
[19:30:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[19:30:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[19:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298557)', diff saved to https://phabricator.wikimedia.org/P22615 and previous config saved to /var/cache/conftool/dbconfig/20220315-193029-marostegui.json
[19:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:33] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[19:32:45] (CR) Krinkle: "General direction LGTM. Can you connect this to a task or larger objective as to why?
We talked about it, but I don't recall it exactly ot" [mediawiki-config] - https://gerrit.wikimedia.org/r/771001 (owner: Ahmon Dancy)
[19:33:50] (CR) Krinkle: mwscript: Support --force-version flag (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/771001 (owner: Ahmon Dancy)
[19:36:59] RECOVERY - configured eth on ms-be1069 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[19:39:37] (PS1) Jbond: puppet: add vendored module support [software/puppet-compiler] - https://gerrit.wikimedia.org/r/771008
[19:40:56] (CR) jerkins-bot: [V: -1] puppet: add vendored module support [software/puppet-compiler] - https://gerrit.wikimedia.org/r/771008 (owner: Jbond)
[19:41:12] (PS26) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454)
[19:41:14] (PS5) Ryan Kemper: [wdqs] test jvmquake options on wdqs1010 [puppet] - https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) (owner: DCausse)
[19:41:36] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) (owner: DCausse)
[19:41:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P22616 and previous config saved to /var/cache/conftool/dbconfig/20220315-194152-ladsgroup.json
[19:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:43] (CR) jerkins-bot: [V: -1] Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[19:47:14] (PS6) Gehel: [wdqs] test jvmquake options on wdqs1010 [puppet] - https://gerrit.wikimedia.org/r/770978
(https://phabricator.wikimedia.org/T293862) (owner: DCausse)
[19:48:05] (CR) Gehel: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) (owner: DCausse)
[19:51:17] (PS1) Herron: watchrat: require 3+ sites to agree on error status before alerting [alerts] - https://gerrit.wikimedia.org/r/771009 (https://phabricator.wikimedia.org/T299147)
[19:52:36] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-be
[19:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:50] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=varnish-fe
[19:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:57] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-tls
[19:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:10] (PS27) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454)
[19:56:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298743)', diff saved to https://phabricator.wikimedia.org/P22617 and previous config saved to /var/cache/conftool/dbconfig/20220315-195657-ladsgroup.json
[19:56:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[19:57:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[19:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:02] T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743
[19:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:21] (PS4) Ahmon Dancy: mwscript: Support --force-version flag [mediawiki-config] - https://gerrit.wikimedia.org/r/771001 (https://phabricator.wikimedia.org/T303878)
[19:57:23] (PS1) Kosta Harlan: GrowthExperiments: Add another entry to GECampaignPatterns [mediawiki-config] - https://gerrit.wikimedia.org/r/771011 (https://phabricator.wikimedia.org/T302738)
[19:58:12] (CR) Ahmon Dancy: mwscript: Support --force-version flag (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/771001 (https://phabricator.wikimedia.org/T303878) (owner: Ahmon Dancy)
[19:59:23] (PS7) Gehel: [wdqs] test jvmquake options on wdqs1010 [puppet] - https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) (owner: DCausse)
[19:59:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[19:59:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[19:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T298743)', diff saved to https://phabricator.wikimedia.org/P22618 and previous config saved to /var/cache/conftool/dbconfig/20220315-195934-ladsgroup.json
[19:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:56] (CR) MewOphaswongse: [C: +1] GrowthExperiments: Add another entry to GECampaignPatterns [mediawiki-config] - https://gerrit.wikimedia.org/r/771011 (https://phabricator.wikimedia.org/T302738) (owner: Kosta Harlan)
[20:00:05] RoanKattouw and Urbanecm: Time
to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220315T2000).
[20:00:05] kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:26] (CR) Gehel: "Still needs the .deb package to be available" [puppet] - https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) (owner: DCausse)
[20:00:43] hi
[20:01:15] I can deploy the change but fine if someone else wants to do it
[20:02:14] OK, I'll get started :)
[20:02:49] (CR) Kosta Harlan: [C: +2] "backport/config window" [mediawiki-config] - https://gerrit.wikimedia.org/r/771011 (https://phabricator.wikimedia.org/T302738) (owner: Kosta Harlan)
[20:03:46] (Merged) jenkins-bot: GrowthExperiments: Add another entry to GECampaignPatterns [mediawiki-config] - https://gerrit.wikimedia.org/r/771011 (https://phabricator.wikimedia.org/T302738) (owner: Kosta Harlan)
[20:03:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298557)', diff saved to https://phabricator.wikimedia.org/P22619 and previous config saved to /var/cache/conftool/dbconfig/20220315-200357-marostegui.json
[20:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:04] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[20:04:25] kostajh: I just arrived too if you need me
[20:04:37] urbanecm: ty, I'll let you know
[20:07:46] (CR) Btullis: Add helm charts and a helmfile configuration for datahub (24 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[20:09:50] urbanecm: I can't actually verify because the code change this config supports is in the
process of merging (flaky CI build), but it didn't break anything and seems fine to sync as is
[20:10:17] kostajh: sounds good to me
[20:10:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298563)', diff saved to https://phabricator.wikimedia.org/P22620 and previous config saved to /var/cache/conftool/dbconfig/20220315-201147-marostegui.json
[20:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:52] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[20:12:55] !log kharlan@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:771011|GrowthExperiments: Add another entry to GECampaignPatterns (T302738)]] (duration: 02m 22s)
[20:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:58] T302738: Account creation: proof-of-concept landing page with video - https://phabricator.wikimedia.org/T302738
[20:14:06] all done
[20:17:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:17:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P22621 and previous config saved to /var/cache/conftool/dbconfig/20220315-201902-marostegui.json
[20:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:20] (PS1) Andrew Bogott: backy2: move sqlite db to /srv [puppet] -
https://gerrit.wikimedia.org/r/771016
[20:21:29] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply
[20:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:06] (PS28) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454)
[20:22:48] (CR) Andrew Bogott: [C: +2] backy2: move sqlite db to /srv [puppet] - https://gerrit.wikimedia.org/r/771016 (owner: Andrew Bogott)
[20:26:27] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply
[20:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P22622 and previous config saved to /var/cache/conftool/dbconfig/20220315-202652-marostegui.json
[20:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:13] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6010.drmrs.wmnet with OS buster
[20:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:26] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6010.drmrs.wmnet with OS buster
[20:27:43] !log Toolhub: running post-deploy database migrations
[20:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:51] SRE-tools, DBA, Infrastructure-Foundations, Sustainability (Incident Followup),
User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (jcrespo) To not get distracted, I would like to...
[20:34:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P22623 and previous config saved to /var/cache/conftool/dbconfig/20220315-203407-marostegui.json
[20:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:30] SRE-tools, DBA, Infrastructure-Foundations, Sustainability (Incident Followup), User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (Ladsgroup) @jcrespo Did you test the POC I ment...
[20:38:55] (CR) Btullis: Add helm charts and a helmfile configuration for datahub (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[20:40:53] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 71 probes of 671 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:41:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P22624 and previous config saved to /var/cache/conftool/dbconfig/20220315-204157-marostegui.json
[20:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:22] (CR) RLazarus: [C: +1] varnish_slo: enable multi/all selectors and display all sites in panels [grafana-grizzly] - https://gerrit.wikimedia.org/r/768108 (https://phabricator.wikimedia.org/T302842) (owner: Herron)
[20:45:44] (PS6) Herron:
varnish_slo: enable multi/all selectors and display all sites in panels [grafana-grizzly] - https://gerrit.wikimedia.org/r/768108 (https://phabricator.wikimedia.org/T302842)
[20:46:23] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:47:16] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6010.drmrs.wmnet with reason: host reimage
[20:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:27] (CR) DCausse: [wdqs] test jvmquake options on wdqs1010 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) (owner: DCausse)
[20:47:45] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 63 probes of 671 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:47:56] (CR) Btullis: Add helm charts and a helmfile configuration for datahub (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[20:49:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298557)', diff saved to https://phabricator.wikimedia.org/P22625 and previous config saved to /var/cache/conftool/dbconfig/20220315-204912-marostegui.json
[20:49:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[20:49:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[20:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:17] T298557: Fix mismatching field type of page.page_touched on wmf wikis -
https://phabricator.wikimedia.org/T298557
[20:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:58] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6010.drmrs.wmnet with reason: host reimage
[20:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:44] (PS1) Zabe: rdbms: provide $owner argument in LoadBalancer::flushPrimarySessions() [core] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/770935 (https://phabricator.wikimedia.org/T303885)
[20:56:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298743)', diff saved to https://phabricator.wikimedia.org/P22626 and previous config saved to /var/cache/conftool/dbconfig/20220315-205618-ladsgroup.json
[20:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:23] T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743
[20:56:46] (CR) Herron: [V: +2 C: +2] varnish_slo: enable multi/all selectors and display all sites in panels [grafana-grizzly] - https://gerrit.wikimedia.org/r/768108 (https://phabricator.wikimedia.org/T302842) (owner: Herron)
[20:57:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298563)', diff saved to https://phabricator.wikimedia.org/P22627 and previous config saved to /var/cache/conftool/dbconfig/20220315-205702-marostegui.json
[20:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:09] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563
[20:57:29] (PS1) RLazarus: kubernetes: Upgrade default envoy version to 1.18.3 [puppet] - https://gerrit.wikimedia.org/r/771053
(https://phabricator.wikimedia.org/T300324)
[21:02:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T300775)', diff saved to https://phabricator.wikimedia.org/P22628 and previous config saved to /var/cache/conftool/dbconfig/20220315-210204-marostegui.json
[21:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:14] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[21:08:23] (PS1) Herron: slo: update logstash queries for site/cluster selector [grafana-grizzly] - https://gerrit.wikimedia.org/r/771064
[21:08:31] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 fails to pxe boot - https://phabricator.wikimedia.org/T303776 (RobH) Updated bios, raid, and nic firmwares (both 1g and 10g)
[21:09:19] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1025.eqiad.wmnet with OS bullseye
[21:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:55] (CR) Ladsgroup: [C: +2] rdbms: provide $owner argument in LoadBalancer::flushPrimarySessions() [core] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/770935 (https://phabricator.wikimedia.org/T303885) (owner: Zabe)
[21:11:22] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1026.eqiad.wmnet with OS bullseye
[21:11:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P22629 and previous config saved to /var/cache/conftool/dbconfig/20220315-211123-ladsgroup.json
[21:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:27] Thanks for merging that Amir1
[21:12:47] ^^
[21:17:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved
to https://phabricator.wikimedia.org/P22630 and previous config saved to /var/cache/conftool/dbconfig/20220315-211711-marostegui.json
[21:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:43] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1025.eqiad.wmnet with OS bullseye
[21:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:47] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1026.eqiad.wmnet with OS bullseye
[21:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:20] (Merged) jenkins-bot: rdbms: provide $owner argument in LoadBalancer::flushPrimarySessions() [core] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/770935 (https://phabricator.wikimedia.org/T303885) (owner: Zabe)
[21:26:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P22631 and previous config saved to /var/cache/conftool/dbconfig/20220315-212628-ladsgroup.json
[21:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:29] (PS2) Herron: slo: update logstash queries for site/cluster selector [grafana-grizzly] - https://gerrit.wikimedia.org/r/771064
[21:27:48] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.26/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Backport: [[gerrit:770935|rdbms: provide $owner argument in LoadBalancer::flushPrimarySessions() (T303885)]] (duration: 00m 53s)
[21:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:52] T303885: Wikimedia\Rdbms\DBUnexpectedError: MWExceptionHandler::rollbackPrimaryChangesAndLog: Database is owned by ID '1923155107' (got '') - https://phabricator.wikimedia.org/T303885
[21:29:09] (CR) Herron: [V: +2 C: +2] slo: update logstash queries for
site/cluster selector [grafana-grizzly] - https://gerrit.wikimedia.org/r/771064 (owner: Herron)
[21:29:42] jeena: This is deployed, see if the error logs got fixed or not. If not, let me know and I will do a batch revert
[21:30:06] The errors had died down before sync but I will keep an eye on it
[21:30:39] thanks again!
[21:32:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P22632 and previous config saved to /var/cache/conftool/dbconfig/20220315-213216-marostegui.json
[21:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:59] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6010.drmrs.wmnet with OS buster
[21:36:00] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye
[21:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:10] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6010.drmrs.wmnet with OS buster completed: - cp6010 (**WARN**) -...
[21:36:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:36:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:13] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (Cmjohnson) @fgiunchedi These have been installed and I can ssh into each server but they're failing to do some puppet runs and updates. I am not sure if thi... [21:41:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298743)', diff saved to https://phabricator.wikimedia.org/P22633 and previous config saved to /var/cache/conftool/dbconfig/20220315-214133-ladsgroup.json [21:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:37] T298743: Apply alter for transcode_time_* columns on wmf wikis - https://phabricator.wikimedia.org/T298743 [21:45:02] (CR) Ladsgroup: "Running this code to test:" [software] - https://gerrit.wikimedia.org/r/769720 (https://phabricator.wikimedia.org/T301779) (owner: Ladsgroup) [21:45:24] (PS1) Herron: slo: update etcd queries to use site/cluster variables [grafana-grizzly] - https://gerrit.wikimedia.org/r/771069 [21:47:16] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye [21:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T300775)', diff
saved to https://phabricator.wikimedia.org/P22634 and previous config saved to /var/cache/conftool/dbconfig/20220315-214721-marostegui.json [21:47:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance [21:47:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance [21:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:26] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [21:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T300775)', diff saved to https://phabricator.wikimedia.org/P22635 and previous config saved to /var/cache/conftool/dbconfig/20220315-214729-marostegui.json [21:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:38] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [21:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:52] (PS2) Herron: slo: update etcd queries to use site/cluster variables [grafana-grizzly] - https://gerrit.wikimedia.org/r/771069 [21:49:45] (CR) Herron: [V: +2 C: +2] slo: update etcd queries to use site/cluster variables [grafana-grizzly] - https://gerrit.wikimedia.org/r/771069 (owner: Herron) [21:53:28] (PS1) BryanDavis: toolhub: Bump container version to 2022-03-15-214735-production [deployment-charts] - https://gerrit.wikimedia.org/r/771070 (https://phabricator.wikimedia.org/T303889) [21:53:58] (CR) BryanDavis: [C: +2] toolhub: Bump container version to 2022-03-15-214735-production [deployment-charts] -
https://gerrit.wikimedia.org/r/771070 (https://phabricator.wikimedia.org/T303889) (owner: BryanDavis) [21:55:37] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955 [21:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:41] T301955: Upgrade relforge to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301955 [21:56:39] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1024.eqiad.wmnet DHCP problems - https://phabricator.wikimedia.org/T303773 (RobH) updated the firmware for: idrac, bios, network cards [21:56:47] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955 [21:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:49] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1024.eqiad.wmnet DHCP problems - https://phabricator.wikimedia.org/T303773 (Andrew) After firmware upgrades, the behavior is somewhat worse; pxe boot fails entirely now (although dhcp seems to still be working!) ` Booting from B... [21:58:17] (Merged) jenkins-bot: toolhub: Bump container version to 2022-03-15-214735-production [deployment-charts] - https://gerrit.wikimedia.org/r/771070 (https://phabricator.wikimedia.org/T303889) (owner: BryanDavis) [21:59:55] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-be [21:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:18] (CR) Ladsgroup: "The schema change and replication logic is working just fine (e.g.
if there is at least a db to be skipped in codfw, it does run it on all" [software] - https://gerrit.wikimedia.org/r/769720 (https://phabricator.wikimedia.org/T301779) (owner: Ladsgroup) [22:00:26] RECOVERY - Check systemd state on cp6010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:00:45] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=varnish-fe [22:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:50] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=cache_text,service=ats-tls [22:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:20] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [22:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:34] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (Andrew) [22:02:28] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [22:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:04] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 21 threshold =0.15 breach: relocating_shards: 0, initializing_shards: 0, unassigned_shards: 21, number_of_in_flight_fetch: 0, number_of_nodes: 1, task_max_waiting_in_queue_millis: 0, active_shards: 21, cluster_name: relforge-eqiad-small-alpha, active_shards_percent_as_number: 50.0, timed_out: False, number_of_data_ [22:03:04] , active_primary_shards: 21, number_of_pending_tasks: 0, delayed_unassigned_shards: 0, status: red https://wikitech.wikimedia.org/wiki/Search%23Administration [22:03:18] PROBLEM - ElasticSearch health
check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 145 threshold =0.15 breach: number_of_nodes: 1, task_max_waiting_in_queue_millis: 0, number_of_pending_tasks: 0, active_shards: 146, unassigned_shards: 145, active_primary_shards: 146, status: red, number_of_data_nodes: 1, number_of_in_flight_fetch: 0, timed_out: False, delayed_unassigned_shards: 0, initializing_sh [22:03:18] relocating_shards: 0, active_shards_percent_as_number: 50.171821305841924, cluster_name: relforge-eqiad https://wikitech.wikimedia.org/wiki/Search%23Administration [22:03:35] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [22:03:36] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (Andrew) Firmware updates do not seem to have improved anything; same failure to pxe boot as before. [22:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:41] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [22:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [22:05:57] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [22:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:26] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye [22:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:16] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [22:07:17] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:42] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1026.eqiad.wmnet with OS bullseye [22:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:46] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: active_primary_shards: 37, unassigned_shards: 5, active_shards: 37, status: yellow, task_max_waiting_in_queue_millis: 0, timed_out: False, active_shards_percent_as_number: 88.09523809523809, number_of_nodes: 2, number_of_pending_tasks: 0, initializing_shards: 0, number_of_data_nodes: 2, relocating_shards: 0 [22:08:46] _of_in_flight_fetch: 0, cluster_name: relforge-eqiad-small-alpha, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:49] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1026.eqiad.wmnet with OS bullseye [22:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:15] (PS1) Bking: elasticsearch: remove custom restart handling [cookbooks] - https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) [22:26:16] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 130 threshold =0.15 breach: timed_out: False, number_of_pending_tasks: 0, relocating_shards: 0, status: yellow, unassigned_shards: 130, initializing_shards: 0, cluster_name: relforge-eqiad, active_primary_shards: 163, number_of_nodes: 2, number_of_in_flight_fetch: 0, delayed_unassigned_shards: 0, number_of_data_nod [22:26:16] ask_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 55.631399317406135, active_shards: 163 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:27:26] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL
- degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:36:02] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:57] (PS1) Ebernhardson: cirrus: Reenable saneitizer [puppet] - https://gerrit.wikimedia.org/r/771076 [22:44:35] (CR) jerkins-bot: [V: -1] cirrus: Reenable saneitizer [puppet] - https://gerrit.wikimedia.org/r/771076 (owner: Ebernhardson) [22:44:46] SRE, Security-Team, Stewards-and-global-tools: Investigate the practice of making thousands of global blocks per day on Meta-Wiki - https://phabricator.wikimedia.org/T303774 (AntiCompositeNumber) Historically, for whatever reason, that hasn't been done, and there are several examples of stewards usin... [22:46:46] (PS2) Ebernhardson: cirrus: Reenable saneitizer [puppet] - https://gerrit.wikimedia.org/r/771076 (https://phabricator.wikimedia.org/T302733) [23:02:15] (PS1) Jdlrobson: Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/770937 (https://phabricator.wikimedia.org/T299104) [23:08:06] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:08:46] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:09:44] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 56, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:12:21] (CR) Clare Ming: [C: +1] Add script to update vector skin preferences [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) -
https://gerrit.wikimedia.org/r/770937 (https://phabricator.wikimedia.org/T299104) (owner: Jdlrobson) [23:36:53] (PS1) Aaron Schulz: rdbms: use the LoadBalancer id in flushPrimarySessions() [core] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/770938 [23:46:18] (CertAlmostExpired) firing: Certificate for inference:30443 is about to expire - https://alerts.wikimedia.org [23:51:36] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:52:23] (CR) jerkins-bot: [V: -1] rdbms: use the LoadBalancer id in flushPrimarySessions() [core] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/770938 (owner: Aaron Schulz)