[00:08:45] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:10:41] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:14:25] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:16:21] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:16:25] (03PS1) 10Zabe: logstash: remove absented cron and file [puppet] - 10https://gerrit.wikimedia.org/r/711233 (https://phabricator.wikimedia.org/T273673) [00:17:22] (03PS1) 10Bstorm: maintain-dbusers: delete users that are removed from ldap [puppet] - 10https://gerrit.wikimedia.org/r/711234 (https://phabricator.wikimedia.org/T285332) [00:20:38] (03CR) 10Bstorm: "This is the least-effort version of doing this" [puppet] - 10https://gerrit.wikimedia.org/r/711234 (https://phabricator.wikimedia.org/T285332) (owner: 10Bstorm) [00:21:55] (03CR) 10Bstorm: [C: 03+2] toolforge: add shells in /usr/bin to wheel_of_misfortune [puppet] - 10https://gerrit.wikimedia.org/r/710598 (owner: 10Majavah) [00:24:11] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:26:07] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[00:35:45] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:36:15] !log bstorm@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [00:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:39] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:38:22] !log bstorm@cumin1001 END (ERROR) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=97) [00:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:13] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:51:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:54:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:56:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:56:57] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:59:05] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:59:30] (03PS1) 10Bstorm: wikireplicas: add labswiki manually to s6 and refactor a bit [puppet] - 10https://gerrit.wikimedia.org/r/711240 (https://phabricator.wikimedia.org/T287442) [01:00:13] !log bstorm@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [01:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:21] PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup.service,rsync-data-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:04:49] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:06:19] (03PS2) 10Dave Pifke: arclamp: add temporary excimer-k8s pipeline [puppet] - 10https://gerrit.wikimedia.org/r/711166 (https://phabricator.wikimedia.org/T288165) [01:08:25] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:15:35] (03PS1) 10Milimetric: role::common::aqs: update druid mw datasource [puppet] - 10https://gerrit.wikimedia.org/r/711241 [01:16:09] (03CR) 10jerkins-bot: [V: 04-1] role::common::aqs: update druid mw datasource [puppet] - 10https://gerrit.wikimedia.org/r/711241 (owner: 10Milimetric) [01:17:32] (03CR) 10Milimetric: [C: 03+1] "We forgot to point to the new snapshot so our data's a bit stale. This should be merged and deployed in accordance to the instructions at" [puppet] - 10https://gerrit.wikimedia.org/r/711241 (owner: 10Milimetric) [01:18:45] (03PS2) 10Milimetric: role::common::aqs: update druid mw datasource [puppet] - 10https://gerrit.wikimedia.org/r/711241 [01:22:31] !log bstorm@cumin1001 Added views for new wiki: jvwikisource T286245 [01:22:31] !log bstorm@cumin1001 END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0) [01:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:22:38] T286245: Prepare and check storage layer for jvwikisource - https://phabricator.wikimedia.org/T286245 [01:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:31:05] (03PS4) 10Legoktm: noc: Expose primary datacenter on conf/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710128 [01:32:36] (03CR) 10Jforrester: [C: 03+1] "Neat." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/710128 (owner: 10Legoktm) [01:33:45] (03PS1) 10Legoktm: configmaster: Add shellbox-constraints to disc_desired_state [puppet] - 10https://gerrit.wikimedia.org/r/711245 (https://phabricator.wikimedia.org/T285104) [01:33:57] (03PS2) 10Legoktm: configmaster: Add shellbox-constraints to disc_desired_state [puppet] - 10https://gerrit.wikimedia.org/r/711245 (https://phabricator.wikimedia.org/T285104) [01:34:02] (03CR) 10jerkins-bot: [V: 04-1] configmaster: Add shellbox-constraints to disc_desired_state [puppet] - 10https://gerrit.wikimedia.org/r/711245 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [01:35:09] (03CR) 10Legoktm: [C: 03+2] noc: Expose primary datacenter on conf/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710128 (owner: 10Legoktm) [01:35:30] (03CR) 10jerkins-bot: [V: 04-1] configmaster: Add shellbox-constraints to disc_desired_state [puppet] - 10https://gerrit.wikimedia.org/r/711245 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [01:35:55] (03Merged) 10jenkins-bot: noc: Expose primary datacenter on conf/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710128 (owner: 10Legoktm) [01:36:18] (03PS3) 10Legoktm: configmaster: Add shellbox-constraints to disc_desired_state [puppet] - 10https://gerrit.wikimedia.org/r/711245 (https://phabricator.wikimedia.org/T285104) [01:38:25] !log legoktm@deploy1002 Synchronized docroot/noc/conf/index.php: noc: Expose primary datacenter on conf/ (duration: 01m 06s) [01:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:38:38] https://noc.wikimedia.org/conf/ [01:38:42] > Current primary datacenter: codfw [01:38:47] nice [01:40:01] (03CR) 10Legoktm: [C: 03+2] configmaster: Add shellbox-constraints to disc_desired_state [puppet] - 10https://gerrit.wikimedia.org/r/711245 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [01:42:55] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 
3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:47:33] !log dpifke@deploy1002 Started deploy [performance/navtiming@12d8381]: Deploying https://gerrit.wikimedia.org/r/c/performance/navtiming/+/693423 [01:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:47:40] !log dpifke@deploy1002 Finished deploy [performance/navtiming@12d8381]: Deploying https://gerrit.wikimedia.org/r/c/performance/navtiming/+/693423 (duration: 00m 06s) [01:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:57] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:49:08] !log dpifke@deploy1002 Started deploy [performance/navtiming@12d8381]: Revert https://gerrit.wikimedia.org/r/c/performance/navtiming/+/693423 [01:49:13] !log dpifke@deploy1002 Finished deploy [performance/navtiming@12d8381]: Revert https://gerrit.wikimedia.org/r/c/performance/navtiming/+/693423 (duration: 00m 05s) [01:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:50:41] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:04:17] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:08:11] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:10:25] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:15:33] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:20:05] (03PS1) 10Samwilson: Disable Collection sidebar link on English Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711248 (https://phabricator.wikimedia.org/T288021) [02:21:19] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:29:31] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:33:25] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:46:55] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:49:49] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:50:49] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:52:47] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:56:39] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:05:51] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:08:23] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:11:19] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:11:39] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:16:07] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:37:31] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 
91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:39:27] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:42:24] (03CR) 10Razzi: [C: 03+2] role::common::aqs: update druid mw datasource [puppet] - 10https://gerrit.wikimedia.org/r/711241 (owner: 10Milimetric) [03:43:14] (03CR) 10Razzi: [C: 03+2] "I've done this before, not too complicated, going to go ahead and take care of it now" [puppet] - 10https://gerrit.wikimedia.org/r/711241 (owner: 10Milimetric) [03:45:02] !log razzi@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. - razzi@cumin1001 [03:45:07] !log razzi@cumin1001 END (FAIL) - Cookbook sre.aqs.roll-restart (exit_code=99) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
- razzi@cumin1001 [03:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:50:54] (03PS1) 10Razzi: Workaround quote escaping bug [cookbooks] - 10https://gerrit.wikimedia.org/r/711249 [03:51:03] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:53:01] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:58:51] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:06:41] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:15:10] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Master switchover s2 T287454 [04:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:15:18] T287454: Switchover s2 from db2107 to db2104 - https://phabricator.wikimedia.org/T287454 [04:15:27] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Master switchover s2 T287454 [04:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:16:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2104 with weight 0 T287454', diff saved to https://phabricator.wikimedia.org/P16996 and previous config saved to 
/var/cache/conftool/dbconfig/20210811-041625-root.json [04:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:20:40] (03PS2) 10Marostegui: mariadb: Promote db2104 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/711114 (https://phabricator.wikimedia.org/T287454) [04:29:57] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:31:16] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2104 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/711114 (https://phabricator.wikimedia.org/T287454) (owner: 10Marostegui) [04:31:53] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:35:46] (03PS2) 10Razzi: Workaround quote escaping bug [cookbooks] - 10https://gerrit.wikimedia.org/r/711249 [04:37:16] (03CR) 10Razzi: "We can do the same for the Druid cookbook too if we want" [cookbooks] - 10https://gerrit.wikimedia.org/r/711249 (owner: 10Razzi) [04:39:45] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:41:41] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:45:01] PROBLEM - Disk space on wdqs2004 is CRITICAL: DISK CRITICAL - free space: /srv 100959 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs2004&var-datasource=codfw+prometheus/ops [04:47:31] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:49:27] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:50:15] In 10 minutes we're going to failover s2 master
[04:53:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:00:04] marostegui and kormat: How many deployers does it take to do s2 database master failover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210811T0500).
[05:00:06] let's go?
[05:00:19] 👍
[05:00:22] jouncebot: deployers don't know...but it takes 2 days!
[05:00:26] !log Starting s2 codfw failover from db2107 to db2104 - T287454
[05:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:33] haha
[05:00:34] T287454: Switchover s2 from db2107 to db2104 - https://phabricator.wikimedia.org/T287454
[05:00:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s2 codfw as read-only for maintenance - T287454', diff saved to https://phabricator.wikimedia.org/P16997 and previous config saved to /var/cache/conftool/dbconfig/20210811-050040-marostegui.json
[05:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:52] ro confirmed
[05:01:03] here we go...
[05:01:07] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:01:18] stuck on Executing 'SET GLOBAL rpl_semi_sync_master_enabled = 1' on db2104 for now
[05:01:29] 🤞
[05:01:36] 30 seconds stuck on that now
[05:02:09] still there
[05:02:21] you're running from cumin1001, correct?
[05:02:29] yep
[05:02:44] processlist shows state NULL
[05:02:51] whatever that means
[05:02:52] | 795218262 | root | 10.64.32.25:39432 | NULL | Query | 104 | NULL | SET GLOBAL rpl_semi_sync_master_enabled = 1 | 0.000 |
[05:03:16] hmmm
[05:03:21] does that mean the command is still running?
[05:03:44] the change has already taken effect btw
[05:03:51] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 169 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:04:14] yeah, I was checking that
[05:04:46] I have killed it
[05:04:51] and now I do get asked to continue
[05:04:53] should I go ahead?
[05:05:04] ERROR 2013: Lost connection to MySQL server during query
[05:05:04] [WARNING] Semisync could not be enabled on the new master
[05:05:08] But it is indeed enabled
[05:05:12] i'd say yes, go for it
[05:05:19] ok, let's see
[05:05:31] it has failed
[05:05:40] /o\
[05:05:42] Executing 'SET SESSION max_statement_time = 5.0'
[05:05:42] Traceback (most recent call last):
[05:05:42] File "/usr/bin/db-switchover", line 11, in
[05:05:45] I have all the trace
[05:06:00] oh. probably because the connection was killed
[05:06:04] and it's not reconnecting :(
[05:06:04] yeah
[05:06:07] I thought it would reconnect
[05:06:12] it's.. not that smart
[05:06:19] let me try one more run and if not let's roll back
[05:06:26] 👍
[05:06:35] marostegui: can we kill the query next time, while keeping the connection?
[05:06:42] or is that not how this works?
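Editor's note on the question above (killing the query while keeping the connection): MariaDB's KILL statement has a QUERY form that terminates the statement a connection is currently running but leaves the connection itself open, whereas a plain KILL closes the connection. The log does not show which command was actually run, and whether KILL QUERY would have unblocked the stuck SET GLOBAL in this case is not established here. The sketch below only illustrates building that statement from the processlist row pasted above. The helper names parse_processlist_row and kill_query_statement are hypothetical and are not part of db-switchover or any Wikimedia tooling.

```python
# Minimal sketch: derive a "KILL QUERY <id>" statement from one row of
# SHOW PROCESSLIST output as printed by the mysql client.
# Helper names are hypothetical; nothing here talks to a database.

# Column order of MariaDB's SHOW PROCESSLIST output.
PROCESSLIST_COLUMNS = [
    "id", "user", "host", "db", "command", "time", "state", "info", "progress",
]


def parse_processlist_row(row: str) -> dict:
    """Split one pipe-delimited table row into named fields.

    Assumes the Info column itself contains no '|' character.
    """
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    if len(cells) != len(PROCESSLIST_COLUMNS):
        raise ValueError(f"expected {len(PROCESSLIST_COLUMNS)} cells, got {len(cells)}")
    return dict(zip(PROCESSLIST_COLUMNS, cells))


def kill_query_statement(row: str) -> str:
    """Return the statement that stops the running query but keeps the connection."""
    thread_id = int(parse_processlist_row(row)["id"])  # int() guards against junk input
    return f"KILL QUERY {thread_id}"


# The row as pasted in the log at 05:02:52.
ROW = (
    "| 795218262 | root | 10.64.32.25:39432 | NULL | Query | 104 | NULL "
    "| SET GLOBAL rpl_semi_sync_master_enabled = 1 | 0.000 |"
)

if __name__ == "__main__":
    print(kill_query_statement(ROW))
```

Keeping the statement builder separate from any database client makes it easy to check by hand before running it on a production master.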
[05:06:51] it is a bit of a mess now, as the original master is on RO
[05:06:55] let me undo that now
[05:07:22] ok, now it works
[05:07:34] looks like it has worked
[05:07:40] 🤯
[05:07:47] let's confirm
[05:08:09] replication looks good
[05:08:32] RO looks good
[05:08:41] and semi sync
[05:08:42] orchestrator isn't happy, says replication lag
[05:08:49] that's because of the heartbeat
[05:08:52] ah, we need to clean up th.. yeah
[05:08:54] it is not real
[05:08:55] yeah
[05:09:22] so I am going to set it to rw
[05:10:10] the old master is on RO
[05:10:16] and has no slaves
[05:10:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2104 to s2 master and set section read-write T287454', diff saved to https://phabricator.wikimedia.org/P16998 and previous config saved to /var/cache/conftool/dbconfig/20210811-051041-root.json
[05:10:44] RW enabled
[05:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:50] T287454: Switchover s2 from db2107 to db2104 - https://phabricator.wikimedia.org/T287454
[05:11:10] let me clear heartbeat so orchestrator looks good
[05:11:57] should be good now
[05:12:32] yep!
[05:13:19] I will paste all the traces on the task, we need to take a look at why the first one didn't work and the second did [05:13:37] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 10 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:16:06] 🎉 [05:16:45] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s2-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/710517 (https://phabricator.wikimedia.org/T287454) (owner: 10Marostegui) [05:18:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2107 T287454', diff saved to https://phabricator.wikimedia.org/P16999 and previous config saved to /var/cache/conftool/dbconfig/20210811-051856-marostegui.json [05:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:03] T287454: Switchover s2 from db2107 to db2104 - https://phabricator.wikimedia.org/T287454 [05:22:13] !log Stop replication on db2107 T287454 [05:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:17] (03PS1) 10Marostegui: db2107: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/711257 (https://phabricator.wikimedia.org/T287454) [05:23:56] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:23:56] (03CR) 10Marostegui: [C: 03+2] db2107: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/711257 (https://phabricator.wikimedia.org/T287454) (owner: 10Marostegui) [05:26:00] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:30:32] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:31:42] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:32:27] (03PS1) 10Marostegui: install_server: Reimage db2107 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/711258 (https://phabricator.wikimedia.org/T287230) [05:33:26] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2107 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/711258 (https://phabricator.wikimedia.org/T287230) (owner: 10Marostegui) [05:36:56] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:38:10] PROBLEM - snapshot of s4 in eqiad on alert1001 is CRITICAL: Last snapshot for s4 at eqiad (db1139.eqiad.wmnet:3314) taken on 2021-08-11 03:38:35 is 1482 GB, but previous one was 1758 GB, a change of 15.7% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [05:38:20] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:41:18] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:46:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:53:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2107.codfw.wmnet with reason: REIMAGE [05:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:59] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2107.codfw.wmnet with reason: REIMAGE [05:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:07] (03Abandoned) 10Giuseppe Lavagetto: mediawiki::web::site: add a simple rule to go away conditions [puppet] - 10https://gerrit.wikimedia.org/r/593728 (owner: 10Giuseppe Lavagetto) [06:09:42] (03Abandoned) 10Giuseppe Lavagetto: mediawiki::web::vhost: add the ability to define go away conditions [puppet] - 10https://gerrit.wikimedia.org/r/593727 (owner: 10Giuseppe Lavagetto) [06:15:49] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:17:25] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:33:04] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for c-ares [puppet] - 10https://gerrit.wikimedia.org/r/711163 (owner: 10Muehlenhoff) [06:33:06] (03PS3) 10Giuseppe Lavagetto: mwdebug: remove from staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/710966 [06:42:39] (03CR) 10Dzahn: [C: 04-1] "Thanks, looks good so far, but one issue to fix:" [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [06:48:40] (03PS4) 10Giuseppe Lavagetto: mwdebug: remove from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/710966 [06:50:03] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:51:55] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:52:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:54:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:56:08] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: remove from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/710966 (owner: 10Giuseppe Lavagetto) [06:58:31] (03Merged) 10jenkins-bot: mwdebug: remove from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/710966 (owner: 10Giuseppe Lavagetto) [06:59:33] <_joe_> !log deleting the staging deployment of mwdebug [06:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:44] 10SRE, 10Release-Engineering-Team, 10Elections: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10Dzahn) Yes, I agree. Back in the days there was only one type of "shell access" and the term is still used in that sense. 
Then we split into "deploy... [07:03:25] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:07:17] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:08:13] 10SRE, 10Release-Engineering-Team, 10Elections: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10Qgil) Thank you @Dzahn! One more piece of data for this puzzle. [07:09:11] !log restart etherpad-lite on etherpad1002 to pick up c-ares security updates [07:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:03] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:17:59] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:18:09] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:18:25] (03CR) 10Giuseppe Lavagetto: [C: 03+1] drafonfly: Clean up and document dragonfly classes [puppet] - 10https://gerrit.wikimedia.org/r/711168 (owner: 10JMeybohm) [07:26:31] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:28:25] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:31:49] (03PS1) 10Jelto: profile::gitlab rsync fix rsync backup command [puppet] - 10https://gerrit.wikimedia.org/r/711348 (https://phabricator.wikimedia.org/T285867) [07:32:32] (03CR) 10jerkins-bot: [V: 04-1] profile::gitlab rsync fix rsync backup command [puppet] - 10https://gerrit.wikimedia.org/r/711348 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [07:33:11] (03CR) 10Dzahn: "ah, yea, this needs the rsync module name, not a full path. we ran into this a couple times before" [puppet] - 10https://gerrit.wikimedia.org/r/711348 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [07:33:59] (03PS2) 10Jelto: profile::gitlab rsync fix rsync backup command [puppet] - 10https://gerrit.wikimedia.org/r/711348 (https://phabricator.wikimedia.org/T285867) [07:34:22] (03CR) 10Dzahn: [C: 03+1] profile::gitlab rsync fix rsync backup command [puppet] - 10https://gerrit.wikimedia.org/r/711348 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [07:35:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:36:25] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30545/console" [puppet] - 10https://gerrit.wikimedia.org/r/711348 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [07:37:13] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:37:58] (03CR) 10Jelto: [V: 03+1 C: 03+2] 
profile::gitlab rsync fix rsync backup command [puppet] - 10https://gerrit.wikimedia.org/r/711348 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [07:41:57] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:45:11] (03PS8) 10Jcrespo: mediabackup: Puppetize the media backup workers [puppet] - 10https://gerrit.wikimedia.org/r/711153 (https://phabricator.wikimedia.org/T276442) [07:47:09] (03PS9) 10Jcrespo: mediabackup: Puppetize the media backup workers [puppet] - 10https://gerrit.wikimedia.org/r/711153 (https://phabricator.wikimedia.org/T276442) [07:47:41] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:48:30] (03PS5) 10Marostegui: mariadb: Promote db1107 to m3 master. [puppet] - 10https://gerrit.wikimedia.org/r/711105 (https://phabricator.wikimedia.org/T288197) [07:51:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) Thanks a lot @Cmjohnson ! I will continue getting them into production now. [07:53:23] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:57:13] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:00:36] (03PS5) 10Zabe: zuul: migrate cron of zuul_repack to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) [08:02:12] !log rolling restart of AQS to pick up the c-ares security update [08:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:34] (03CR) 10Zabe: zuul: migrate cron of zuul_repack to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [08:04:19] (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/30547/contint1001.wikimedia.org/change.contint1001.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [08:05:23] (03CR) 10Dzahn: [V: 04-1] "please use a full path in the command line (then you can skip PATH) and it should fix the issue :parameter 'command' expects a match for S" [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [08:06:08] (03CR) 10Dzahn: [V: 04-1] "[contint1001:~] $ which find" [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [08:07:23] 10SRE, 10Datacenter-Switchover: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Marostegui) [08:11:48] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [08:14:39] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:16:14] (03PS1) 10Kormat: ProductionServices: Clean up parsercache entries a bit. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711372 [08:16:25] (03CR) 10Marostegui: Move parsercache DB config to *Services.php (1/3) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703629 (owner: 10Krinkle) [08:17:11] (03CR) 10Marostegui: [C: 03+1] ProductionServices: Clean up parsercache entries a bit. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711372 (owner: 10Kormat) [08:17:15] !log restart Turnilo on an-tool1007 to pick up c-ares security updates [08:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:26] (03CR) 10Kormat: Move parsercache DB config to *Services.php (1/3) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703629 (owner: 10Krinkle) [08:17:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) Hi @Cmjohnson regarding mw1444, I still could not ssh to the server but I could ssh to mgmt and saw it is currently in an endless loop trying to PXE boot but... [08:19:18] (03CR) 10Kormat: [C: 03+2] ProductionServices: Clean up parsercache entries a bit. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711372 (owner: 10Kormat) [08:19:57] !log restart Aphlict to pick up c-ares security updates [08:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:05] (03Merged) 10jenkins-bot: ProductionServices: Clean up parsercache entries a bit. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/711372 (owner: 10Kormat) [08:22:07] jouncebot: now [08:22:07] No deployments scheduled for the next 2 hour(s) and 37 minute(s) [08:23:27] !log kormat@deploy1002 Synchronized wmf-config/ProductionServices.php: Minor cleanup of parsercache entries (duration: 01m 17s) [08:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:11] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:26:39] (03PS6) 10Zabe: zuul: migrate cron of zuul_repack to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) [08:28:23] (03CR) 10Zabe: zuul: migrate cron of zuul_repack to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [08:30:31] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] drafonfly: Clean up and document dragonfly classes [puppet] - 10https://gerrit.wikimedia.org/r/711168 (owner: 10JMeybohm) [08:30:46] (03CR) 10JMeybohm: [C: 03+2] Add dragonfly-peer and supernode cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/710528 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [08:31:48] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [08:32:13] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) [08:35:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) Hi again @Cmjohnson regarding mw1448 through mw1457: I see they are now in DNS but I could not ssh to them and it appears the mgmt password is not set yet t... 
[08:37:22] (03PS3) 10Labdajiwa: Set the project namespace and sitename for Javanese Wikipedia and Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710565 (https://phabricator.wikimedia.org/T287437) [08:37:51] (03PS1) 10Phuedx: Fix language treatment A/B test bucket counting [skins/Vector] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/710717 (https://phabricator.wikimedia.org/T286932) [08:38:53] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "thanks Zabe! looks good now: https://puppet-compiler.wmflabs.org/compiler1002/30548/contint1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [08:44:16] (03PS1) 10Jcrespo: dbbackups: Fix wrongly configured target host for s7 (buster) [puppet] - 10https://gerrit.wikimedia.org/r/711383 (https://phabricator.wikimedia.org/T288244) [08:44:46] (03PS2) 10Jcrespo: dbbackups: Fix wrongly configured target host for s7 (buster) [puppet] - 10https://gerrit.wikimedia.org/r/711383 (https://phabricator.wikimedia.org/T288244) [08:44:48] Hi, we need to fix a high priority bug on mobileapps (related to vandalism: T288376). We tried yesterday to make a release on the services deployment window but we had to revert because of an issue. Can we deploy in prod outside of the release schedule in the next hour? 
[08:44:49] T288376: NSFW image incorrectly included in MediaList Response - https://phabricator.wikimedia.org/T288376 [08:46:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [08:46:56] nemo-yiannis: if the original issue is fixed, I would say so [08:47:25] +1 if it impacts users and it is urgent there shouldn't be any problem [08:47:42] yeah we just merged the patch [08:47:42] (03PS3) 10Zabe: dynamicproxy: migrate cron of proxydb-bak to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711230 (https://phabricator.wikimedia.org/T273673) [08:50:00] (03CR) 10Jcrespo: dbbackups: Switch s7 backups from stretch (db1116) to buster (db1171) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710977 (https://phabricator.wikimedia.org/T288244) (owner: 10Jcrespo) [08:53:01] (03PS1) 10Jgiannelos: mobileapps: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/711385 [08:53:21] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:55:15] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:57:34] (03CR) 10Btullis: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/711249 (owner: 10Razzi) [08:57:58] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/711385 (owner: 10Jgiannelos) [08:58:58] (03PS1) 10Vgutierrez: envoyproxy: Add upstream PROXY protocol support [puppet] - 10https://gerrit.wikimedia.org/r/711386 (https://phabricator.wikimedia.org/T271421) [09:00:02] (03CR) 10Elukey: [C: 03+1] "I think it is fine, but is there a more permanent fix from SRE? 
If there is any plan I'd follow up with them too :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/711249 (owner: 10Razzi) [09:01:07] (03Merged) 10jenkins-bot: mobileapps: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/711385 (owner: 10Jgiannelos) [09:03:27] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:42] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Fix wrongly configured target host for s7 (buster) [puppet] - 10https://gerrit.wikimedia.org/r/711383 (https://phabricator.wikimedia.org/T288244) (owner: 10Jcrespo) [09:05:05] !log upgrade thanos on thanos-fe* - T288604 [09:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:13] T288604: Upgrade Thanos to 0.21.1 - https://phabricator.wikimedia.org/T288604 [09:08:33] (03PS1) 10Lucas Werkmeister (WMDE): Fix SelectQueryBuilder use in SpecialWhatLinksHere [core] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/710719 (https://phabricator.wikimedia.org/T288565) [09:08:40] jouncebot: now [09:08:40] No deployments scheduled for the next 1 hour(s) and 51 minute(s) [09:08:48] then I’ll deploy that ^ backport right away [09:09:15] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:57] nemo-yiannis: or are you still deploying that mobileapps thing and I should wait? [09:10:34] i am deploying atm [09:10:41] ok, I’ll wait [09:11:04] sounds good, thanks [09:11:42] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[09:11:43] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:51] (03PS1) 10Zabe: geoip: migrate cron of geoipupdate to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711392 (https://phabricator.wikimedia.org/T273673) [09:12:36] (03CR) 10jerkins-bot: [V: 04-1] geoip: migrate cron of geoipupdate to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711392 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:12:45] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:13:49] (03PS2) 10Zabe: geoip: migrate cron of geoipupdate to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711392 (https://phabricator.wikimedia.org/T273673) [09:14:26] (03CR) 10jerkins-bot: [V: 04-1] geoip: migrate cron of geoipupdate to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711392 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:14:41] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:15:55] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[09:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:25] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:16:53] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [09:18:02] 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 (10fgiunchedi) Hosts are ready to go now, though with swift traffic fully on codfw I don't think we should rebalance now. I see at least two options: 1. Move Swift traffic to eqiad and start rebal... [09:19:34] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [09:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:27] !log run "sudo find /var/log/airflow -type f -mtime +15 -delete" on an-airflow1001 to free space (root partition almost full) [09:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:19] (03PS1) 10Vgutierrez: envoyproxy: Add STEK configuration support [puppet] - 10https://gerrit.wikimedia.org/r/711399 (https://phabricator.wikimedia.org/T271421) [09:23:39] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:26:18] !log upgrade thanos on prometheus* - T288604 [09:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:25] T288604: Upgrade Thanos to 0.21.1 - https://phabricator.wikimedia.org/T288604 [09:27:19] (03PS1) 10Muehlenhoff: Propose a format for profile contact data [puppet] - 10https://gerrit.wikimedia.org/r/711400 (https://phabricator.wikimedia.org/T216088) [09:28:00] Lucas_WMDE: we are done with the 
release [09:28:07] cool, thanks [09:28:33] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:55] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Fix SelectQueryBuilder use in SpecialWhatLinksHere [core] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/710719 (https://phabricator.wikimedia.org/T288565) (owner: 10Lucas Werkmeister (WMDE)) [09:29:27] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:33:29] (03PS1) 10Vgutierrez: cache: Provide an envoy STEK manager script [puppet] - 10https://gerrit.wikimedia.org/r/711407 (https://phabricator.wikimedia.org/T271421) [09:34:21] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:31] (03Merged) 10jenkins-bot: Fix SelectQueryBuilder use in SpecialWhatLinksHere [core] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/710719 (https://phabricator.wikimedia.org/T288565) (owner: 10Lucas Werkmeister (WMDE)) [09:47:31] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:47:54] ^ deploying the backport now [09:49:22] (03PS1) 10Muehlenhoff: Don't limit the JDK8 hook to Buster [puppet] - 10https://gerrit.wikimedia.org/r/711413 (https://phabricator.wikimedia.org/T287960) [09:49:59] !log upgrade thanos on cloudmetrics* - T288604 [09:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:06] T288604: Upgrade Thanos to 0.21.1 - https://phabricator.wikimedia.org/T288604 [09:50:29] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.37.0-wmf.18/includes/specials/SpecialWhatLinksHere.php: Backport: [[gerrit:710719|Fix SelectQueryBuilder use in SpecialWhatLinksHere (T288565)]] (duration: 01m 08s) [09:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:36] T288565: Wikimedia\Rdbms\DBQueryError: Error 1066: Not unique table/alias: 'page' - https://phabricator.wikimedia.org/T288565 [09:51:09] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:52:29] (03CR) 10Muehlenhoff: [C: 03+2] Don't limit the JDK8 hook to Buster [puppet] - 10https://gerrit.wikimedia.org/r/711413 (https://phabricator.wikimedia.org/T287960) (owner: 10Muehlenhoff) [09:58:35] (03Abandoned) 10Phuedx: Fix language treatment A/B test bucket counting [skins/Vector] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/710717 (https://phabricator.wikimedia.org/T286932) (owner: 10Phuedx) [09:59:58] (03PS1) 10Btullis: Workaround quote escaping bug [cookbooks] - 10https://gerrit.wikimedia.org/r/711420 (https://phabricator.wikimedia.org/T288558) [10:00:30] (03CR) 10Elukey: [C: 03+1] Workaround quote escaping bug [cookbooks] - 10https://gerrit.wikimedia.org/r/711420 (https://phabricator.wikimedia.org/T288558) (owner: 10Btullis) [10:01:19] (03CR) 10Btullis: [V: 03+2 C: 03+2] Workaround quote escaping bug [cookbooks] - 
10https://gerrit.wikimedia.org/r/711420 (https://phabricator.wikimedia.org/T288558) (owner: 10Btullis) [10:02:43] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. - btullis@cumin1001 [10:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:28] 10SRE, 10Analytics, 10Patch-For-Review: Import the openjdk8 packages in Bullseye - https://phabricator.wikimedia.org/T287960 (10MoritzMuehlenhoff) OpenJDK 8 needs OpenJDK 8 to build itself, I'm currently building an initial package on my laptop to bootstrap this (and import it to component/jdk8), which will... [10:03:56] 10SRE, 10Analytics, 10Infrastructure-Foundations, 10Patch-For-Review: Import the openjdk8 packages in Bullseye - https://phabricator.wikimedia.org/T287960 (10MoritzMuehlenhoff) [10:05:08] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: add observability role_contacts [puppet] - 10https://gerrit.wikimedia.org/r/710617 (owner: 10Cwhite) [10:05:17] (03PS1) 10Phuedx: Add ad-hoc logging to tally process [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/710720 (https://phabricator.wikimedia.org/T288366) [10:05:31] 10SRE, 10MW-on-K8s, 10serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) a:03jijiki [10:05:47] (03CR) 10Filippo Giunchedi: [C: 03+1] Traffic: Add varnish prometheus exporter alert [alerts] - 10https://gerrit.wikimedia.org/r/710968 (https://phabricator.wikimedia.org/T283660) (owner: 10MMandere) [10:07:01] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:08:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_netflow_hourly.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:38] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Spicerack downtime methods fail when the admin reason includes an apostrophe - https://phabricator.wikimedia.org/T288558 (10BTullis) The patch at https://gerrit.wikimedia.org/r/711420 is only a workaround for a single cook... [10:09:01] (03CR) 10Dzahn: [C: 03+1] logstash: remove absented cron and file [puppet] - 10https://gerrit.wikimedia.org/r/711233 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:09:59] (03PS1) 10Muehlenhoff: Extend access for daniram [puppet] - 10https://gerrit.wikimedia.org/r/711421 [10:10:41] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:11:11] ACKNOWLEDGEMENT - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_netflow_hourly.service Btullis This job will need to be re-run. 
It was caused by work undertaken during: T255148 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:00] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for daniram [puppet] - 10https://gerrit.wikimedia.org/r/711421 (owner: 10Muehlenhoff) [10:12:21] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Disable Collection sidebar link on English Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711248 (https://phabricator.wikimedia.org/T288021) (owner: 10Samwilson) [10:16:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:17:05] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:18:39] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:19:52] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:20:27] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:20:27] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:37] (03CR) 10Hnowlan: restbase: set lower check_disk thresholds for instance-data volume (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711135 (https://phabricator.wikimedia.org/T191659) (owner: 10Hnowlan) [10:28:52] (03CR) 10Hnowlan: [C: 03+2] maps: disable cassandra metrics collector [puppet] - 10https://gerrit.wikimedia.org/r/710984 (https://phabricator.wikimedia.org/T186567) (owner: 10Hnowlan) [10:35:15] (03PS1) 10Giuseppe Lavagetto: mwdebug: re-add memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/711429 [10:35:22] (03CR) 10jerkins-bot: [V: 04-1] mwdebug: re-add memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/711429 (owner: 10Giuseppe Lavagetto) [10:35:38] (03PS2) 10Giuseppe Lavagetto: mwdebug: re-add memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/711429 [10:35:57] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mwdebug: re-add memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/711429 (owner: 10Giuseppe Lavagetto) [10:37:37] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:49] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:42:25] !log rolling restart of Buster-based maps services to pick up c-ares security updates [10:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:41] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:50:33] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:51:41] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:52:21] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:52:51] PROBLEM - Check systemd state on maps1004 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-metrics-collector.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:31] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:57:04] I see there's already 6 patches to be deployed during the next window. 
If there's any space, then I'd like to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/710720 [10:57:59] phuedx: I already deployed my wmf.18 backport, so that one doesn’t need to count, I just didn’t remove it yet [10:58:04] feel free to add your change and we’ll see if it fits [10:58:11] Thanks, Lucas_WMDE [10:58:12] (my four config changes should also be pretty quick) [10:58:27] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:58:49] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:28] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add ad-hoc logging to tally process [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/710720 (https://phabricator.wikimedia.org/T288366) (owner: 10Phuedx) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window (Your patch may or may not be deployed at the sole discretion of the deployer) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210811T1100). [11:00:05] samwilson, Lucas_WMDE, and phuedx: A patch you scheduled for European mid-day backport window (Your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:11] o/ [11:00:15] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:00:16] I can deploy [11:00:32] o/ [11:00:52] phuedx: do you know how long gate-and-submit usually takes for SecurePoll? [11:01:09] hm, looks like main test build succeeded in 4 minutes, so I guess it’s not necessary to +2 long in advance [11:01:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:57] Lucas_WMDE: I'm here [11:02:01] RECOVERY - Check systemd state on maps1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:04] ok, then let’s start with your change [11:02:11] Lucas_WMDE: That sounds correct [11:02:12] (03PS2) 10Lucas Werkmeister (WMDE): Disable Collection sidebar link on English Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711248 (https://phabricator.wikimedia.org/T288021) (owner: 10Samwilson) [11:02:17] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Disable Collection sidebar link on English Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711248 (https://phabricator.wikimedia.org/T288021) (owner: 10Samwilson) [11:03:12] (03Merged) 10jenkins-bot: Disable Collection sidebar link on English Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711248 (https://phabricator.wikimedia.org/T288021) (owner: 10Samwilson) [11:03:45] SamWilson[m]: your change is on mwdebug2001, can you test it? 
[11:03:59] great, testing now [11:04:23] I don’t see a difference in the sidebar but maybe it’s already hidden via site CSS [11:04:30] yep looks good [11:04:41] alright [11:04:56] it's not on all namespaces [11:05:06] but on mainspace it's now gone [11:05:11] so, good to go [11:05:46] I was looking at https://en.wikisource.org/wiki/Dangerous_Goods_(Shipping)_Regulation_2012 [11:05:56] anyways, syncing [11:06:07] (03PS1) 10Marostegui: wmnet: Add dbproxy2004 as m5-master in codfw [dns] - 10https://gerrit.wikimedia.org/r/711437 (https://phabricator.wikimedia.org/T288093) [11:06:39] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:711248|Disable Collection sidebar link on English Wikisource (T288021)]] (duration: 01m 14s) [11:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:47] T288021: Disable/remove BookMaker from enwikisource - https://phabricator.wikimedia.org/T288021 [11:07:28] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add ad-hoc logging to tally process [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/710720 (https://phabricator.wikimedia.org/T288366) (owner: 10Phuedx) [11:08:08] thanks Lucas_WMDE looks good [11:08:13] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:10:03] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:10:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_netflow_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:11:33] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:41] (03Merged) 10jenkins-bot: Add ad-hoc logging to tally process [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/710720 (https://phabricator.wikimedia.org/T288366) (owner: 10Phuedx) [11:13:05] phuedx: would you like to test the SecurePoll backport on mwdebug or should I just sync it directly? [11:13:11] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:13:19] seems to me like something that could be synced directly [11:14:41] Lucas_WMDE: Synced directly please. Some of those log lines will be produced by a job. Can that scenario be tested via x-wikimedia-debug? [11:14:53] I don’t know [11:15:04] I want to say “probably not” but I’ve been wrong before on what x-wikimedia-debug can test [11:15:16] I think it was urbanec.m who corrected me [11:15:20] but I’ll just sync this [11:16:11] OK. 
I'll keep an eye on Logstash generally and will be checking for those loglines in various tests most of today [11:16:16] (single sync-file for the whole directory since the files can be synced in any order afaict) [11:16:18] ok [11:17:03] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/SecurePoll/: Backport: [[gerrit:710720|Add ad-hoc logging to tally process (T288366)]] (duration: 01m 09s) [11:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:11] T288366: Add more logging to determine what happens to jobs in the wild - https://phabricator.wikimedia.org/T288366 [11:17:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. - btullis@cumin1001 [11:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:40] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:47] (03PS2) 10Lucas Werkmeister (WMDE): Stop setting $wgWBRepoSettings['entityNamespaces'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711138 (https://phabricator.wikimedia.org/T257260) [11:17:51] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Stop setting $wgWBRepoSettings['entityNamespaces'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711138 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:18:59] (03Merged) 10jenkins-bot: Stop setting $wgWBRepoSettings['entityNamespaces'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711138 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:19:15] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:19] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:21:11] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:21:32] (03PS2) 10Lucas Werkmeister (WMDE): Remove $wmgWikibaseRepoEntityNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711139 (https://phabricator.wikimedia.org/T257260) [11:22:29] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:711138|Stop setting $wgWBRepoSettings['entityNamespaces'] (T257260)]] (duration: 01m 08s) [11:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:37] T257260: entitysources: Clean up any remainders of the legacy back/compat config in the mediawiki-config repository - https://phabricator.wikimedia.org/T257260 [11:22:47] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove $wmgWikibaseRepoEntityNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711139 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:23:39] (03Merged) 10jenkins-bot: Remove $wmgWikibaseRepoEntityNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711139 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:24:21] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:24:47] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:39] (03PS1) 10Zabe: labstore: migrate cron of archive_export_d to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711448 (https://phabricator.wikimedia.org/T273673) [11:25:44] !log 
lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:711139|Remove $wmgWikibaseRepoEntityNamespaces (T257260)]] (duration: 01m 08s) [11:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:08] (03PS2) 10Lucas Werkmeister (WMDE): Stop setting $wgWBClientSettings['entityNamespaces'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711140 (https://phabricator.wikimedia.org/T257260) [11:26:14] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Stop setting $wgWBClientSettings['entityNamespaces'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711140 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:27:00] (03Merged) 10jenkins-bot: Stop setting $wgWBClientSettings['entityNamespaces'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711140 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:28:53] (03CR) 10Ladsgroup: [C: 04-1] dynamicproxy: migrate cron of proxydb-bak to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711230 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:29:17] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:711140|Stop setting $wgWBClientSettings['entityNamespaces'] (T257260)]] (duration: 01m 07s) [11:29:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:25] T257260: entitysources: Clean up any remainders of the legacy back/compat config in the mediawiki-config repository - https://phabricator.wikimedia.org/T257260 [11:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:42] (03CR) 10Ladsgroup: zuul: migrate cron of zuul_repack to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:29:58] (03PS2) 10Lucas Werkmeister (WMDE): Remove $wmgWikibaseClientEntityNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711141 (https://phabricator.wikimedia.org/T257260) [11:30:03] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove $wmgWikibaseClientEntityNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711141 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:30:15] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:00] (03Merged) 10jenkins-bot: Remove $wmgWikibaseClientEntityNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711141 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:31:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:41] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:711141|Remove $wmgWikibaseClientEntityNamespaces (T257260)]] (duration: 01m 08s) [11:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:10] !log EU backport+config window done [11:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:27] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:36:10] (03PS7) 10Zabe: zuul: migrate cron of zuul_repack to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) [11:37:02] (03CR) 10Zabe: zuul: migrate cron of zuul_repack to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:37:03] (03PS1) 10JMeybohm: kubernetes:node: Fix disk-type annotation [puppet] - 10https://gerrit.wikimedia.org/r/711455 (https://phabricator.wikimedia.org/T288345) [11:37:45] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:40:54] Thanks, Lucas_WMDE! [11:41:02] np [11:41:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:50] (03PS4) 10Zabe: dynamicproxy: migrate cron of proxydb-bak to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711230 (https://phabricator.wikimedia.org/T273673) [11:43:17] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:43:17] (03CR) 10Zabe: dynamicproxy: migrate cron of proxydb-bak to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711230 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:48:44] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 12 NOOP 38): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30549/console" [puppet] - 10https://gerrit.wikimedia.org/r/711455 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [11:51:26] (03PS2) 10Urbanecm: Add growthexperiments to allowed logtypes [puppet] - 10https://gerrit.wikimedia.org/r/636436 (https://phabricator.wikimedia.org/T266477) [11:53:43] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:53:57] (03PS1) 10Btullis: Switch the second of the zookeeper nodes [puppet] - 10https://gerrit.wikimedia.org/r/711458 (https://phabricator.wikimedia.org/T255148) [11:54:51] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] "This still produced wrong results for kubernetes[12]017 because their SSD model names no longer include the string "ssd". 
As kubelet won't" [puppet] - 10https://gerrit.wikimedia.org/r/711455 (https://phabricator.wikimedia.org/T288345) (owner: 10JMeybohm) [11:55:33] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:56:13] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:57:19] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/711459 [11:57:49] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:03] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:58:59] (03CR) 10Btullis: [C: 03+2] Switch the second of the zookeeper nodes [puppet] - 10https://gerrit.wikimedia.org/r/711458 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [12:01:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:47] PROBLEM - Zookeeper Server on druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [12:08:24] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:14:09] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:10] (03PS3) 10Filippo Giunchedi: prometheus: tweak external url to reflect reality [puppet] - 10https://gerrit.wikimedia.org/r/709032 (https://phabricator.wikimedia.org/T284213) [12:15:12] (03PS1) 10Filippo Giunchedi: thanos: add label-drop to rule [puppet] - 10https://gerrit.wikimedia.org/r/711467 (https://phabricator.wikimedia.org/T287142) [12:15:47] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:16:16] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. 
- btullis@cumin1001 [12:16:20] !log installing c-ares security updates on stretch [12:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:46] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: add label-drop to rule [puppet] - 10https://gerrit.wikimedia.org/r/711467 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [12:18:21] (03PS4) 10Filippo Giunchedi: prometheus: tweak external url to reflect reality [puppet] - 10https://gerrit.wikimedia.org/r/709032 (https://phabricator.wikimedia.org/T284213) [12:19:49] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:08] (03CR) 10Kormat: [C: 03+1] wmnet: Add dbproxy2004 as m5-master in codfw [dns] - 10https://gerrit.wikimedia.org/r/711437 (https://phabricator.wikimedia.org/T288093) (owner: 10Marostegui) [12:24:33] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: tweak external url to reflect reality [puppet] - 10https://gerrit.wikimedia.org/r/709032 (https://phabricator.wikimedia.org/T284213) (owner: 10Filippo Giunchedi) [12:30:14] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale-full only: 1 (doc1001), Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:32:41] !log roll-restart prometheus T284213 [12:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:48] uughh that roll restart went faster than I thought [12:32:48] T284213: Improve AlertManager dashboard - https://phabricator.wikimedia.org/T284213 [12:34:08] I apologise, prometheus is recovering soon [12:35:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_wikifeeds_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:36:40] turns out, swapping cumin -s / -b arguments does not yield the expected result [12:36:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:37:42] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:40:14] (03CR) 10Ladsgroup: [C: 03+1] labstore: migrate cron of archive_export_d to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711448 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:41:40] (03CR) 10Ladsgroup: [C: 04-1] dumps: migrate cron of dumps-exception-checker to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:41:52] (03CR) 10Ladsgroup: [C: 03+1] zuul: migrate cron of zuul_repack to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:42:10] (03CR) 10Ladsgroup: [C: 03+1] dynamicproxy: migrate cron of proxydb-bak to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711230 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:45:18] !log imported openjdk-8 8u302-b08-1~deb10u1 to component/jdk8 for buster-wikimedia (forward port of the latest Java 8 security release) [12:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:49:54] zabe: hi, thanks for the cron patches, do you think you can test them with puppet compiler? 
You need to figure out what hosts the patch is going to affect and then add it as "Hosts: foo1001.eqiad.wmnet" at the bottom (like "Bug:") and then run "check experimental" [12:50:38] for multiple hosts you can do "Hosts: foo1001.eqiad.wmnet, foo2001.codfw.wmnet" [12:54:31] yes, I can try doing that [12:55:10] Thanks [12:56:41] (03PS8) 10Zabe: zuul: migrate cron of zuul_repack to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) [12:57:17] (03CR) 10jerkins-bot: [V: 04-1] zuul: migrate cron of zuul_repack to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:58:58] (03PS9) 10Zabe: zuul: migrate cron of zuul_repack to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) [13:00:29] (03CR) 10Zabe: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:01:44] (03CR) 10Ladsgroup: [C: 03+1] "PCC looks okay: https://puppet-compiler.wmflabs.org/compiler1001/874/contint1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:02:48] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:09:00] (03PS2) 10Zabe: labstore: migrate cron of archive_export_d to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711448 (https://phabricator.wikimedia.org/T273673) [13:10:01] (03CR) 10Zabe: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/711448 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:12:57] (03PS3) 10Zabe: labstore: migrate cron of archive_export_d to systemd timer [puppet] - 
10https://gerrit.wikimedia.org/r/711448 (https://phabricator.wikimedia.org/T273673) [13:16:53] (03CR) 10Marostegui: [C: 03+2] wmnet: Add dbproxy2004 as m5-master in codfw [dns] - 10https://gerrit.wikimedia.org/r/711437 (https://phabricator.wikimedia.org/T288093) (owner: 10Marostegui) [13:17:48] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:17:56] (03PS3) 10Zabe: dumps: migrate cron of dumps-exception-checker to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) [13:18:50] (03CR) 10Zabe: dumps: migrate cron of dumps-exception-checker to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:21:06] (03PS4) 10Zabe: dumps: migrate cron of dumps-exception-checker to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) [13:21:49] (03CR) 10Zabe: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:24:39] (03PS1) 10Btullis: Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) [13:25:44] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [13:27:08] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:27:33] (03CR) 10Zabe: dumps: migrate cron of 
dumps-exception-checker to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:29:35] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. - btullis@cumin1001 [13:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:42] (03PS1) 10Kormat: mariadb: If semi-sync is enabled, always config master settings. [puppet] - 10https://gerrit.wikimedia.org/r/711489 (https://phabricator.wikimedia.org/T288500) [13:30:15] (03PS4) 10Zabe: labstore: migrate cron of archive_export_d to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711448 (https://phabricator.wikimedia.org/T273673) [13:31:17] (03CR) 10Zabe: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/711448 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:32:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:32:08] (03PS1) 10Muehlenhoff: Extend hadoop-test Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/711490 [13:32:40] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:33:12] !log installing Java 8/Java 11 security updates on various analytics hosts [13:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:35:38] (03PS1) 10Ema: cloud: remove legacy hiera traffic-cache attribute [puppet] - 10https://gerrit.wikimedia.org/r/711491 [13:37:26] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [13:37:44] (03PS2) 10Btullis: Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) [13:39:00] (03PS5) 10Zabe: labstore: migrate cron of archive_export_d to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711448 (https://phabricator.wikimedia.org/T273673) [13:39:06] (03CR) 10Ema: [C: 03+2] cloud: remove legacy hiera traffic-cache attribute [puppet] - 10https://gerrit.wikimedia.org/r/711491 (owner: 10Ema) [13:39:10] (03CR) 10Zabe: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/711448 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:41:23] (03CR) 10Zabe: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/878/cloudstore1008.wikimedia.org/index.html and https://puppet-compiler.wmflabs.org/" [puppet] - 10https://gerrit.wikimedia.org/r/711448 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:43:06] RECOVERY - ensure kvm processes are running on cloudvirt1038 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:46:24] (03CR) 10ArielGlenn: dumps: migrate cron of dumps-exception-checker to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:48:51] (03PS3) 10Btullis: Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 
(https://phabricator.wikimedia.org/T268985) [13:52:06] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:52:33] (03PS1) 10Btullis: Migrate the third zookeeper server in the druid cluster [puppet] - 10https://gerrit.wikimedia.org/r/711497 (https://phabricator.wikimedia.org/T255148) [13:53:56] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:54:08] (03PS1) 10Ema: pontoon: kafka configuration for traffic stack [puppet] - 10https://gerrit.wikimedia.org/r/711498 [13:54:46] (03PS1) 10David Caro: wmcs.ceph: add cloudcephosd1018 as osd [puppet] - 10https://gerrit.wikimedia.org/r/711499 (https://phabricator.wikimedia.org/T285858) [13:54:58] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:56:22] (03PS2) 10David Caro: wmcs.ceph: add cloudcephosd1018 as osd [puppet] - 10https://gerrit.wikimedia.org/r/711499 (https://phabricator.wikimedia.org/T285858) [13:56:37] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/711499 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [13:56:48] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:58:44] (03PS3) 10David Caro: wmcs.ceph: add cloudcephosd1018 as osd [puppet] - 10https://gerrit.wikimedia.org/r/711499 (https://phabricator.wikimedia.org/T285858) [13:59:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 
(10Andrew) I now have a canary VM running on this host but it is not actually in the scheduling pool yet. We'll see how it does! [14:05:58] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:07:50] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:09:16] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/711123 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [14:09:44] (03CR) 10Ema: [C: 03+2] pontoon: kafka configuration for traffic stack [puppet] - 10https://gerrit.wikimedia.org/r/711498 (owner: 10Ema) [14:19:26] (03PS4) 10Btullis: Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) [14:20:24] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30552/console" [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [14:20:38] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:21:37] (03PS1) 10Giuseppe Lavagetto: deploy-mwdebug: introduce interactive mode [puppet] - 10https://gerrit.wikimedia.org/r/711506 [14:21:46] !log disabled cassandra-metrics-collector on maps* [14:21:48] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:50] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30553/console" [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [14:23:04] (03CR) 10jerkins-bot: [V: 04-1] deploy-mwdebug: introduce interactive mode [puppet] - 10https://gerrit.wikimedia.org/r/711506 (owner: 10Giuseppe Lavagetto) [14:23:18] !log installing mx2002 T286911 [14:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:25] T286911: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 [14:23:42] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:26:38] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:27:12] (03PS1) 10Filippo Giunchedi: pontoon: move hiera files to 'settings' [puppet] - 10https://gerrit.wikimedia.org/r/711507 [14:28:20] (03CR) 10Effie Mouzeli: [C: 03+1] "ignoring jenkin's failure" [puppet] - 10https://gerrit.wikimedia.org/r/711506 (owner: 10Giuseppe Lavagetto) [14:28:30] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:28:45] (03CR) 10Filippo Giunchedi: "I _think_ we'd need to rebase the pontoon branches ~soonish and push after this change is merged" [puppet] - 10https://gerrit.wikimedia.org/r/711507 (owner: 10Filippo Giunchedi) [14:31:18] PROBLEM - OSPF status on cr2-esams is 
CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:33:12] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:36:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:41:42] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:42:20] (03CR) 10Btullis: Extend hadoop-test Cumin alias (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711490 (owner: 10Muehlenhoff) [14:42:49] (03CR) 10Btullis: [C: 03+2] Migrate the third zookeeper server in the druid cluster [puppet] - 10https://gerrit.wikimedia.org/r/711497 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [14:43:18] (03PS1) 10Lucas Werkmeister (WMDE): Stop setting 'useTermsTableSearchFields' Wikibase option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711512 (https://phabricator.wikimedia.org/T288612) [14:43:20] (03PS1) 10Lucas Werkmeister (WMDE): Remove $wmgWikibaseClientUseTermsTableSearchFields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711513 (https://phabricator.wikimedia.org/T288612) [14:43:23] (03PS1) 10Lucas Werkmeister (WMDE): Stop setting $wgWBClientSettings['fineGrainedLuaTracking'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711514 (https://phabricator.wikimedia.org/T288612) [14:43:26] (03PS1) 10Lucas Werkmeister (WMDE): Remove $wmgWikibaseFineGrainedLuaTracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711515 (https://phabricator.wikimedia.org/T288612) [14:43:54] (03PS3) 10Razzi: Workaround quote 
escaping bug [cookbooks] - 10https://gerrit.wikimedia.org/r/711249 [14:43:57] !log depool bast4002.wikimedia.org - T288579 [14:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:06] T288579: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 [14:44:24] !log s/depool/decommission bast4002.wikimedia.org - T288579 [14:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:31] better [14:44:51] !log sukhe@cumin1001 START - Cookbook sre.hosts.decommission for hosts bast4002.wikimedia.org [14:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:57] (03CR) 10Razzi: Workaround quote escaping bug (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/711249 (owner: 10Razzi) [14:46:11] (03PS2) 10Muehlenhoff: Extend hadoop-test Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/711490 [14:46:49] (03CR) 10Muehlenhoff: Extend hadoop-test Cumin alias (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711490 (owner: 10Muehlenhoff) [14:50:08] PROBLEM - Zookeeper Server on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [14:53:48] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:55:24] !log sukhe@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts bast4002.wikimedia.org [14:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:39] sigh! 
[14:55:48] !log sukhe@cumin1001 START - Cookbook sre.hosts.decommission for hosts bast4002.wikimedia.org [14:55:50] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:00] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. - btullis@cumin1001 [14:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:02] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:02:03] !log sukhe@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts bast4002.wikimedia.org [15:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:09] 10SRE, 10Traffic, 10decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1001 for hosts: `bast4002.wikimedia.org` - bast4002.wikimedia.org (**FAIL**) - **Host steps raised except... [15:04:20] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/711123 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [15:05:14] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:09:00] (03CR) 10Cwhite: [C: 03+2] logstash: remove absented cron and file [puppet] - 10https://gerrit.wikimedia.org/r/711233 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [15:09:12] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30554/console" [puppet] - 10https://gerrit.wikimedia.org/r/710985 (https://phabricator.wikimedia.org/T186567) (owner: 10Hnowlan) [15:12:19] !log import openjdk-8 8u302-b08-1+wmf1 to bullseye-wikimedia (bootstrap build, not to be used yet) T287960 [15:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:27] T287960: Import the openjdk8 packages in Bullseye - https://phabricator.wikimedia.org/T287960 [15:14:18] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:19:00] (03CR) 10Herron: [C: 03+1] Apply MX role to mx2002 [puppet] - 10https://gerrit.wikimedia.org/r/711123 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [15:23:42] (03CR) 10Eevans: [C: 03+1] restbase: set lower check_disk thresholds for instance-data volume (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711135 (https://phabricator.wikimedia.org/T191659) (owner: 10Hnowlan) [15:25:34] (03PS1) 10Filippo Giunchedi: pontoon: add config command [puppet] - 10https://gerrit.wikimedia.org/r/711543 [15:26:03] (03CR) 10jerkins-bot: [V: 04-1] pontoon: add config command [puppet] - 10https://gerrit.wikimedia.org/r/711543 (owner: 10Filippo Giunchedi) [15:26:17] (03CR) 10Filippo Giunchedi: "Any/all feedback is welcome, especially around naming and command description, which I'm sure can be improved." 
[puppet] - 10https://gerrit.wikimedia.org/r/711543 (owner: 10Filippo Giunchedi) [15:27:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:27:37] 10SRE, 10Traffic, 10decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (10ssingh) a:03Jclark-ctr [15:28:21] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:29:34] (03PS10) 10Jcrespo: mediabackup: Puppetize the media backup workers [puppet] - 10https://gerrit.wikimedia.org/r/711153 (https://phabricator.wikimedia.org/T276442) [15:29:51] (03PS2) 10Filippo Giunchedi: pontoon: add config command [puppet] - 10https://gerrit.wikimedia.org/r/711543 [15:29:59] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:31:08] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:31:31] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Puppetize the media backup workers [puppet] - 10https://gerrit.wikimedia.org/r/711153 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [15:33:06] (03CR) 10Michael DiPietro: maintain-dbusers: delete users that are removed from ldap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711234 (https://phabricator.wikimedia.org/T285332) (owner: 10Bstorm) [15:37:44] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Restarting to pick up Java security updates - hnowlan@cumin1001 [15:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:07] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:40:19] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10dancy) [15:40:41] 10SRE, 10MW-on-K8s, 10serviceops: Add conditional to mediawiki-config for stuff running on kubernetes - https://phabricator.wikimedia.org/T284418 (10dancy) 05Open→03Resolved [15:43:08] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:45:46] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team: Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10dancy) [15:45:59] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team: Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10dancy) 
p:05Triage→03Medium [15:46:11] (03PS1) 10MSantos: maps: restore tilerator cpu ratio to 0.3 [puppet] - 10https://gerrit.wikimedia.org/r/711554 [15:46:13] (03PS1) 10MSantos: maps: bump kartotherian PG query timeout [puppet] - 10https://gerrit.wikimedia.org/r/711555 [15:54:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10jijiki) Thank you all for this work! [16:00:11] dancy: are your ready to scap some things? [16:00:24] jouncebot: next [16:00:24] In 1 hour(s) and 59 minute(s): Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210811T1800) [16:00:24] In 1 hour(s) and 59 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210811T1800) [16:02:07] (03CR) 10Jgiannelos: "Are you skipping maps1008 on purpose?" [puppet] - 10https://gerrit.wikimedia.org/r/711554 (owner: 10MSantos) [16:02:48] (03CR) 10Jgiannelos: maps: restore tilerator cpu ratio to 0.3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711554 (owner: 10MSantos) [16:04:11] (03CR) 10Hnowlan: [C: 03+1] "Will this need to be manually raised outside of this file? I can merge this change." [puppet] - 10https://gerrit.wikimedia.org/r/711555 (owner: 10MSantos) [16:04:13] * legoktm is around too [16:04:27] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:04:58] (03CR) 10Hnowlan: "Should we change the default globally rather than do this? 
I have a CR to harmonise the configurations that would assist with this also" [puppet] - 10https://gerrit.wikimedia.org/r/711554 (owner: 10MSantos) [16:05:03] !log dancy@deploy1002 Synchronized README: Testing scap php-rpm rolling restart (before) (duration: 01m 19s) [16:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:56] (03PS1) 10Ema: pontoon: move traffic stack kafka settings to separate file [puppet] - 10https://gerrit.wikimedia.org/r/711558 [16:08:05] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:08:38] (03PS2) 10Ema: pontoon: move traffic stack kafka settings to separate file [puppet] - 10https://gerrit.wikimedia.org/r/711558 [16:09:11] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [16:10:18] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. 
- btullis@cumin1001 [16:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:15] (03PS1) 10Jcrespo: mediabackups: Add python3 dependencies and misc changes for workers [puppet] - 10https://gerrit.wikimedia.org/r/711564 (https://phabricator.wikimedia.org/T276442) [16:13:47] (03CR) 10jerkins-bot: [V: 04-1] mediabackups: Add python3 dependencies and misc changes for workers [puppet] - 10https://gerrit.wikimedia.org/r/711564 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [16:15:16] (03PS2) 10Jcrespo: mediabackups: Add python3 dependencies and misc changes for workers [puppet] - 10https://gerrit.wikimedia.org/r/711564 (https://phabricator.wikimedia.org/T276442) [16:16:28] !log moment of truth for php-fpm-always-restart in scap [16:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:07] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:17:15] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Add python3 dependencies and misc changes for workers [puppet] - 10https://gerrit.wikimedia.org/r/711564 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [16:18:07] (03PS2) 10MSantos: maps: restore tilerator cpu ratio to 0.3 [puppet] - 10https://gerrit.wikimedia.org/r/711554 [16:18:17] btullis: glad that workaround worked :) sorry you had to do it [16:19:20] !log dancy@deploy1002 Synchronized README: Testing scap php-rpm rolling restart (after) (duration: 03m 12s) [16:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:20:51] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 
3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:22:35] !log Results of testing php_fpm_always_restart: php_fpm_always_restart=false: 1m19.942s php_fpm_always_restart=true: 3m12.836s [16:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:49] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [16:24:13] no apparent increase in 5xx errors [16:24:51] strange, it fits a latency issues I am experiencing with gerrit [16:25:06] maybe I am part of a ISP connectivity issue? [16:25:12] (03PS1) 10BBlack: wikidough check: example authdns part [dns] - 10https://gerrit.wikimedia.org/r/711577 [16:25:14] (03PS1) 10Jcrespo: mediabackups: Hide diffs from logs on sensitive files and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/711576 (https://phabricator.wikimedia.org/T276442) [16:25:17] (03PS1) 10BBlack: checkdoh: disable ECS for check subdomain [puppet] - 10https://gerrit.wikimedia.org/r/711578 [16:25:18] this is interesting https://grafana-rw.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=20&orgId=1&from=1628697268520&to=1628699068520&forceLogin=true&var-datasource=codfw%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200 [16:25:58] dancy: those two spikes in MW 5xx rate correspond with the two tests you ran [16:25:59] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:26:06] :-( [16:26:11] (03PS1) 10Elukey: kubeflow: add workaround for TLS validation in storage-initializer [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/711579 (https://phabricator.wikimedia.org/T272919) [16:26:18] (03CR) 
10jerkins-bot: [V: 04-1] mediabackups: Hide diffs from logs on sensitive files and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/711576 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [16:26:32] not sure what to make of it exactly, especially since there was a spike on the "before" [16:26:41] oh, I spoke too soon :/ [16:26:44] checking older deployments now, maybe it's one of those "it's always been like this" situations [16:26:58] (03CR) 10jerkins-bot: [V: 04-1] wikidough check: example authdns part [dns] - 10https://gerrit.wikimedia.org/r/711577 (owner: 10BBlack) [16:27:43] (03PS2) 10Elukey: kubeflow: add workaround for TLS validation in storage-initializer [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/711579 (https://phabricator.wikimedia.org/T272919) [16:27:47] I'm also curious if that difference in peak size between "before" and "after" is a real effect or just a sampling issue, since we only have data points every 30 seconds (and "after" took longer, as intended) [16:27:58] (03PS1) 10Dave Pifke: profiler: use seperate pipeline inside k8s pods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711580 (https://phabricator.wikimedia.org/T288165) [16:27:59] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:28:13] the before should represent the status quo, right? 
[16:28:18] yes [16:28:34] yeah [16:28:43] (03CR) 10Elukey: "A little ashamed by this change but I can't find another solution :(" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/711579 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [16:29:00] (03PS3) 10Dave Pifke: arclamp: add temporary excimer-k8s pipeline [puppet] - 10https://gerrit.wikimedia.org/r/711166 (https://phabricator.wikimedia.org/T288165) [16:29:07] (03CR) 10jerkins-bot: [V: 04-1] profiler: use seperate pipeline inside k8s pods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711580 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke) [16:29:42] (03PS3) 10Elukey: kubeflow: add workaround for TLS validation in storage-initializer [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/711579 (https://phabricator.wikimedia.org/T272919) [16:29:53] (so, to be clear, definitely not pinning the blame on php-fpm-always-restart yet) [16:30:06] (03PS2) 10Jcrespo: mediabackups: Hide diffs from logs on sensitive files and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/711576 (https://phabricator.wikimedia.org/T276442) [16:30:06] rzl: that is expected [16:30:25] oh, really? [16:30:40] (03PS2) 10BBlack: wikidough check: example authdns part [dns] - 10https://gerrit.wikimedia.org/r/711577 [16:30:40] rzl: there were discussions before, where, if I remember correctly, we knew that this is the hit we are taking [16:30:50] when bluntly restarting the whole cluster [16:31:36] got it [16:31:37] this is only mw app servers, and wouldn't affect other wm apache servers, right? 
[16:31:47] (I am trying to see if other issue is related) [16:31:52] it affects all mediawiki clusters [16:31:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:31:56] so that includes parsoid [16:32:04] jynus: this is all mediawiki clusters but not gerrit, if that's what you're looking at [16:32:19] rzl: I think it is in a task somewhere, I will look for it at some point [16:32:21] yes, rzl thank you, that what I wanted to know [16:32:38] so unrelated [16:33:29] effie: cool, no worries - I wasn't in that conversation but I trust the folks who were, I just didn't realize we'd already expected that and made a decision about it [16:34:12] it is not much about decision rather than, there are no easier ways around it [16:34:53] if we wanted to restart php-fpm every time that is [16:35:19] I cant say for sure it has been decided, but it has surely been discussed, and it is expected [16:36:42] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Hide diffs from logs on sensitive files and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/711576 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [16:36:56] I'm curious about the kinds of requests that are 500ing. Appservers vs jobrunners — jobrunners will retry these long-running requests, whereas folks will notice the appservers. [16:37:19] (is my possibly erroneous understanding) [16:38:08] thcipriani: given how much we are using our cache [16:38:53] and what percentage of our total traffic hits a mediawiki server [16:39:02] it is not as bad as it looks I would say [16:39:07] :) [16:39:31] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T287838 (10Cmjohnson) just tried these steps and the disk is not being seen. 
I need to reseat the disk and try again [16:40:02] this fits with a point Kri.nkle once made to me that if we lost editing in Belgium we may not even notice :P [16:40:35] tl;dr: our traffic is hard [16:41:08] 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Cmjohnson) 05Open→03Resolved I am going to resolve this task because the relocation is complete. [16:41:48] yeah, and it would be extremely cool if we had some kind of Objectives for our Service Levels :P [16:42:01] crazy talk [16:42:02] I've heard that some cool folks are thinking about that [16:42:47] dancy: sorry if I panicked you, turns out I was just uninformed [16:43:42] I'm glad it's all settled out. So are we all cool on pulling the trigger and setting php_fpm_always_restart to true? [16:44:03] 10SRE, 10ops-eqiad, 10DC-Ops: Update Documentation for dl360 Motherboard Swap - https://phabricator.wikimedia.org/T254272 (10Cmjohnson) 05Open→03Declined I am declining this task, we do not have many of these servers left in production. [16:44:25] thcipriani: and ftr that graph I linked was just appservers but you can switch clusters with the dropdown at the top -- it looks like we noticed the "after" spike across everything but at different amplitudes -- I'm still not convinced that isn't just sampling error [16:44:27] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install new linecards into routers - https://phabricator.wikimedia.org/T277339 (10Cmjohnson) @ayounsi can I do this today? [16:45:26] dancy: I will discuss it with joe tomorrow again and get back to you for the final yes/no [16:45:34] rzl: ah, interesting, thanks [16:45:37] Great! Thanks everyone. 
[16:45:39] long term I'd love to run the same test five or ten times so we can get better statistics on what the impact actually is, but we don't need to cause that impact on purpose, we could just measure it from the next few deployments [16:46:10] (and/or we can probably get finer-grained data via logs anyway) [16:46:48] +1 re:logs [16:48:44] I'm excited since this has the potential to solve a few long-standing problems: opcache bit flips that have never been solved and complicated syncing/having to think about file order when syncing (which has caused a handful of problems in the past). One command to sync everything would make deployment training a little less terrifying for the uninitiated :) [16:49:11] not *not* terrifying, just less so [16:49:25] yeah totally -- and on k8s we'll be doing the equivalent of this *anyway* so it's nice to have an incremental step in that direction [16:49:26] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T287838 (10RobH) >>! In T287838#7275903, @Cmjohnson wrote: > just tried these steps and the disk is not being seen. I need to reseat the disk and try again Updated directions to list ch... [16:49:31] +1 [16:50:02] ideally we can do something smart about letting a php-fpm instance drain its traffic gracefully before restarting it, which should cut down on the error rate, but that might be enough effort that it's a post-k8s thing [16:50:23] ^ this was why I was thinking about jobrunners [16:50:31] yeah [16:50:57] since they're doing long running tasks that get retried, we don't want to wait for those to drain, but appservers maybe that makes sense ¯\_(ツ)_/¯ [16:54:24] (03CR) 10Bstorm: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/711448 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [17:01:35] (03PS1) 10Jforrester: ContribsPager row filtering with RevisionStore::isRevisionRow [core] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/710722 (https://phabricator.wikimedia.org/T288563) [17:03:17] jeena: The patch for the train from Platform is ready; should I merge and deploy it so we can see test wikis are fixed? [17:04:23] James_F: do you mean to roll train forward to group0? [17:05:18] If you like I don't see why not. I'm about to head into a meeting but you can ping me if you need me to do anything [17:05:30] jeena: I meant getting the patch fixed on test wikis first. :-) [17:05:55] oh sorry what was I thinking :P [17:05:56] (03CR) 10Bstorm: maintain-dbusers: delete users that are removed from ldap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711234 (https://phabricator.wikimedia.org/T285332) (owner: 10Bstorm) [17:05:59] :-D [17:05:59] of course that is good too [17:06:05] Cool, will do that now. [17:07:27] (03CR) 10Jforrester: [C: 03+2] ContribsPager row filtering with RevisionStore::isRevisionRow [core] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/710722 (https://phabricator.wikimedia.org/T288563) (owner: 10Jforrester) [17:07:31] (03CR) 10Bstorm: maintain-dbusers: delete users that are removed from ldap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711234 (https://phabricator.wikimedia.org/T285332) (owner: 10Bstorm) [17:21:05] (03CR) 10Michael DiPietro: [C: 03+1] maintain-dbusers: delete users that are removed from ldap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711234 (https://phabricator.wikimedia.org/T285332) (owner: 10Bstorm) [17:21:57] 10SRE, 10Services, 10Toolhub, 10serviceops, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10bd808) [17:26:53] (03CR) 10Andrew Bogott: [C: 03+1] "This seems entirely straightforward. 
One thought about performance inline." [puppet] - 10https://gerrit.wikimedia.org/r/711234 (https://phabricator.wikimedia.org/T285332) (owner: 10Bstorm) [17:27:43] (03Merged) 10jenkins-bot: ContribsPager row filtering with RevisionStore::isRevisionRow [core] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/710722 (https://phabricator.wikimedia.org/T288563) (owner: 10Jforrester) [17:28:16] Finally. [17:29:12] (03CR) 10Andrew Bogott: [C: 03+1] "I'm ready for this when you are!" [puppet] - 10https://gerrit.wikimedia.org/r/711499 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [17:29:14] (03PS8) 10Jgiannelos: tegola: Add cronjob for tiles pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/701938 [17:32:12] (03CR) 10Bstorm: maintain-dbusers: delete users that are removed from ldap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711234 (https://phabricator.wikimedia.org/T285332) (owner: 10Bstorm) [17:32:31] !log jforrester@deploy1002 Synchronized php-1.37.0-wmf.18/includes/Revision/RevisionStore.php: T288563 Don't explode Special:Contributions on extension-formatted rows (1/3) (duration: 01m 09s) [17:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:40] T288563: TypeError: Argument 1 passed to MediaWiki\Revision\RevisionStore::newRevisionFromRowAndSlots() must be an instance of stdClass - https://phabricator.wikimedia.org/T288563 [17:33:12] (03CR) 10Andrew Bogott: "This seems good, although I will want someone to keep an eye out and make sure it keeps backing up after the merge. 
Not sure I can volunt" [puppet] - 10https://gerrit.wikimedia.org/r/711230 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [17:34:01] !log jforrester@deploy1002 Synchronized php-1.37.0-wmf.18/includes/Revision/RevisionFactory.php: T288563 Don't explode Special:Contributions on extension-formatted rows (2/3) (duration: 01m 08s) [17:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:08] (03PS2) 10Bstorm: maintain-dbusers: delete users that are removed from ldap [puppet] - 10https://gerrit.wikimedia.org/r/711234 (https://phabricator.wikimedia.org/T285332) [17:35:11] (03CR) 10Andrew Bogott: [C: 03+1] maintain-dbusers: delete users that are removed from ldap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711234 (https://phabricator.wikimedia.org/T285332) (owner: 10Bstorm) [17:35:42] !log jforrester@deploy1002 Synchronized php-1.37.0-wmf.18/includes/specials/pagers/ContribsPager.php: T288563 Don't explode Special:Contributions on extension-formatted rows (3/3) (duration: 01m 06s) [17:35:42] (03CR) 10Andrew Bogott: [C: 03+1] maintain-dbusers: delete users that are removed from ldap [puppet] - 10https://gerrit.wikimedia.org/r/711234 (https://phabricator.wikimedia.org/T285332) (owner: 10Bstorm) [17:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:13] (03CR) 10Bstorm: maintain-dbusers: delete users that are removed from ldap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711234 (https://phabricator.wikimedia.org/T285332) (owner: 10Bstorm) [17:37:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:57] (03CR) 10Bstorm: "I figure after I watch this (with the dryrun arg set/existing) I'll know pretty quickly if it should be reverted." [puppet] - 10https://gerrit.wikimedia.org/r/711234 (https://phabricator.wikimedia.org/T285332) (owner: 10Bstorm) [17:38:39] jeena: Train should now be unblocked and safe to roll to group0 and then group1. [17:39:22] (03PS1) 10Zabe: labstore: remove absented archive_export_d cron [puppet] - 10https://gerrit.wikimedia.org/r/711623 (https://phabricator.wikimedia.org/T273673) [17:44:53] (03PS1) 10Legoktm: Add tokens and users for toolhub service [puppet] - 10https://gerrit.wikimedia.org/r/711624 (https://phabricator.wikimedia.org/T280881) [17:45:55] (03CR) 10Bstorm: [C: 03+2] maintain-dbusers: delete users that are removed from ldap [puppet] - 10https://gerrit.wikimedia.org/r/711234 (https://phabricator.wikimedia.org/T285332) (owner: 10Bstorm) [17:49:26] (03CR) 10Elukey: [V: 03+2 C: 03+2] "Trying it out :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/711579 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [17:50:07] (03PS5) 10Zabe: dynamicproxy: migrate cron of proxydb-bak to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711230 (https://phabricator.wikimedia.org/T273673) [17:50:15] (03PS3) 10Jforrester: Provide nodejs12-slim and -devel based on Bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/697672 (https://phabricator.wikimedia.org/T284346) [17:50:28] (03PS1) 10Legoktm: Add k8s users, tokens for toolhub service [labs/private] - 10https://gerrit.wikimedia.org/r/711625 (https://phabricator.wikimedia.org/T280881) [17:51:02] (03CR) 10Zabe: dynamicproxy: migrate cron of proxydb-bak to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711230 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [17:59:53] (03CR) 10Legoktm: [V: 03+2 C: 03+2] 
Add k8s users, tokens for toolhub service [labs/private] - 10https://gerrit.wikimedia.org/r/711625 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window (Your patch may or may not be deployed at the sole discretion of the deployer) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210811T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:05] jeena and twentyafterfour: Time to snap out of that daydream and deploy Train log triage with CPT. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210811T1800). [18:00:22] thank you James_F ! [18:00:29] (03CR) 10Legoktm: [C: 03+2] Add tokens and users for toolhub service [puppet] - 10https://gerrit.wikimedia.org/r/711624 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [18:00:41] 10SRE, 10Release-Engineering-Team, 10Elections: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10Aklapper) >>! In T288455#7270626, @Tgr wrote: >>>! In T288455#7270569, @thcipriani wrote: >> https://wikimedia.biterg.io/goto/1d62cdd781dbfa9f093dd9... 
[18:02:02] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@563f876]: process_sparql_query: increase parallelism to help backfill [18:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:27] (03PS1) 10Majavah: Add Toolhub public DNS name [dns] - 10https://gerrit.wikimedia.org/r/711637 (https://phabricator.wikimedia.org/T280881) [18:04:07] (03CR) 10jerkins-bot: [V: 04-1] Add Toolhub public DNS name [dns] - 10https://gerrit.wikimedia.org/r/711637 (https://phabricator.wikimedia.org/T280881) (owner: 10Majavah) [18:04:24] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@563f876]: process_sparql_query: increase parallelism to help backfill (duration: 02m 21s) [18:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:59] (03PS1) 10BBlack: Remove wikimedia.com zonefile [dns] - 10https://gerrit.wikimedia.org/r/711638 (https://phabricator.wikimedia.org/T281428) [18:05:20] (03PS2) 10Majavah: Add Toolhub public DNS name [dns] - 10https://gerrit.wikimedia.org/r/711637 (https://phabricator.wikimedia.org/T280881) [18:06:28] (03PS1) 10Legoktm: admin_ng: Add toolhub namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/711639 (https://phabricator.wikimedia.org/T280881) [18:06:43] 10SRE, 10ops-codfw: codfw: Ship back Raritan test PDU - https://phabricator.wikimedia.org/T287762 (10Papaul) I opened ticket #2045424 with CY1 to unplug both test PDUs in D8 for tomorrow. We will be putting back the old PDU for now until we receive another test PDU. 
[18:07:30] (03PS1) 10Bstorm: maintain-dbusers: delete LDAP-absent accounts for real [puppet] - 10https://gerrit.wikimedia.org/r/711642 (https://phabricator.wikimedia.org/T285332) [18:10:05] 10SRE, 10Release-Engineering-Team, 10Elections: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10Aklapper) Note that people's names displayed on wikimedia.biterg.io are **not** necessarily their account usernames in some system (Gerrit, Phab, mw... [18:10:48] (03CR) 10Legoktm: [C: 03+2] admin_ng: Add toolhub namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/711639 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [18:10:57] (03PS1) 10Majavah: Add toolhub to cache backends [puppet] - 10https://gerrit.wikimedia.org/r/711648 (https://phabricator.wikimedia.org/T280881) [18:11:47] 10SRE, 10Release-Engineering-Team, 10Elections: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10RhinosF1) We still need to map gerrit username to SUL account for it to be useful for votewiki [18:14:12] (03Merged) 10jenkins-bot: admin_ng: Add toolhub namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/711639 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [18:14:57] 10SRE, 10Release-Engineering-Team, 10Elections: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10bd808) >>! In T288455#7276112, @Aklapper wrote: >>>! In T288455#7270626, @Tgr wrote: >>>>! In T288455#7270569, @thcipriani wrote: >>> https://wikime... 
[18:15:32] (03PS3) 10Bstorm: aptrepo: Drop thirdparty/kubeadm-k8s-1-17 [puppet] - 10https://gerrit.wikimedia.org/r/710669 (owner: 10Majavah) [18:16:47] (03CR) 10Bstorm: [C: 03+2] aptrepo: Drop thirdparty/kubeadm-k8s-1-17 [puppet] - 10https://gerrit.wikimedia.org/r/710669 (owner: 10Majavah) [18:19:25] !log legoktm@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [18:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:29] !log legoktm@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [18:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:22] !log legoktm@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [18:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:36] !log removed thirdparty/kubeadm-k8s-1-17 in reprepro [18:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:56] (03CR) 10Bstorm: "dropped in reprepro as well" [puppet] - 10https://gerrit.wikimedia.org/r/710669 (owner: 10Majavah) [18:22:18] !log legoktm@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [18:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:45] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [18:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:11] !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [18:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:38] !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [18:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:35] !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[18:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:09] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Restarting to pick up Java security updates - hnowlan@cumin1001 [18:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:19] (03CR) 10Bstorm: [C: 03+1] "No reason not to do it (though it surprises me). However, if we cannot even trust the encoding of the files, we really should be using saf" [puppet] - 10https://gerrit.wikimedia.org/r/711106 (https://phabricator.wikimedia.org/T288508) (owner: 10David Caro) [18:37:16] (03CR) 10BBlack: [C: 03+2] Remove wikimedia.com zonefile [dns] - 10https://gerrit.wikimedia.org/r/711638 (https://phabricator.wikimedia.org/T281428) (owner: 10BBlack) [18:39:50] 10SRE, 10Services, 10Toolhub, 10serviceops, and 2 others: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10Legoktm) [18:39:56] 10SRE, 10Traffic, 10Patch-For-Review, 10Wikimedia Enterprise (Okapi Wikimedia Enterprise): "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (10BBlack) 05Open→03Resolved [18:42:29] 10SRE, 10Services, 10Toolhub, 10serviceops, and 2 others: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10Legoktm) We're still missing the OAuth2 key/secret, but otherwise I think it should be possible to deploy to the staging/eqiad/codfw clusters now once the helmfile.d part is... 
[18:47:09] 10SRE: Onboarding for Arnold Okoth - https://phabricator.wikimedia.org/T288645 (10Arnoldokoth) [18:47:36] (03CR) 10Legoktm: "Whichever TLS port you end up using, please add it to https://wikitech.wikimedia.org/wiki/Kubernetes/Service_ports" [deployment-charts] - 10https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) (owner: 10BryanDavis) [18:59:19] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Schema:VirtualPageViews events alert. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [19:00:04] jeena and twentyafterfour: #bothumor I � Unicode. All rise for MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210811T1900). [19:05:47] (03PS1) 10Jeena Huneidi: group0 wikis to 1.37.0-wmf.18 refs T281159 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711660 [19:05:49] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.37.0-wmf.18 refs T281159 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711660 (owner: 10Jeena Huneidi) [19:07:15] (03PS1) 10Btullis: Begin decommission of druid1003.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/711661 (https://phabricator.wikimedia.org/T255148) [19:08:41] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.18 refs T281159 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711660 (owner: 10Jeena Huneidi) [19:10:27] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.18 refs T281159 [19:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:35] T281159: 1.37.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T281159 [19:11:35] If everything seems good then I will proceed to deploy to group1 in the next 10-15 minutes [19:17:00] 10SRE, 
10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Cmjohnson) @dzahn no worries, the on-site work is done but needs firmware updates and the passwords reset. I'll have these for you NLT tomorrow. Regarding mw1444,... [19:19:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:09] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T287838 (10RobH) IRC Update: Reseating the disk (Chris did so) did not fix this, as it doesn't fire off the redetection of disks automatically. I think if we follow the directions liste... [19:22:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:42] (03PS3) 10Ssingh: wikidough check: example authdns part [dns] - 10https://gerrit.wikimedia.org/r/711577 (owner: 10BBlack) [19:25:56] (03PS1) 10Jeena Huneidi: group1 wikis to 1.37.0-wmf.18 refs T281159 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711663 [19:25:58] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.37.0-wmf.18 refs T281159 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711663 (owner: 10Jeena Huneidi) [19:26:59] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.18 refs T281159 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711663 (owner: 10Jeena Huneidi) [19:28:30] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.18 refs T281159 [19:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:39] T281159: 1.37.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T281159 [19:29:39] !log 
jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.18 refs T281159 (duration: 01m 08s) [19:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:15] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T287838 (10Cmjohnson) @andrewbogott or @dcaro The disk did not show up as available even after attempting @robh's update. We will need to schedule downtime to reboot the server. [19:35:40] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:04] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission payments1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286942 (10Cmjohnson) [19:36:34] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission payments1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286943 (10Cmjohnson) [19:36:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:57] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission payments1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286944 (10Cmjohnson) [19:37:31] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission payments1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286945 (10Cmjohnson) [19:39:00] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission payments1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286945 (10Cmjohnson) 05Open→03Resolved [19:39:19] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission payments1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286944 (10Cmjohnson) 05Open→03Resolved [19:39:31] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:39:33] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission payments1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286943 (10Cmjohnson) 05Open→03Resolved [19:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:53] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission payments1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T286942 (10Cmjohnson) 05Open→03Resolved [19:45:14] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Cmjohnson) mw1267-mw1268 decom'd and removed [19:45:30] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:43] 10SRE, 10Release-Engineering-Team, 10Elections: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10thcipriani) >>! In T288455#7276184, @bd808 wrote: >>>! In T288455#7276112, @Aklapper wrote: >>>>! 
In T288455#7270626, @Tgr wrote: >>>>>! In T288455#... [19:56:55] (03PS1) 10Jdlrobson: virtualPageView: Log VirtualPageView events to Event Platform [extensions/Popups] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710723 [19:57:02] (03PS1) 10Jdlrobson: virtualPageView: Log VirtualPageView events to Event Platform [extensions/Popups] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/710724 [19:58:34] (03PS1) 10AOkoth: admin: Added aokoth [puppet] - 10https://gerrit.wikimedia.org/r/711673 (https://phabricator.wikimedia.org/T288645) [19:58:36] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/711673 (https://phabricator.wikimedia.org/T288645) (owner: 10AOkoth) [20:00:05] jeena and twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210811T1900). [20:00:05] chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210811T2000). [20:00:10] (03PS2) 10Jdlrobson: virtualPageView: Log VirtualPageView events to Event Platform [extensions/Popups] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/710724 (https://phabricator.wikimedia.org/T288655) [20:00:21] (03PS2) 10Jdlrobson: virtualPageView: Log VirtualPageView events to Event Platform [extensions/Popups] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710723 (https://phabricator.wikimedia.org/T288655) [20:01:53] (03CR) 10RLazarus: [C: 03+2] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/711673 (https://phabricator.wikimedia.org/T288645) (owner: 10AOkoth) [20:04:55] (03PS4) 10Ssingh: wikidough check: example authdns part [dns] - 10https://gerrit.wikimedia.org/r/711577 (owner: 10BBlack) [20:04:57] 10SRE, 10Patch-For-Review: Onboarding for Arnold Okoth - https://phabricator.wikimedia.org/T288645 (10Aklapper) @Arnoldokoth: Hi and welcome! :) Assuming your corresponding WMF SUL account is https://meta.wikimedia.org/wiki/Special:Log?page=User:AOkoth_(WMF) , could you please make https://phabricator.wikimedi... [20:05:36] mholloway: we've got permission from @jeena to do the backport now. [20:06:00] (03Abandoned) 10Ssingh: checkdoh: disable ECS for check subdomain [puppet] - 10https://gerrit.wikimedia.org/r/711578 (owner: 10BBlack) [20:06:04] Awesome, thanks jeena (and Jdlrobson)! [20:06:14] Jdlrobson: do you have a backport patch cooking or should I do the honors? [20:06:17] 👍 [20:06:26] I've just added you on them [20:06:29] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Popups/+/710723 [20:06:41] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Popups/+/710724 [20:06:51] There's 2 branches we have to think about unfortunately [20:09:35] (03CR) 10Mholloway: [C: 03+2] virtualPageView: Log VirtualPageView events to Event Platform [extensions/Popups] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/710724 (https://phabricator.wikimedia.org/T288655) (owner: 10Jdlrobson) [20:10:05] (03CR) 10Mholloway: [C: 03+2] virtualPageView: Log VirtualPageView events to Event Platform [extensions/Popups] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710723 (https://phabricator.wikimedia.org/T288655) (owner: 10Jdlrobson) [20:13:19] jeena: my tea tasting set finally showed up, btw :) [20:14:08] (03PS1) 10AOkoth: admin: Add aokoth to gitlab-roots group [puppet] - 10https://gerrit.wikimedia.org/r/711680 (https://phabricator.wikimedia.org/T288645) [20:14:39] 10SRE, 10Patch-For-Review: Onboarding for 
Arnold Okoth - https://phabricator.wikimedia.org/T288645 (10RLazarus) @Aklapper Good note -- it's in the onboarding docs for sure, we just haven't checked that step off yet. :) Thanks for calling it out, it'll be done shortly. [20:15:26] omg! It took that long?! [20:15:35] (03Merged) 10jenkins-bot: virtualPageView: Log VirtualPageView events to Event Platform [extensions/Popups] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/710724 (https://phabricator.wikimedia.org/T288655) (owner: 10Jdlrobson) [20:16:10] lol [20:16:27] ok, wmf.18 patch is in, deploying now... [20:16:27] does the tea age well? [20:16:40] (03Merged) 10jenkins-bot: virtualPageView: Log VirtualPageView events to Event Platform [extensions/Popups] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710723 (https://phabricator.wikimedia.org/T288655) (owner: 10Jdlrobson) [20:16:48] (03CR) 10RLazarus: [C: 03+2] admin: Add aokoth to gitlab-roots group [puppet] - 10https://gerrit.wikimedia.org/r/711680 (https://phabricator.wikimedia.org/T288645) (owner: 10AOkoth) [20:16:55] it's vacuum sealed so it should be fine :P [20:17:17] plus it wasn't long enough to age hahaha [20:18:58] @mholloway let me know when I can verify above [20:19:07] I'll just double check the VirtualPageview HTTP request still occurs [20:19:12] and that the schema is right [20:19:48] i had it sent to my mom's house since that's where i was that week; i think it showed up... end of june? [20:19:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:20] !log mholloway-shell@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/Popups: Log VirtualPageView events to Event Platform (T288655) (duration: 01m 09s) [20:20:21] sat in customs at ohare a good, long time, i guess [20:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:26] T288655: VirtualPageViews instrumentation broken in 1.37.0-wmf.17 - https://phabricator.wikimedia.org/T288655 [20:20:27] ok, doing wmf.17 now [20:21:11] weird. It always arrives here on the west coast in about 2 weeks for me. Anyway if you want to try it together or have any questions lmk! [20:21:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:26] !log mholloway-shell@deploy1002 Synchronized php-1.37.0-wmf.17/extensions/Popups: Log VirtualPageView events to Event Platform (T288655) (duration: 01m 06s) [20:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:35] Jdlrobson: done! [20:24:26] thanks mholloway... hopefully we'll see an influx of data shortly [20:27:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:27:53] Nothing yet.. I am hoping it's a 5 min cache problem.. [20:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:01] Yeah... [20:28:17] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:28:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:13] I am seeing the HTTP requests though pointing to events?hasty=true now [20:29:15] rather than beacon [20:30:24] !log [urbanecm@mwmaint2002 ~]$ mwscript namespaceDupes.php --wiki=wikimaniawiki --move-talk --add-prefix=T288643 --fix # T288643 [20:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:31] T288643: Wikimedia\Assert\PostconditionException: Postcondition failed: makeTitleSafe() should always return a Title for the text returned by getRootText(). - https://phabricator.wikimedia.org/T288643 [20:36:23] 10SRE, 10Patch-For-Review: Onboarding for Arnold Okoth - https://phabricator.wikimedia.org/T288645 (10RLazarus) @Aklapper Ah sorry, I misunderstood -- that was done but you're absolutely right, there was confusion over which account to use. We've fixed it now, thanks again for mentioning, and I'll improve the... [20:58:14] !log legoktm@cumin1001 START - Cookbook sre.dns.netbox [20:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:52] jouncebot: now [20:59:52] For the next 0 hour(s) and 0 minute(s): MediaWiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210811T1900) [20:59:52] For the next 0 hour(s) and 0 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210811T2000) [21:00:07] jouncebot: next [21:00:07] In 1 hour(s) and 59 minute(s): Evening backport window (Your patch may or may not be deployed at the sole discretion of the deployer) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210811T2300) [21:00:44] deploying https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SpamBlacklist/+/711662 now [21:01:02] (03PS1) 10Ladsgroup: Avoid using deprecated WikiPage::prepareContentForEdit [extensions/SpamBlacklist] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710725 
(https://phabricator.wikimedia.org/T288639) [21:01:22] (03PS1) 10Ladsgroup: Avoid using deprecated WikiPage::prepareContentForEdit [extensions/SpamBlacklist] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711706 (https://phabricator.wikimedia.org/T288639) [21:01:33] (03CR) 10Ladsgroup: [C: 03+2] Avoid using deprecated WikiPage::prepareContentForEdit [extensions/SpamBlacklist] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711706 (https://phabricator.wikimedia.org/T288639) (owner: 10Ladsgroup) [21:03:31] (03CR) 10Ladsgroup: [C: 03+2] Avoid using deprecated WikiPage::prepareContentForEdit [extensions/SpamBlacklist] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710725 (https://phabricator.wikimedia.org/T288639) (owner: 10Ladsgroup) [21:12:30] (03CR) 10Bstorm: [C: 03+2] "Discussed in IRC and seems ok" [puppet] - 10https://gerrit.wikimedia.org/r/711642 (https://phabricator.wikimedia.org/T285332) (owner: 10Bstorm) [21:18:59] !log legoktm@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:04] (03PS1) 10Bstorm: maintain-dbusers: old tools are missing and causing crashes [puppet] - 10https://gerrit.wikimedia.org/r/711693 (https://phabricator.wikimedia.org/T285332) [21:24:27] (03CR) 10Bstorm: [C: 03+2] maintain-dbusers: old tools are missing and causing crashes [puppet] - 10https://gerrit.wikimedia.org/r/711693 (https://phabricator.wikimedia.org/T285332) (owner: 10Bstorm) [21:24:35] (03Merged) 10jenkins-bot: Avoid using deprecated WikiPage::prepareContentForEdit [extensions/SpamBlacklist] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711706 (https://phabricator.wikimedia.org/T288639) (owner: 10Ladsgroup) [21:24:37] (03Merged) 10jenkins-bot: Avoid using deprecated WikiPage::prepareContentForEdit [extensions/SpamBlacklist] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710725 (https://phabricator.wikimedia.org/T288639) (owner: 
10Ladsgroup) [21:27:35] looks okay in mwdebug2002, innocent edits pass and spam edits fail [21:27:41] rolling to everywhere now [21:27:46] (wmf.18 for now) [21:29:29] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/SpamBlacklist/includes/SpamBlacklistHooks.php: Backport: [[gerrit:711706|Avoid using deprecated WikiPage::prepareContentForEdit (T288639)]] (duration: 01m 07s) [21:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:36] T288639: SpamBlacklistHooks::onEditFilterMergedContent causes every edit to be rendered twice - https://phabricator.wikimedia.org/T288639 [21:30:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:28] (03PS1) 10Legoktm: Record shell outs in statsd [extensions/Score] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/711707 [21:36:37] (03PS1) 10Legoktm: Record shell outs in statsd [extensions/Score] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711708 [21:38:19] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:40:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[21:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:02] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.17/extensions/SpamBlacklist/includes/SpamBlacklistHooks.php: Backport: [[gerrit:710725|Avoid using deprecated WikiPage::prepareContentForEdit (T288639)]] (duration: 01m 08s) [21:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:09] T288639: SpamBlacklistHooks::onEditFilterMergedContent causes every edit to be rendered twice - https://phabricator.wikimedia.org/T288639 [21:46:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:25] (03PS2) 10Legoktm: Add toolhub to cache backends [puppet] - 10https://gerrit.wikimedia.org/r/711648 (https://phabricator.wikimedia.org/T280881) (owner: 10Majavah) [21:48:27] (03PS1) 10Legoktm: Add toolhub to LVS [puppet] - 10https://gerrit.wikimedia.org/r/711702 (https://phabricator.wikimedia.org/T280881) [21:48:29] (03PS1) 10Legoktm: service: Switch toolhub to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/711703 (https://phabricator.wikimedia.org/T280881) [21:48:31] (03PS1) 10Legoktm: service: Switch toolhub to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/711704 (https://phabricator.wikimedia.org/T280881) [21:48:33] (03PS1) 10Legoktm: service: Switch toolhub to production [puppet] - 10https://gerrit.wikimedia.org/r/711705 (https://phabricator.wikimedia.org/T280881) [21:48:35] (03PS3) 10Legoktm: Add Toolhub public DNS name [dns] - 10https://gerrit.wikimedia.org/r/711637 (https://phabricator.wikimedia.org/T280881) (owner: 10Majavah) [21:48:37] (03PS1) 10Legoktm: Add toolhub.svc.{codfw,eqiad}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/711726 (https://phabricator.wikimedia.org/T280881) [21:48:39] (03PS1) 10Legoktm: Add toolhub to discovery [dns] - 
10https://gerrit.wikimedia.org/r/711727 (https://phabricator.wikimedia.org/T280881) [21:51:58] (03CR) 10Legoktm: [C: 03+2] Add toolhub.svc.{codfw,eqiad}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/711726 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [21:58:15] (03PS6) 10BryanDavis: toolhub: initial chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) [21:58:17] (03PS3) 10BryanDavis: toolhub: Add CronJob for crawler [deployment-charts] - 10https://gerrit.wikimedia.org/r/710704 (https://phabricator.wikimedia.org/T276405) [21:59:38] (03CR) 10Legoktm: "Is it OK if the crawler is running in both DCs? Or should it be only active in one spot?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/710704 (https://phabricator.wikimedia.org/T276405) (owner: 10BryanDavis) [22:01:09] (03CR) 10BryanDavis: toolhub: initial chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) (owner: 10BryanDavis) [22:01:52] (03PS1) 10Legoktm: service: Enable paging for shellbox-constraints service [puppet] - 10https://gerrit.wikimedia.org/r/711737 [22:02:46] 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10SRE Observability (FY2021/2022-Q1): Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10odimitrijevic) @herron can this task be closed out and possibly create a new cleanup the old hosts if this work still needs to be done? 
[22:03:52] (03CR) 10Legoktm: [C: 03+2] Record shell outs in statsd [extensions/Score] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/711707 (owner: 10Legoktm) [22:03:54] (03CR) 10Legoktm: [C: 03+2] Record shell outs in statsd [extensions/Score] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711708 (owner: 10Legoktm) [22:04:05] (03CR) 10BryanDavis: toolhub: Add CronJob for crawler (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/710704 (https://phabricator.wikimedia.org/T276405) (owner: 10BryanDavis) [22:18:23] (03PS1) 10Cwhite: profile: improve kafka_shipper rsyslog output ssl options [puppet] - 10https://gerrit.wikimedia.org/r/711741 (https://phabricator.wikimedia.org/T288618) [22:21:04] (03CR) 10Legoktm: toolhub: Add CronJob for crawler (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/710704 (https://phabricator.wikimedia.org/T276405) (owner: 10BryanDavis) [22:22:35] (03PS2) 10Cwhite: profile: improve kafka_shipper rsyslog output ssl options [puppet] - 10https://gerrit.wikimedia.org/r/711741 (https://phabricator.wikimedia.org/T288618) [22:24:41] (03CR) 10Cwhite: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/30556/" [puppet] - 10https://gerrit.wikimedia.org/r/711741 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [22:25:22] (03Merged) 10jenkins-bot: Record shell outs in statsd [extensions/Score] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/711707 (owner: 10Legoktm) [22:27:17] (03Merged) 10jenkins-bot: Record shell outs in statsd [extensions/Score] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/711708 (owner: 10Legoktm) [22:30:04] !log legoktm@deploy1002 Synchronized php-1.37.0-wmf.17/extensions/Score/includes/Score.php: Record shell outs in statsd (duration: 01m 08s) [22:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:21] !log legoktm@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/Score/includes/Score.php: Record shell 
outs in statsd (duration: 01m 07s) [22:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:44] (03CR) 10BryanDavis: toolhub: Add CronJob for crawler (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/710704 (https://phabricator.wikimedia.org/T276405) (owner: 10BryanDavis) [23:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport window (Your patch may or may not be deployed at the sole discretion of the deployer) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210811T2300). Please do the needful. [23:00:04] ebernhardson: A patch you scheduled for Evening backport window (Your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:06] ebernhardson: hey, will you self-service the deployment?
[23:01:19] sure, i need to delay like 20 minutes between the patches anyways [23:01:53] (03PS2) 10Ebernhardson: [cirrus] switch more_like traffic to codfw 1/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704389 (owner: 10DCausse) [23:02:02] (03CR) 10Ebernhardson: [C: 03+2] [cirrus] switch more_like traffic to codfw 1/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704389 (owner: 10DCausse) [23:02:34] Ack :) [23:02:48] (03Merged) 10jenkins-bot: [cirrus] switch more_like traffic to codfw 1/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704389 (owner: 10DCausse) [23:04:31] (03PS1) 10BryanDavis: switchdc: Exclude toolhub, lacking active/active db [cookbooks] - 10https://gerrit.wikimedia.org/r/711763 (https://phabricator.wikimedia.org/T288685) [23:06:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:06:38] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: cirrus: switch more_like traffic to codfw 1/2 (duration: 01m 08s) [23:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:41] (03CR) 10Legoktm: [C: 03+2] switchdc: Exclude toolhub, lacking active/active db [cookbooks] - 10https://gerrit.wikimedia.org/r/711763 (https://phabricator.wikimedia.org/T288685) (owner: 10BryanDavis) [23:11:23] (03Merged) 10jenkins-bot: switchdc: Exclude toolhub, lacking active/active db [cookbooks] - 10https://gerrit.wikimedia.org/r/711763 (https://phabricator.wikimedia.org/T288685) (owner: 10BryanDavis) [23:20:00] (03PS2) 10Ebernhardson: [cirrus] switch more_like traffic to codfw 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704390 (owner: 10DCausse) [23:20:08] (03CR) 10Ebernhardson: [C: 03+2] [cirrus] switch more_like traffic to codfw 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704390 (owner: 10DCausse) [23:20:26] (03PS3) 10Acamicamacaraca: Add namespace aliases for hr.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710564 (https://phabricator.wikimedia.org/T287024) [23:20:41] (03PS4) 10Acamicamacaraca: Add namespace aliases for hr.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710564 (https://phabricator.wikimedia.org/T287024) [23:21:23] (03Merged) 10jenkins-bot: [cirrus] switch more_like traffic to codfw 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704390 (owner: 10DCausse) [23:22:25] (03PS5) 10Acamicamacaraca: Add namespace aliases for hr.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710564 (https://phabricator.wikimedia.org/T287024) [23:24:37] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: cirrus: switch more_like traffic to codfw 2/2 (duration: 01m 08s) [23:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log