[00:00:05] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220317T0000).
[00:03:30] (PS1) Jbond: controller: add supper for multiple statement types [software/puppet-compiler] - https://gerrit.wikimedia.org/r/771483
[00:05:07] (CR) jerkins-bot: [V: -1] controller: add supper for multiple statement types [software/puppet-compiler] - https://gerrit.wikimedia.org/r/771483 (owner: Jbond)
[00:05:23] (CR) Jbond: controller: add supper for multiple statement types (1 comment) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/771483 (owner: Jbond)
[00:24:48] (PS2) Jbond: controller: add supper for multiple statement types [software/puppet-compiler] - https://gerrit.wikimedia.org/r/771483
[00:45:08] (PS3) Jbond: controller: add supper for multiple statement types [software/puppet-compiler] - https://gerrit.wikimedia.org/r/771483
[01:11:26] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1016.eqiad.wmnet with OS bullseye
[01:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:03:34] PROBLEM - Check systemd state on install1003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_atftpd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[02:07:00] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1016.eqiad.wmnet with OS bullseye
[02:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:32] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1016.eqiad.wmnet with OS bullseye
[02:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:17:54] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (Andrew) For the record: reimaging this host worked properly on the 14th after Arzhel applied the suggested hack. Today, though, th...
[02:25:19] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (Andrew) Great news, everybody! For a sanity check I just now tried to re-image cloudvirt1016, and it won't pxe-boot either. That suggests that t...
[02:27:15] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:29:25] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 66 probes of 672 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:35:02] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 60 probes of 672 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:57:39] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1016.eqiad.wmnet with OS bullseye
[02:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:11:32] (CR) Samtar: [C: -1] "-1 to prevent merge per comments by Lucas and T303665#7775892" [mediawiki-config] - https://gerrit.wikimedia.org/r/770120 (https://phabricator.wikimedia.org/T303665) (owner: Samtar)
[03:18:37] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:32:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:34:15] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:06:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[04:06:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[04:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:06:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T300775)', diff saved to https://phabricator.wikimedia.org/P22737 and previous config saved to /var/cache/conftool/dbconfig/20220317-040634-marostegui.json
[04:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:06:38] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[04:20:23] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:51:13] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:54:05] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[06:11:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 10%: After buffer pool testing', diff saved to https://phabricator.wikimedia.org/P22738 and previous config saved to /var/cache/conftool/dbconfig/20220317-061144-root.json
[06:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[06:15:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[06:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 25%: After buffer pool testing', diff saved to https://phabricator.wikimedia.org/P22739 and previous config saved to /var/cache/conftool/dbconfig/20220317-062648-root.json
[06:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:19] (PS1) Marostegui: Revert "db1099: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/771397
[06:29:02] (CR) Marostegui: [C: +2] Revert "db1099: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/771397 (owner: Marostegui)
[06:39:21] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:41:01] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:41:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 50%: After buffer pool testing', diff saved to https://phabricator.wikimedia.org/P22740 and previous config saved to /var/cache/conftool/dbconfig/20220317-064152-root.json
[06:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:11] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:46:33] !log kill remaining hanging processes for ppche*lko and accra*ze on an-test-client1001 to allow users offboard (puppet broken)
[06:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:53] PROBLEM - Query Service HTTP Port on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[06:50:08] !log restart blazegraph on wdqs2004
[06:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:43] RECOVERY - Query Service HTTP Port on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.716 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[06:51:05] dcausse: --^
[06:52:17] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:52:41] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:52:42] elukey: dcausse: ty, glancing at https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&refresh=1m `wdqs200[1,3]` are failing to report as well, I'm restarting those two
[06:53:19] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:53:21] !log [WDQS] `ryankemper@wdqs2001:~$ sudo systemctl restart wdqs-blazegraph.service`
[06:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:07] PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[06:54:16] !log [WDQS] `ryankemper@wdqs2003:~$ sudo systemctl restart wdqs-blazegraph.service`
[06:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 75%: After buffer pool testing', diff saved to https://phabricator.wikimedia.org/P22741 and previous config saved to /var/cache/conftool/dbconfig/20220317-065656-root.json
[06:56:59] RECOVERY - Query Service HTTP Port on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[06:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:03] !log [WDQS] Note that per https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&from=1647457172391&to=1647500081971&viewPanel=7 `wdqs2003` has been offline for ~6 hours, `wdqs2001` for 1.5 hours and `wdqs2004` just recently.
[06:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:14] !log [WDQS] Also of note is the spiking thread counts on the affected hosts: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&from=1647457172391&to=1647500081971&viewPanel=22
[06:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:43] (PS1) Marostegui: change_oi_timestamp_T298556.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/771511 (https://phabricator.wikimedia.org/T298556)
[06:59:58] (Abandoned) Giuseppe Lavagetto: C:varnish: use X-Abuse-Network [puppet] - https://gerrit.wikimedia.org/r/769902 (owner: Giuseppe Lavagetto)
[07:00:04] Amir1 and apergos: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220317T0700).
[07:00:16] morning.
[07:00:22] o/
[07:00:31] looks like nothing to do.
[07:00:31] no trainees signed up, no patches in the window
[07:00:37] yep
[07:00:57] good thing with this slot being an hour earlier during daylight savings messup time
[07:01:23] (CR) Majavah: P:scap::dsh: Add scap targets as a dsh group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: Jbond)
[07:04:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[07:04:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[07:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance
[07:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance
[07:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P22742 and previous config saved to /var/cache/conftool/dbconfig/20220317-070650-root.json
[07:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:51] !log [WDQS] Depooled `wdqs2003` (8 hours of lag to catch up on)
[07:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 100%: After buffer pool testing', diff saved to https://phabricator.wikimedia.org/P22743 and previous config saved to /var/cache/conftool/dbconfig/20220317-071200-root.json
[07:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P22744 and previous config saved to /var/cache/conftool/dbconfig/20220317-072154-root.json
[07:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:46] !log dbmaint on s5@eqiad T297189
[07:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:52] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[07:36:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P22745 and previous config saved to /var/cache/conftool/dbconfig/20220317-073658-root.json
[07:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:46] (CR) Filippo Giunchedi: "LGTM overall, see inline" [puppet] - https://gerrit.wikimedia.org/r/771384 (https://phabricator.wikimedia.org/T302493) (owner: Majavah)
[07:50:19] (PS3) Majavah: P:wmcs::prometheus: set team: wmcs on all alerts [puppet] - https://gerrit.wikimedia.org/r/771384 (https://phabricator.wikimedia.org/T302493)
[07:51:29] (CR) Majavah: P:wmcs::prometheus: set team: wmcs on all alerts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771384 (https://phabricator.wikimedia.org/T302493) (owner: Majavah)
[07:52:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P22746 and previous config saved to /var/cache/conftool/dbconfig/20220317-075201-root.json
[07:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[07:53:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[07:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[07:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[07:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298557)', diff saved to https://phabricator.wikimedia.org/P22747 and previous config saved to /var/cache/conftool/dbconfig/20220317-075350-marostegui.json
[07:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:54] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[07:54:01] (CR) Filippo Giunchedi: [C: +2] P:wmcs::prometheus: set team: wmcs on all alerts [puppet] - https://gerrit.wikimedia.org/r/771384 (https://phabricator.wikimedia.org/T302493) (owner: Majavah)
[08:07:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P22748 and previous config saved to /var/cache/conftool/dbconfig/20220317-080705-root.json
[08:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:28] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Accraze out of all services on: 1881 hosts
[08:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Accraze out of all services on: 1881 hosts
[08:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:16] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Ppchelko out of all services on: 1881 hosts
[08:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Ppchelko out of all services on: 1881 hosts
[08:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:16] jouncebot: nowandnext
[08:19:16] No deployments scheduled for the next 1 hour(s) and 40 minute(s)
[08:19:16] In 1 hour(s) and 40 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220317T1000)
[08:20:45] (CR) Urbanecm: [C: +2] Throttle: Increase limit for English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/771477 (https://phabricator.wikimedia.org/T304016) (owner: Samtar)
[08:20:51] (PS1) Urbanecm: throttle: Remove expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/771545
[08:21:02] (CR) Urbanecm: [C: +2] throttle: Remove expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/771545 (owner: Urbanecm)
[08:21:21] (Merged) jenkins-bot: Throttle: Increase limit for English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/771477 (https://phabricator.wikimedia.org/T304016) (owner: Samtar)
[08:21:39] (Merged) jenkins-bot: throttle: Remove expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/771545 (owner: Urbanecm)
[08:23:25] !log urbanecm@deploy1002 Synchronized wmf-config/throttle.php: 980ea35d454563e538d08b9d6462064455b4d28c: Throttle: Increase limit for English Wikipedia (T304016) (duration: 00m 51s)
[08:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:30] T304016: Temporary lift on English Wikipedia account creation for specific IP address - https://phabricator.wikimedia.org/T304016
[08:24:37] !log urbanecm@deploy1002 Synchronized wmf-config/throttle.php: 0da40c22844746120de9b33e772598d38aa74326: throttle: Remove expired rules (duration: 00m 50s)
[08:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:29] (PS1) Urbanecm: Initial configuration for shnwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/771546 (https://phabricator.wikimedia.org/T302797)
[08:39:28] (PS1) Urbanecm: Initial configuration for guwwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/771547 (https://phabricator.wikimedia.org/T303727)
[08:39:56] (CR) Hashar: "The test failure is in MediaWikiIntegrationTestCase" [core] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/770938 (owner: Aaron Schulz)
[08:40:26] (PS1) Hashar: tests: Fix @group Broken on MediaWikiIntegrationTestCaseSchemaTest [core] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/771398 (https://phabricator.wikimedia.org/T292239)
[08:41:25] (CR) Volans: "LGTM (without ES specific context). Reply to question inline." [cookbooks] - https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: Bking)
[08:41:54] (PS2) Hashar: rdbms: use the LoadBalancer id in flushPrimarySessions() [core] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/770938 (owner: Aaron Schulz)
[08:42:58] (CR) JMeybohm: [C: +1] Set bullseye + overlayfs for kubernetes2015 [puppet] - https://gerrit.wikimedia.org/r/771422 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[08:43:05] (CR) JMeybohm: [C: +1] Set bullseye + overlayfs for kubernetes2016 [puppet] - https://gerrit.wikimedia.org/r/771423 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[08:44:53] (CR) JMeybohm: [C: +2] Change certificate expiry thresholds to 9d warn and 7d critical [alerts] - https://gerrit.wikimedia.org/r/771361 (https://phabricator.wikimedia.org/T303932) (owner: JMeybohm)
[08:45:12] SRE, Traffic: OCSP staple validity alerts/warnings misfire - https://phabricator.wikimedia.org/T304047 (fgiunchedi)
[08:46:10] (CR) Muehlenhoff: [C: +2] Update point of contact for Thumbor [puppet] - https://gerrit.wikimedia.org/r/770910 (https://phabricator.wikimedia.org/T294484) (owner: Muehlenhoff)
[08:47:04] (Merged) jenkins-bot: Change certificate expiry thresholds to 9d warn and 7d critical [alerts] - https://gerrit.wikimedia.org/r/771361 (https://phabricator.wikimedia.org/T303932) (owner: JMeybohm)
[08:51:14] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Clarakosi out of all services on: 1881 hosts
[08:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Clarakosi out of all services on: 1881 hosts
[08:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:44] (PS1) Muehlenhoff: Remove access for clarakosi [puppet] - https://gerrit.wikimedia.org/r/771548
[08:54:51] (CR) Muehlenhoff: [C: +2] Remove access for clarakosi [puppet] - https://gerrit.wikimedia.org/r/771548 (owner: Muehlenhoff)
[08:55:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298557)', diff saved to https://phabricator.wikimedia.org/P22749 and previous config saved to /var/cache/conftool/dbconfig/20220317-085502-marostegui.json
[08:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:07] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[08:59:26] (CR) Ayounsi: [C: +2] definitions: add drmrs to wikimedia-private [homer/public] - https://gerrit.wikimedia.org/r/771438 (owner: Ssingh)
[08:59:40] (CR) Ayounsi: [C: +2] "Thanks!" [homer/public] - https://gerrit.wikimedia.org/r/771438 (owner: Ssingh)
[09:00:00] (Merged) jenkins-bot: definitions: add drmrs to wikimedia-private [homer/public] - https://gerrit.wikimedia.org/r/771438 (owner: Ssingh)
[09:00:32] SRE, LDAP-Access-Requests: Grant Access to ldap/nda for Dale_Zhou - https://phabricator.wikimedia.org/T303702 (MGerlach) @BBlack (pinging you as per [[ https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty#Schedule | clinic-duty schedule ]], apologies if you are not the right contact): I wanted to check if...
[09:01:22] (Abandoned) Muehlenhoff: Also include staging server in analytics-tools Cumin alias [puppet] - https://gerrit.wikimedia.org/r/766567 (owner: Muehlenhoff)
[09:05:47] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:08:39] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:09:38] (CR) Ayounsi: "FYI, this definition is only used in management firewalls. so it's noop for prod flows (eg. varnishkafka)." [homer/public] - https://gerrit.wikimedia.org/r/771438 (owner: Ssingh)
[09:10:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P22750 and previous config saved to /var/cache/conftool/dbconfig/20220317-091007-marostegui.json
[09:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[09:19:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[09:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T297189)', diff saved to https://phabricator.wikimedia.org/P22751 and previous config saved to /var/cache/conftool/dbconfig/20220317-091911-marostegui.json
[09:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:15] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[09:25:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P22752 and previous config saved to /var/cache/conftool/dbconfig/20220317-092512-marostegui.json
[09:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:09] (PS2) ArielGlenn: Handle exceptions from getting web requests properly [puppet] - https://gerrit.wikimedia.org/r/768045 (https://phabricator.wikimedia.org/T302930)
[09:35:54] (CR) Btullis: [C: +1] "Looks good. This is my first real exposure to wmf-auto-restart but I like it." [puppet] - https://gerrit.wikimedia.org/r/767717 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[09:36:01] (CR) ArielGlenn: [C: +2] Handle exceptions from getting web requests properly [puppet] - https://gerrit.wikimedia.org/r/768045 (https://phabricator.wikimedia.org/T302930) (owner: ArielGlenn)
[09:37:20] SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate if stopping mysql with buffer_pool dump between 10.4 versions is safe - https://phabricator.wikimedia.org/T303498 (Marostegui) I haven't been able to reproduce a crash again. Before closing this, I am going to upgrade a few more hosts a...
[09:40:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298557)', diff saved to https://phabricator.wikimedia.org/P22754 and previous config saved to /var/cache/conftool/dbconfig/20220317-094017-marostegui.json
[09:40:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[09:40:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[09:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:23] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[09:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298557)', diff saved to https://phabricator.wikimedia.org/P22755 and previous config saved to /var/cache/conftool/dbconfig/20220317-094025-marostegui.json
[09:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:13] (CR) Ladsgroup: [C: +1] change_oi_timestamp_T298556.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/771511 (https://phabricator.wikimedia.org/T298556) (owner: Marostegui)
[09:43:32] (CR) Marostegui: [C: +2] change_oi_timestamp_T298556.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/771511 (https://phabricator.wikimedia.org/T298556) (owner: Marostegui)
[09:45:32] (PS2) Jcrespo: dbbackups: Migrate remote backup (snapshot) cmd line to 0.7 format [puppet] - https://gerrit.wikimedia.org/r/770023 (https://phabricator.wikimedia.org/T138562)
[09:45:34] (PS1) Jcrespo: bacula: Add mixed priority to all jobs [puppet] - https://gerrit.wikimedia.org/r/771551 (https://phabricator.wikimedia.org/T95705)
[09:45:38] (Merged) jenkins-bot: change_oi_timestamp_T298556.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/771511 (https://phabricator.wikimedia.org/T298556) (owner: Marostegui)
[09:46:49] (PS2) Jcrespo: bacula: Add mixed priority to all jobs [puppet] - https://gerrit.wikimedia.org/r/771551 (https://phabricator.wikimedia.org/T95705)
[09:47:33] (PS3) Jcrespo: bacula: Add mixed priority to all jobs [puppet] - https://gerrit.wikimedia.org/r/771551 (https://phabricator.wikimedia.org/T95705)
[09:50:08] (CR) Elukey: [C: +2] Set bullseye + overlayfs for kubernetes2015 [puppet] - https://gerrit.wikimedia.org/r/771422 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[09:50:14] (PS2) Elukey: Set bullseye + overlayfs for kubernetes2015 [puppet] - https://gerrit.wikimedia.org/r/771422 (https://phabricator.wikimedia.org/T300744)
[09:50:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[09:50:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[09:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298556)', diff saved to https://phabricator.wikimedia.org/P22756 and previous config saved to /var/cache/conftool/dbconfig/20220317-095044-marostegui.json [09:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:48] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [09:52:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298556)', diff saved to https://phabricator.wikimedia.org/P22757 and previous config saved to /var/cache/conftool/dbconfig/20220317-095204-marostegui.json [09:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:05] (CR) Muehlenhoff: systemd: Add new define to manage user service environments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771410 (owner: Jbond) [09:53:54] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (cmooney) Juniper haven't found a related bug and are in the process of re-creating our setup in their lab to try to replicate the issue. I won't be online m... [10:00:04] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220317T1000). 
[10:00:07] (03PS2) 10Elukey: Set bullseye + overlayfs for kubernetes2016 [puppet] - 10https://gerrit.wikimedia.org/r/771423 (https://phabricator.wikimedia.org/T300744) [10:00:09] (03PS1) 10Elukey: Set bullseye in dhcp config for kubernetes201[5,6] [puppet] - 10https://gerrit.wikimedia.org/r/771552 (https://phabricator.wikimedia.org/T300744) [10:02:35] (03CR) 10Elukey: [C: 03+2] Set bullseye in dhcp config for kubernetes201[5,6] [puppet] - 10https://gerrit.wikimedia.org/r/771552 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [10:03:51] (03CR) 10JMeybohm: [C: 03+1] Add dummy deployment user/tokens for datahub [labs/private] - 10https://gerrit.wikimedia.org/r/771363 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [10:05:55] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [10:07:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P22758 and previous config saved to /var/cache/conftool/dbconfig/20220317-100709-marostegui.json [10:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:19] (03PS1) 10Muehlenhoff: mediabackup::storage: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/771560 [10:10:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/771560 (owner: 10Muehlenhoff) [10:10:47] !log dbmaint on s7@eqiad T298556 [10:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:51] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [10:12:11] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy deployment user/tokens for datahub [labs/private] - 
10https://gerrit.wikimedia.org/r/771363 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [10:15:11] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:15:50] this is me --^ (Reimaging a node) [10:17:01] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:22:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P22759 and previous config saved to /var/cache/conftool/dbconfig/20220317-102214-marostegui.json [10:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:13] !log dbmaint on s3@codfw T298556 [10:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:17] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [10:26:11] !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts ms-fe[1005-1008].eqiad.wmnet [10:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:40] (NodeTextfileStale) resolved: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [10:31:55] !log mvernon@cumin1001 START - Cookbook sre.dns.netbox [10:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298556)', diff saved to https://phabricator.wikimedia.org/P22760 and previous config 
saved to /var/cache/conftool/dbconfig/20220317-103719-marostegui.json [10:37:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance [10:37:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance [10:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:24] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [10:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298556)', diff saved to https://phabricator.wikimedia.org/P22761 and previous config saved to /var/cache/conftool/dbconfig/20220317-103726-marostegui.json [10:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298556)', diff saved to https://phabricator.wikimedia.org/P22762 and previous config saved to /var/cache/conftool/dbconfig/20220317-103844-marostegui.json [10:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:59] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:40:36] !log mvernon@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:01] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:43:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298557)', 
diff saved to https://phabricator.wikimedia.org/P22763 and previous config saved to /var/cache/conftool/dbconfig/20220317-104358-marostegui.json [10:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:02] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [10:47:23] !log dbmaint on s3@eqiad T298556 [10:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:28] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [10:50:09] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-fe[1005-1008].eqiad.wmnet [10:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:13] 10SRE-swift-storage: Decommission ms-fe100[5-8] - https://phabricator.wikimedia.org/T303733 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin1001 for hosts: `ms-fe[1005-1008].eqiad.wmnet` - ms-fe1005.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Found physic... 
[10:53:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P22764 and previous config saved to /var/cache/conftool/dbconfig/20220317-105349-marostegui.json [10:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:38] (03CR) 10Elukey: [C: 03+2] Set bullseye + overlayfs for kubernetes2016 [puppet] - 10https://gerrit.wikimedia.org/r/771423 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [10:57:40] (03PS1) 10Btullis: Add dummy secrets for datahub deployment [labs/private] - 10https://gerrit.wikimedia.org/r/771563 (https://phabricator.wikimedia.org/T303049) [10:58:31] (03PS2) 10Btullis: Add dummy secrets for datahub deployment [labs/private] - 10https://gerrit.wikimedia.org/r/771563 (https://phabricator.wikimedia.org/T303049) [10:59:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P22765 and previous config saved to /var/cache/conftool/dbconfig/20220317-105903-marostegui.json [10:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:26] (KubernetesCalicoDown) firing: kubernetes2016.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:03:45] this is me --^ [11:03:49] adding downtime [11:05:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1119.eqiad.wmnet with reason: Maintenance [11:05:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1119.eqiad.wmnet with reason: Maintenance [11:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298556)', diff saved 
to https://phabricator.wikimedia.org/P22766 and previous config saved to /var/cache/conftool/dbconfig/20220317-110536-marostegui.json [11:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:41] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [11:06:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P22767 and previous config saved to /var/cache/conftool/dbconfig/20220317-110645-root.json [11:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:06] (03PS1) 10Alexandros Kosiaris: Add kubernetes1018-1022 as BGP neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/771564 (https://phabricator.wikimedia.org/T293728) [11:11:37] (03CR) 10Elukey: [C: 03+1] "Checked all IPs and syntax, LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/771564 (https://phabricator.wikimedia.org/T293728) (owner: 10Alexandros Kosiaris) [11:13:59] (03PS1) 10Volans: CHANGELOG: add changelogs for release v2.3.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/771566 [11:14:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P22768 and previous config saved to /var/cache/conftool/dbconfig/20220317-111408-marostegui.json [11:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:23] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v2.3.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/771566 (owner: 10Volans) [11:18:43] (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [11:20:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 50%: After schema change', diff saved to 
https://phabricator.wikimedia.org/P22769 and previous config saved to /var/cache/conftool/dbconfig/20220317-112004-root.json [11:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:33] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v2.3.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/771566 (owner: 10Volans) [11:21:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P22770 and previous config saved to /var/cache/conftool/dbconfig/20220317-112148-root.json [11:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:39] (03PS1) 10Volans: Upstream release v2.3.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/771567 [11:22:41] (03PS4) 10Ladsgroup: idp: Open up orchestrator to cumin host [puppet] - 10https://gerrit.wikimedia.org/r/770981 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond) [11:22:45] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] idp: Open up orchestrator to cumin host [puppet] - 10https://gerrit.wikimedia.org/r/770981 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond) [11:23:24] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:23:24] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:24:32] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:24:32] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:24:37] 
(03PS1) 10Ladsgroup: Revert "idp: Open up orchestrator to cumin host" [puppet] - 10https://gerrit.wikimedia.org/r/771399 [11:24:48] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "idp: Open up orchestrator to cumin host" [puppet] - 10https://gerrit.wikimedia.org/r/771399 (owner: 10Ladsgroup) [11:26:37] (03CR) 10Volans: [C: 03+2] Upstream release v2.3.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/771567 (owner: 10Volans) [11:26:53] (03PS2) 10Jbond: systemd: Add new define to manage user service environments [puppet] - 10https://gerrit.wikimedia.org/r/771410 [11:26:55] (03PS1) 10Jbond: P:environment: Add support for environment.d to zsh and bash [puppet] - 10https://gerrit.wikimedia.org/r/771568 (https://phabricator.wikimedia.org/T278315) [11:27:11] (03PS1) 10David Caro: openstack: add neutron-rpc-server and ensure up [puppet] - 10https://gerrit.wikimedia.org/r/771569 (https://phabricator.wikimedia.org/T302369) [11:27:28] (03CR) 10Ladsgroup: "Thank you for this!" [puppet] - 10https://gerrit.wikimedia.org/r/754114 (https://phabricator.wikimedia.org/T236954) (owner: 10JHathaway) [11:28:54] (03CR) 10jerkins-bot: [V: 04-1] P:environment: Add support for environment.d to zsh and bash [puppet] - 10https://gerrit.wikimedia.org/r/771568 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [11:29:01] (03CR) 10jerkins-bot: [V: 04-1] systemd: Add new define to manage user service environments [puppet] - 10https://gerrit.wikimedia.org/r/771410 (owner: 10Jbond) [11:29:11] (03PS2) 10David Caro: openstack: add neutron-rpc-server and ensure up [puppet] - 10https://gerrit.wikimedia.org/r/771569 (https://phabricator.wikimedia.org/T302369) [11:29:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298557)', diff saved to https://phabricator.wikimedia.org/P22771 and previous config saved to /var/cache/conftool/dbconfig/20220317-112913-marostegui.json [11:29:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime 
for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [11:29:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [11:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:18] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [11:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T298557)', diff saved to https://phabricator.wikimedia.org/P22772 and previous config saved to /var/cache/conftool/dbconfig/20220317-112921-marostegui.json [11:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:19] (03PS3) 10Jbond: systemd: Add new define to manage user service environments [puppet] - 10https://gerrit.wikimedia.org/r/771410 [11:30:47] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34385/console" [puppet] - 10https://gerrit.wikimedia.org/r/771569 (https://phabricator.wikimedia.org/T302369) (owner: 10David Caro) [11:32:14] (03CR) 10jerkins-bot: [V: 04-1] systemd: Add new define to manage user service environments [puppet] - 10https://gerrit.wikimedia.org/r/771410 (owner: 10Jbond) [11:32:16] (03CR) 10David Caro: [V: 03+1] "PCC looks as expected, just adding the service running on cloudcontrtols, and no changes on cloudvirts." 
[puppet] - 10https://gerrit.wikimedia.org/r/771569 (https://phabricator.wikimedia.org/T302369) (owner: 10David Caro) [11:32:24] (03PS1) 10Ladsgroup: idp: Open up orchestrator to cumin host, take II [puppet] - 10https://gerrit.wikimedia.org/r/771570 (https://phabricator.wikimedia.org/T281249) [11:32:37] (03PS2) 10Jbond: P:environment: Add support for environment.d to zsh and bash [puppet] - 10https://gerrit.wikimedia.org/r/771568 (https://phabricator.wikimedia.org/T278315) [11:32:43] (03Merged) 10jenkins-bot: Upstream release v2.3.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/771567 (owner: 10Volans) [11:34:13] (03CR) 10jerkins-bot: [V: 04-1] P:environment: Add support for environment.d to zsh and bash [puppet] - 10https://gerrit.wikimedia.org/r/771568 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [11:35:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P22773 and previous config saved to /var/cache/conftool/dbconfig/20220317-113508-root.json [11:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:50] (03CR) 10Jcrespo: "There is already a yaml constant that would be useful here: mysql_root_clients" [puppet] - 10https://gerrit.wikimedia.org/r/771570 (https://phabricator.wikimedia.org/T281249) (owner: 10Ladsgroup) [11:36:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P22774 and previous config saved to /var/cache/conftool/dbconfig/20220317-113652-root.json [11:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:56] (03PS1) 10Klausman: Clean out bashrc, vimrc et al [puppet] - 10https://gerrit.wikimedia.org/r/771571 [11:38:06] (03CR) 10Jbond: C:varnish: create rate limit keyed on the cloud provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769469 
(https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [11:38:25] (03PS2) 10Klausman: admin/klausman Clean out bashrc, vimrc et al [puppet] - 10https://gerrit.wikimedia.org/r/771571 [11:38:39] (03PS3) 10Jbond: P:environment: Add support for environment.d to zsh and bash [puppet] - 10https://gerrit.wikimedia.org/r/771568 (https://phabricator.wikimedia.org/T278315) [11:39:20] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:25] (03PS3) 10Klausman: admin/klausman Clean out bashrc, vimrc et al [puppet] - 10https://gerrit.wikimedia.org/r/771571 [11:40:12] (03CR) 10jerkins-bot: [V: 04-1] P:environment: Add support for environment.d to zsh and bash [puppet] - 10https://gerrit.wikimedia.org/r/771568 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [11:40:38] (03PS4) 10Jbond: P:environment: Add support for environment.d to zsh and bash [puppet] - 10https://gerrit.wikimedia.org/r/771568 (https://phabricator.wikimedia.org/T278315) [11:40:47] (03PS4) 10Klausman: admin/klausman Clean out bashrc, vimrc et al [puppet] - 10https://gerrit.wikimedia.org/r/771571 [11:40:49] (03PS2) 10Ladsgroup: idp: Open up orchestrator to cumin host, take II [puppet] - 10https://gerrit.wikimedia.org/r/771570 (https://phabricator.wikimedia.org/T281249) [11:40:57] (03CR) 10Klausman: [C: 03+2] admin/klausman Clean out bashrc, vimrc et al [puppet] - 10https://gerrit.wikimedia.org/r/771571 (owner: 10Klausman) [11:41:17] !log uploaded spicerack_2.3.3 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [11:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34390/console" [puppet] - 10https://gerrit.wikimedia.org/r/771568 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [11:42:33] (03PS4) 
10Jbond: systemd: Add new define to manage user service environments [puppet] - 10https://gerrit.wikimedia.org/r/771410 [11:42:38] (03CR) 10Majavah: [C: 04-1] P:environment: Add support for environment.d to zsh and bash (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/771568 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [11:42:59] !log upgrades spicerack on cumin hosts to v2.3.3 [11:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:16] (03CR) 10Ladsgroup: [C: 03+2] idp: Open up orchestrator to cumin host, take II (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771570 (https://phabricator.wikimedia.org/T281249) (owner: 10Ladsgroup) [11:43:20] (03PS1) 10MVernon: codfw-prod: rebalance the rings [software/swift-ring] - 10https://gerrit.wikimedia.org/r/771573 (https://phabricator.wikimedia.org/T303507) [11:43:54] (03CR) 10MVernon: [V: 03+2 C: 03+2] "DIY merge, because routine." [software/swift-ring] - 10https://gerrit.wikimedia.org/r/771573 (https://phabricator.wikimedia.org/T303507) (owner: 10MVernon) [11:45:54] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:04] (03CR) 10David Caro: [C: 03+1] "LGTM nice!" 
[software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/771483 (owner: 10Jbond) [11:47:36] (03PS1) 10Ladsgroup: Revert "idp: Open up orchestrator to cumin host" [puppet] - 10https://gerrit.wikimedia.org/r/771400 [11:47:42] (03Abandoned) 10Ladsgroup: Revert "idp: Open up orchestrator to cumin host" [puppet] - 10https://gerrit.wikimedia.org/r/771400 (owner: 10Ladsgroup) [11:48:08] (03PS1) 10Ladsgroup: Revert "idp: Open up orchestrator to cumin host, take II" [puppet] - 10https://gerrit.wikimedia.org/r/771401 [11:48:15] (03PS2) 10Ladsgroup: Revert "idp: Open up orchestrator to cumin host, take II" [puppet] - 10https://gerrit.wikimedia.org/r/771401 [11:48:19] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "idp: Open up orchestrator to cumin host, take II" [puppet] - 10https://gerrit.wikimedia.org/r/771401 (owner: 10Ladsgroup) [11:49:52] (03PS5) 10Jbond: systemd: Add new define to manage user service environments [puppet] - 10https://gerrit.wikimedia.org/r/771410 [11:50:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P22775 and previous config saved to /var/cache/conftool/dbconfig/20220317-115012-root.json [11:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:38] (03PS1) 10Jbond: P:environment: enable export_systemd_env in cloud [puppet] - 10https://gerrit.wikimedia.org/r/771576 [11:51:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P22776 and previous config saved to /var/cache/conftool/dbconfig/20220317-115156-root.json [11:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:14] 10SRE-swift-storage, 10ops-eqiad, 10decommission-hardware: Decommission ms-fe100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T304064 (10MatthewVernon) [11:53:27] (03PS5) 10Jbond: P:environment: Add 
support for environment.d to zsh and bash [puppet] - 10https://gerrit.wikimedia.org/r/771568 (https://phabricator.wikimedia.org/T278315) [11:53:28] 10SRE-swift-storage: Decommission ms-fe100[5-8] - https://phabricator.wikimedia.org/T303733 (10MatthewVernon) 05Open→03Resolved This is now done, handed to DC team for decommissioning of hardware at T304064. [11:53:58] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/771568 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [11:54:25] (03PS2) 10Jbond: P:environment: enable export_systemd_env in cloud [puppet] - 10https://gerrit.wikimedia.org/r/771576 [11:56:09] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Ladsgroup) So that patch the supposed to open it up, broke it in two di... [11:56:41] (03PS28) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. 
[puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) [11:56:43] (03PS17) 10Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) [11:58:19] (03PS6) 10Jbond: systemd: Add new define to manage user service environments [puppet] - 10https://gerrit.wikimedia.org/r/771410 [12:01:48] (03PS3) 10Jbond: P:java: update profile::java to use systemd::environment [puppet] - 10https://gerrit.wikimedia.org/r/771415 [12:04:42] (03PS3) 10Jbond: P:environment: Add no_proxy values to the default environment [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) [12:07:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P22777 and previous config saved to /var/cache/conftool/dbconfig/20220317-120700-root.json [12:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:40] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for Apache on an-web [puppet] - 10https://gerrit.wikimedia.org/r/767717 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:18:48] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:18:50] (03PS6) 10Jbond: P:environment: Add support for environment.d to zsh and bash [puppet] - 10https://gerrit.wikimedia.org/r/771568 (https://phabricator.wikimedia.org/T278315) [12:19:25] (03PS4) 10Jbond: controller: add supper for multiple statement types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/771483 [12:19:28] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[12:21:07] (03CR) 10Jbond: "tested manually with the following" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/771483 (owner: 10Jbond) [12:24:30] (03PS5) 10Jbond: controller: add support for multiple statement types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/771483 [12:25:04] (03CR) 10Jbond: [C: 03+2] controller: add support for multiple statement types (032 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/771483 (owner: 10Jbond) [12:26:27] (03Merged) 10jenkins-bot: controller: add support for multiple statement types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/771483 (owner: 10Jbond) [12:27:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298557)', diff saved to https://phabricator.wikimedia.org/P22778 and previous config saved to /var/cache/conftool/dbconfig/20220317-122704-marostegui.json [12:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:09] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [12:28:10] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:31] (03PS1) 10Jbond: C:puppet_compiler: bump version to 2.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/771592 [12:28:47] (03CR) 10Jbond: [C: 03+2] C:puppet_compiler: bump version to 2.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/771592 (owner: 10Jbond) [12:32:40] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, and 2 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (10Ottomata) Would prefer to proceed if possible, just want to finish somet... 
[12:40:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34394/console" [puppet] - 10https://gerrit.wikimedia.org/r/770890 (owner: 10Ladsgroup) [12:42:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P22779 and previous config saved to /var/cache/conftool/dbconfig/20220317-124209-marostegui.json [12:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:40] (03CR) 10Btullis: "Looking good." [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi) [12:52:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/771568 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [12:55:53] (03PS1) 104nn1l2: commonswiki: Add pictures.snsb.info to wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771595 (https://phabricator.wikimedia.org/T303929) [12:57:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P22780 and previous config saved to /var/cache/conftool/dbconfig/20220317-125715-marostegui.json [12:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220317T1300). [13:00:05] Lucas_WMDE and nn1l2: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:10] hi [13:00:15] o/ [13:00:20] I can deploy [13:00:27] * urbanecm waves too [13:00:31] but leaves it to Lucas_WMDE :) [13:00:38] ok :) [13:02:15] (03PS2) 10Lucas Werkmeister (WMDE): Write "unexpectedUnconnectedPage" page prop on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768089 [13:03:33] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Write "unexpectedUnconnectedPage" page prop on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768089 (owner: 10Lucas Werkmeister (WMDE)) [13:04:28] (03Merged) 10jenkins-bot: Write "unexpectedUnconnectedPage" page prop on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768089 (owner: 10Lucas Werkmeister (WMDE)) [13:04:58] testing on mwdebug1001 [13:05:31] looks like it’s working, I’ll sync [13:05:52] PROBLEM - puppet last run on ml-serve1001 is CRITICAL: CRITICAL: Puppet has been disabled for 604975 seconds, message: elukey - cni testing, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:06:20] (03CR) 10Vivian Rook: [C: 03+1] openstack: add neutron-rpc-server and ensure up [puppet] - 10https://gerrit.wikimedia.org/r/771569 (https://phabricator.wikimedia.org/T302369) (owner: 10David Caro) [13:07:37] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:768089|Write "unexpectedUnconnectedPage" page prop on Beta]] – no expected behavior change in production (1/3) (duration: 00m 53s) [13:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:39] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:768089|Write "unexpectedUnconnectedPage" page prop on Beta]] – no expected behavior change in production (2/3) (duration: 00m 49s) [13:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:49] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: 
Config: [[gerrit:768089|Write "unexpectedUnconnectedPage" page prop on Beta]] – no expected behavior change in production (3/3) (duration: 00m 49s) [13:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:01] (03PS2) 10Lucas Werkmeister (WMDE): commonswiki: Add pictures.snsb.info to wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771595 (https://phabricator.wikimedia.org/T303929) (owner: 104nn1l2) [13:12:19] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] commonswiki: Add pictures.snsb.info to wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771595 (https://phabricator.wikimedia.org/T303929) (owner: 104nn1l2) [13:12:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298557)', diff saved to https://phabricator.wikimedia.org/P22781 and previous config saved to /var/cache/conftool/dbconfig/20220317-131220-marostegui.json [13:12:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [13:12:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [13:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:25] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [13:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298557)', diff saved to https://phabricator.wikimedia.org/P22782 and previous config saved to /var/cache/conftool/dbconfig/20220317-131227-marostegui.json [13:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:09] (03Merged) 10jenkins-bot: 
commonswiki: Add pictures.snsb.info to wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771595 (https://phabricator.wikimedia.org/T303929) (owner: 104nn1l2) [13:13:41] nn1l2: the change is on mwdebug1001, can you test it? [13:13:47] ok [13:14:37] LGTM: https://commons.wikimedia.org/wiki/File:Test_M-0051848_20030327_180450.jpg [13:14:51] ok, thanks [13:16:13] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:771595|commonswiki: Add pictures.snsb.info to wgCopyUploadsDomains allowlist (T303929)]] (duration: 00m 50s) [13:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:17] T303929: Add pictures.snsb.info to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T303929 [13:17:08] !log UTC afternoon backport window done [13:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:50] (03CR) 10Andrew Bogott: [C: 03+1] openstack: add neutron-rpc-server and ensure up [puppet] - 10https://gerrit.wikimedia.org/r/771569 (https://phabricator.wikimedia.org/T302369) (owner: 10David Caro) [13:18:21] (03CR) 10David Caro: [V: 03+1 C: 03+2] openstack: add neutron-rpc-server and ensure up [puppet] - 10https://gerrit.wikimedia.org/r/771569 (https://phabricator.wikimedia.org/T302369) (owner: 10David Caro) [13:19:12] (03PS29) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [13:20:44] (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [13:31:14] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:34:48] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1016.eqiad.wmnet with OS bullseye [13:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:43] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, and 2 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (10ayounsi) Sounds good, it's quite quick to apply, but first: ` prometheu... [13:43:50] (03CR) 10Ayounsi: [C: 03+1] Add kubernetes1018-1022 as BGP neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/771564 (https://phabricator.wikimedia.org/T293728) (owner: 10Alexandros Kosiaris) [13:43:53] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1016.eqiad.wmnet with OS bullseye [13:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:14] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1016.eqiad.wmnet with OS bullseye [13:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:21] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1016.eqiad.wmnet with OS bullseye [13:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:35] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1016.eqiad.wmnet with OS bullseye [13:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:42] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1016.eqiad.wmnet with OS bullseye [13:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:44] (03PS29) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for 
dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) [13:46:46] (03PS18) 10Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) [13:47:23] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1016.eqiad.wmnet with OS bullseye [13:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:36] (03PS1) 10Alexandros Kosiaris: Add kubernetes1018-1022 [puppet] - 10https://gerrit.wikimedia.org/r/771598 (https://phabricator.wikimedia.org/T293728) [14:02:13] 10SRE, 10Traffic: OCSP staple validity alerts/warnings misfire - https://phabricator.wikimedia.org/T304047 (10Vgutierrez) yep.. it's a threshold issue... OCSP warning is triggered mostly at the very same time that acme-chief fetches a new OCSP response from Let's Encrypt. The alert remains active till puppet r... 
[14:02:16] (03PS1) 10Elukey: Set bullseye + overlay settings for kubernetes10[01][56] nodes [puppet] - 10https://gerrit.wikimedia.org/r/771600 (https://phabricator.wikimedia.org/T300744) [14:02:18] (03PS1) 10Elukey: Set overlay settings for kubernetes1005 [puppet] - 10https://gerrit.wikimedia.org/r/771601 (https://phabricator.wikimedia.org/T300744) [14:02:21] (03PS1) 10Elukey: Set overlay settings for kubernetes1006 [puppet] - 10https://gerrit.wikimedia.org/r/771602 (https://phabricator.wikimedia.org/T300744) [14:02:22] (03PS1) 10Elukey: Set overlay settings for kubernetes1015 [puppet] - 10https://gerrit.wikimedia.org/r/771603 (https://phabricator.wikimedia.org/T300744) [14:02:24] (03PS1) 10Elukey: Set overlay settings for kubernetes1016 [puppet] - 10https://gerrit.wikimedia.org/r/771604 (https://phabricator.wikimedia.org/T300744) [14:03:57] 10SRE, 10ops-eqiad, 10Data-Engineering: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 (10BTullis) Shutting down the two servers now. 
analytics1063 and analytics1067 [14:05:12] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1063.eqiad.wmnet with reason: T303151 [14:05:14] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1063.eqiad.wmnet with reason: T303151 [14:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:16] T303151: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 [14:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:20] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1067.eqiad.wmnet with reason: T303151 [14:05:22] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1067.eqiad.wmnet with reason: T303151 [14:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:58] 10SRE, 10ops-eqiad, 10Data-Engineering: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 (10Cmjohnson) Thanks @btullis [14:06:01] (03CR) 10Alexandros Kosiaris: "netboot.cfg seems to have been updated already in https://gerrit.wikimedia.org/r/c/operations/puppet/+/766588 and doesn't need anything mo" [puppet] - 10https://gerrit.wikimedia.org/r/771598 (https://phabricator.wikimedia.org/T293728) (owner: 10Alexandros Kosiaris) [14:06:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/771564 (https://phabricator.wikimedia.org/T293728) (owner: 10Alexandros Kosiaris) [14:06:49] (03Merged) 10jenkins-bot: Add kubernetes1018-1022 as BGP neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/771564 (https://phabricator.wikimedia.org/T293728) (owner: 10Alexandros Kosiaris) [14:07:48] (03CR) 10Volans: [C: 04-1] "It surely looks much nicer and organized. I've found a bug and have some other minor comments, see inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [14:08:15] (03PS1) 10Ayounsi: Add filters to lsw1-e/f1 [homer/public] - 10https://gerrit.wikimedia.org/r/771607 [14:09:01] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban: Increase max.incremental.fetch.session.cache.slots on Kafka jumbo eqiad - https://phabricator.wikimedia.org/T303324 (10Ottomata) Hm, it flattened out at the max of 2K again, and we are seeing some evictions. I suppose in hindsight this makes sense on j... [14:10:11] (03CR) 10Elukey: [C: 04-1] "kubernetes1017 is probably not meant to be reimaged, in case netboot needs to be updated afaics (only 1018+ have the new recipe)." [puppet] - 10https://gerrit.wikimedia.org/r/771598 (https://phabricator.wikimedia.org/T293728) (owner: 10Alexandros Kosiaris) [14:11:02] (03CR) 10Ayounsi: [C: 04-1] "I created I0264b3210e40e55ec2ee672c5a2d0e986b02b522 based on this CR.
I think we should apply a loopback filter instead and not an interfa" [homer/public] - 10https://gerrit.wikimedia.org/r/771461 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [14:11:09] (03CR) 10Ayounsi: [C: 03+2] Add filters to lsw1-e/f1 [homer/public] - 10https://gerrit.wikimedia.org/r/771607 (owner: 10Ayounsi) [14:11:50] (03Merged) 10jenkins-bot: Add filters to lsw1-e/f1 [homer/public] - 10https://gerrit.wikimedia.org/r/771607 (owner: 10Ayounsi) [14:11:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298557)', diff saved to https://phabricator.wikimedia.org/P22783 and previous config saved to /var/cache/conftool/dbconfig/20220317-141152-marostegui.json [14:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:58] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [14:13:21] (03CR) 10JMeybohm: [C: 03+1] Set bullseye + overlay settings for kubernetes10[01][56] nodes [puppet] - 10https://gerrit.wikimedia.org/r/771600 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [14:13:28] (03CR) 10Ayounsi: "Confirmed it's now a NOOP on homer runs." 
[homer/public] - 10https://gerrit.wikimedia.org/r/771607 (owner: 10Ayounsi) [14:13:38] (03PS2) 10Elukey: Set bullseye + overlay settings for kubernetes10[01][56] nodes [puppet] - 10https://gerrit.wikimedia.org/r/771600 (https://phabricator.wikimedia.org/T300744) [14:13:41] (03PS2) 10Elukey: Set overlay settings for kubernetes1005 [puppet] - 10https://gerrit.wikimedia.org/r/771601 (https://phabricator.wikimedia.org/T300744) [14:13:42] (03PS2) 10Elukey: Set overlay settings for kubernetes1006 [puppet] - 10https://gerrit.wikimedia.org/r/771602 (https://phabricator.wikimedia.org/T300744) [14:13:45] (03PS2) 10Elukey: Set overlay settings for kubernetes1015 [puppet] - 10https://gerrit.wikimedia.org/r/771603 (https://phabricator.wikimedia.org/T300744) [14:13:47] (03PS2) 10Elukey: Set overlay settings for kubernetes1016 [puppet] - 10https://gerrit.wikimedia.org/r/771604 (https://phabricator.wikimedia.org/T300744) [14:13:53] (03CR) 10JMeybohm: [C: 03+1] Set overlay settings for kubernetes1005 [puppet] - 10https://gerrit.wikimedia.org/r/771601 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [14:14:05] (03CR) 10JMeybohm: [C: 03+1] Set overlay settings for kubernetes1006 [puppet] - 10https://gerrit.wikimedia.org/r/771602 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [14:14:18] (03CR) 10JMeybohm: [C: 03+1] Set overlay settings for kubernetes1015 [puppet] - 10https://gerrit.wikimedia.org/r/771603 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [14:14:32] (03CR) 10JMeybohm: [C: 03+1] Set overlay settings for kubernetes1016 [puppet] - 10https://gerrit.wikimedia.org/r/771604 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [14:14:46] (03CR) 10JMeybohm: [C: 03+1] Set bullseye + overlay settings for kubernetes10[01][56] nodes [puppet] - 10https://gerrit.wikimedia.org/r/771600 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [14:14:54] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 
10Infrastructure-Foundations, and 2 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (10Ottomata) a:05Ottomata→03None [14:15:42] !next [14:15:55] jouncebot next [14:15:55] In 1 hour(s) and 44 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220317T1600) [14:16:06] (03PS1) 10Ssingh: P:icinga: add profile for performance tweaking [puppet] - 10https://gerrit.wikimedia.org/r/771610 [14:18:42] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, and 2 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (10fgiunchedi) For sure, these are the IPs that the pushgateway could point... [14:18:46] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34397/console" [puppet] - 10https://gerrit.wikimedia.org/r/771610 (owner: 10Ssingh) [14:19:53] (03PS2) 10Ssingh: P:icinga: add profile for performance tweaking [puppet] - 10https://gerrit.wikimedia.org/r/771610 [14:20:56] (03CR) 10Ssingh: "PCC is valid; the current patchset has just the commit message updated." [puppet] - 10https://gerrit.wikimedia.org/r/771610 (owner: 10Ssingh) [14:21:14] (03CR) 10Filippo Giunchedi: [C: 03+1] "Idea LGTM (can't meaningfully comment whether this will work or not, but it should!)" [puppet] - 10https://gerrit.wikimedia.org/r/771551 (https://phabricator.wikimedia.org/T95705) (owner: 10Jcrespo) [14:21:38] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10MoritzMuehlenhoff) >>! In T297913#7782842, @RobH wrote: > @MoritzMuehlenhoff , do you happen to know a fix for the above partman issue or do I need to escalate to a wider part of SRE? I don't think ne...
[14:21:58] PROBLEM - SSH on kubernetes2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:22:22] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:22:47] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10MoritzMuehlenhoff) >>! In T297913#7782921, @RobH wrote: > Echo of my testing so far: > > setting the drive info via show and setting it to on or offline works, but not setting to missing or sending rebu... [14:25:25] (03CR) 10Alexandros Kosiaris: Add kubernetes1018-1022 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771598 (https://phabricator.wikimedia.org/T293728) (owner: 10Alexandros Kosiaris) [14:25:49] (03PS1) 10Ayounsi: Allow traffic from Analytics to prometheus hosts [homer/public] - 10https://gerrit.wikimedia.org/r/771612 (https://phabricator.wikimedia.org/T304001) [14:26:06] (03PS2) 10Alexandros Kosiaris: Add kubernetes1018-1022 [puppet] - 10https://gerrit.wikimedia.org/r/771598 (https://phabricator.wikimedia.org/T293728) [14:26:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P22784 and previous config saved to /var/cache/conftool/dbconfig/20220317-142658-marostegui.json [14:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:20] (03CR) 10Ayounsi: [C: 03+2] Allow traffic from Analytics to prometheus hosts [homer/public] - 10https://gerrit.wikimedia.org/r/771612 (https://phabricator.wikimedia.org/T304001) (owner: 10Ayounsi) [14:27:52] (03Merged) 10jenkins-bot: Allow traffic from Analytics to prometheus hosts [homer/public] - 10https://gerrit.wikimedia.org/r/771612 (https://phabricator.wikimedia.org/T304001) (owner: 10Ayounsi) [14:31:54] 10SRE, 
10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, and 3 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (10ayounsi) 05Open→03Resolved a:03ayounsi All done. [14:32:41] 10SRE, 10Continuous-Integration-Infrastructure: jenkins / zuul backing up due to jenkins slaves down - https://phabricator.wikimedia.org/T216039 (10hashar) [14:32:58] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:34:51] (03PS1) 10Ottomata: gobblin - Use new gobblin-wmf-core jar in hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/771614 (https://phabricator.wikimedia.org/T297939) [14:36:15] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Data-Engineering-Kanban: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 (10BTullis) p:05Triage→03Medium [14:37:48] (03PS2) 10Ottomata: gobblin - Use new gobblin-wmf-core jar in hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/771614 (https://phabricator.wikimedia.org/T297939) [14:38:18] (03PS3) 10Ottomata: gobblin - Use new gobblin-wmf-core jar in hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/771614 (https://phabricator.wikimedia.org/T297939) [14:39:30] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34398/console" [puppet] - 10https://gerrit.wikimedia.org/r/771614 (https://phabricator.wikimedia.org/T297939) (owner: 10Ottomata) [14:39:53] (03CR) 10Joal: [C: 04-1] "Copy/paste error (I think)" [puppet] - 10https://gerrit.wikimedia.org/r/771614 (https://phabricator.wikimedia.org/T297939) (owner: 10Ottomata) [14:41:25] Can anyone remember exactly how SiteConfiguration sets $lang for commonswiki to be 'en'? I used to know this but clearly have forgotten. 
[14:42:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P22785 and previous config saved to /var/cache/conftool/dbconfig/20220317-144203-marostegui.json [14:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:33] (03PS4) 10Ottomata: gobblin - Use new gobblin-wmf-core jar in hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/771614 (https://phabricator.wikimedia.org/T297939) [14:43:23] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Data-Engineering-Kanban: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 (10Cmjohnson) 05Open→03Resolved I am able to get into the idrac for both servers, it does take a little longer than normal. I... [14:43:55] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34399/console" [puppet] - 10https://gerrit.wikimedia.org/r/771614 (https://phabricator.wikimedia.org/T297939) (owner: 10Ottomata) [14:44:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: move cloudcephmon1003.eqiad.wmnet from rack B2 to rack C8 - https://phabricator.wikimedia.org/T303058 (10Cmjohnson) @dcaro I want to do this now, will that be okay? [14:46:15] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1016.eqiad.wmnet with OS bullseye [14:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:59] (03CR) 10Elukey: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/771598 (https://phabricator.wikimedia.org/T293728) (owner: 10Alexandros Kosiaris) [14:47:41] (03PS3) 10MSantos: WIP: introduce geoshapes service [deployment-charts] - 10https://gerrit.wikimedia.org/r/768678 [14:48:30] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban: Increase max.incremental.fetch.session.cache.slots on Kafka jumbo eqiad - https://phabricator.wikimedia.org/T303324 (10JAllemandou) I'm not sure either @Ottomata - I have seen warnings about cache-session evictions, and thought it could be related. My la... [14:48:39] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Data-Engineering-Kanban: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 (10BTullis) Thanks @Cmjohnson - I guess we'll just keep monitoring for stability and reopen this ticket if it keeps happening. [14:48:54] (03CR) 10jerkins-bot: [V: 04-1] WIP: introduce geoshapes service [deployment-charts] - 10https://gerrit.wikimedia.org/r/768678 (owner: 10MSantos) [14:50:37] (03CR) 10Ottomata: [V: 03+1 C: 03+2] gobblin - Use new gobblin-wmf-core jar in hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/771614 (https://phabricator.wikimedia.org/T297939) (owner: 10Ottomata) [14:56:09] (03PS4) 10Bking: elasticsearch: remove custom restart handling [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) [14:56:24] (03CR) 10JHathaway: [C: 03+1] "looks good to me" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/771483 (owner: 10Jbond) [14:56:48] (03CR) 10Bking: elasticsearch: remove custom restart handling (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [14:57:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298557)', diff saved to https://phabricator.wikimedia.org/P22788 and previous config saved to 
/var/cache/conftool/dbconfig/20220317-145708-marostegui.json [14:57:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [14:57:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [14:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:13] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [14:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298557)', diff saved to https://phabricator.wikimedia.org/P22789 and previous config saved to /var/cache/conftool/dbconfig/20220317-145716-marostegui.json [14:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/771598 (https://phabricator.wikimedia.org/T293728) (owner: 10Alexandros Kosiaris) [14:57:51] (03CR) 10Jcrespo: [C: 03+1] "I will deploy, and observe the change applies as documented (not the first time documentation != reality for bacula), then resolve the ti" [puppet] - 10https://gerrit.wikimedia.org/r/771551 (https://phabricator.wikimedia.org/T95705) (owner: 10Jcrespo) [14:59:35] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: remove custom restart handling [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [15:00:11] (03PS5) 10Jforrester: Let wikifunctions.org use the Graph system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795 [15:00:13] (03PS1) 10Jforrester: [BETA CLUSTER] Add wikifunctions to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771619 (https://phabricator.wikimedia.org/T300911) [15:00:15] (03PS1) 10Jforrester: Allow wikifunctions.org URLs to be used in the URL Shortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771620 [15:00:17] (03PS1) 10Jforrester: Allow wikifunctions.org to use the CAPTCHA system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771621 [15:00:19] (03PS1) 10Jforrester: Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) [15:00:21] (03PS1) 10Jforrester: [DNM] Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945) [15:00:23] (03PS1) 10Jforrester: Add wikifunctions.org to foundationwiki's custom CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771624 [15:00:26] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1018.eqiad.wmnet with OS bullseye [15:00:28] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:44] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1019.eqiad.wmnet with OS bullseye [15:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:12] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1158 - https://phabricator.wikimedia.org/T303910 (10Cmjohnson) Disk Order has been submitted with Dell You have successfully submitted request SR1087559166. [15:04:58] (03PS5) 10Bking: elasticsearch: remove custom restart handling [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) [15:06:54] James_F, I always thought $lang for commonswiki is just 'commons' and then in IS wgLanguageCode gets set to en due to commonswiki being in special.dblist [15:07:06] !log disable BGP to Telia in codfw for fiber move - T289241 [15:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:40] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: remove custom restart handling [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [15:08:24] (03CR) 10Jbond: [C: 03+1] "lgtm minor comment/question" [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [15:11:06] zabe: Hmm. Maybe? [15:11:21] That would make sense, actually, yeah. I'll dig some more. 
[15:13:04] (03PS8) 10DCausse: [wdqs] test jvmquake options on the public cluster [puppet] - 10https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) [15:13:57] (03CR) 10jerkins-bot: [V: 04-1] [wdqs] test jvmquake options on the public cluster [puppet] - 10https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) (owner: 10DCausse) [15:16:10] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:17:26] (KubernetesCalicoDown) firing: kubernetes1021.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:18:00] (03PS6) 10Bking: elasticsearch: remove custom restart handling [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) [15:18:26] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1021:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [15:18:39] (03PS1) 10Ottomata: gobblin - Revert gobblin jar on hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/771625 (https://phabricator.wikimedia.org/T297939) [15:18:54] (03PS2) 10Ottomata: gobblin - Revert gobblin jar on hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/771625 (https://phabricator.wikimedia.org/T297939) [15:18:57] (03PS6) 10MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [15:18:59] (03CR) 10jerkins-bot: [V: 04-1] gobblin - Revert gobblin jar on hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/771625 (https://phabricator.wikimedia.org/T297939) (owner: 10Ottomata) [15:19:26] (KubernetesRsyslogDown) firing: rsyslog on 
kubernetes1021:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [15:19:41] (03CR) 10jerkins-bot: [V: 04-1] swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [15:20:08] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv6: Idle - Telia, AS1299/IPv4: Idle - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:20:29] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: remove custom restart handling [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [15:23:24] (03PS7) 10MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [15:24:04] (03CR) 10jerkins-bot: [V: 04-1] swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [15:25:10] jouncebot: !nowandnext [15:25:38] jouncebot: now [15:25:38] No deployments scheduled for the next 0 hour(s) and 34 minute(s) [15:25:42] jouncebot: next [15:25:42] In 0 hour(s) and 34 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220317T1600) [15:25:51] OK, I'll sling out a couple of BC patches. 
[15:25:58] (03PS2) 10Jforrester: Allow wikifunctions.org URLs to be used in the URL Shortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771620 [15:26:00] (03PS2) 10Jforrester: Allow wikifunctions.org to use the CAPTCHA system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771621 [15:26:04] (03PS2) 10Jforrester: Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) [15:26:06] (03PS2) 10Jforrester: [DNM] Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945) [15:26:08] (03PS2) 10Jforrester: Add wikifunctions.org to foundationwiki's custom CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771624 [15:26:10] (03PS6) 10Jforrester: Let wikifunctions.org use the Graph system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795 [15:26:12] (03PS1) 10Jforrester: [BETA CLUSTER] Set wikifunctions's wgLanguageCode as it's not in special yet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771627 (https://phabricator.wikimedia.org/T297329) [15:26:14] (03CR) 10Jforrester: [C: 03+2] [BETA CLUSTER] Add wikifunctions to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771619 (https://phabricator.wikimedia.org/T300911) (owner: 10Jforrester) [15:27:56] (03Merged) 10jenkins-bot: [BETA CLUSTER] Add wikifunctions to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771619 (https://phabricator.wikimedia.org/T300911) (owner: 10Jforrester) [15:28:08] (03CR) 10Jforrester: [C: 03+2] [BETA CLUSTER] Set wikifunctions's wgLanguageCode as it's not in special yet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771627 (https://phabricator.wikimedia.org/T297329) (owner: 10Jforrester) [15:28:26] (KubernetesRsyslogDown) firing: (2) rsyslog on kubernetes1021:9105 is 
missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [15:28:27] (03PS8) 10MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [15:28:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10Cmjohnson) @akosiaris I can spread the other 3 between B and D if that works better for you? [15:28:49] (03Merged) 10jenkins-bot: [BETA CLUSTER] Set wikifunctions's wgLanguageCode as it's not in special yet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771627 (https://phabricator.wikimedia.org/T297329) (owner: 10Jforrester) [15:29:02] (03CR) 10jerkins-bot: [V: 04-1] swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [15:30:47] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1020.eqiad.wmnet with OS bullseye [15:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:02] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1021.eqiad.wmnet with OS bullseye [15:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: move cloudcephmon1003.eqiad.wmnet from rack B2 to rack C8 - https://phabricator.wikimedia.org/T303058 (10dcaro) @Cmjohnson yes, thanks! 
[15:31:49] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1022.eqiad.wmnet with OS bullseye [15:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:00] (03PS9) 10MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [15:32:21] (03CR) 10Ottomata: [C: 03+2] gobblin - Revert gobblin jar on hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/771625 (https://phabricator.wikimedia.org/T297939) (owner: 10Ottomata) [15:33:32] (03PS30) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) [15:33:54] (03CR) 10jerkins-bot: [V: 04-1] swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [15:34:28] !log restarting FPM on mw canaries [15:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:37] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:35:34] (03PS1) 10Majavah: Fixes to run on bullseye [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/771628 (https://phabricator.wikimedia.org/T302178) [15:37:26] (KubernetesCalicoDown) firing: (3) kubernetes1020.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:37:57] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:41:17] PROBLEM - Host cloudcephmon1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:42:29] !log cmjohnson@cumin1001 START 
- Cookbook sre.dns.netbox [15:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:33] (03PS1) 10Urbanecm: ptwiki: Disable Growth's image recommendation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771629 (https://phabricator.wikimedia.org/T302828) [15:44:41] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:45:26] (03CR) 10Herron: [C: 03+2] watchrat: require 3+ sites to agree on error status before alerting [alerts] - 10https://gerrit.wikimedia.org/r/771009 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [15:45:31] (03PS3) 10Herron: watchrat: require 3+ sites to agree on error status before alerting [alerts] - 10https://gerrit.wikimedia.org/r/771009 (https://phabricator.wikimedia.org/T299147) [15:46:25] (03Abandoned) 10Bernard Wang: Fix updateUserLinksDropdownItems not being called [skins/Vector] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771395 (https://phabricator.wikimedia.org/T304002) (owner: 10Jdlrobson) [15:46:28] !log cr1-codfw move xe-5/2/0 to xe-1/0/1:1 - T289241 [15:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:26] (KubernetesCalicoDown) firing: (3) kubernetes1020.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:47:36] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10BBlack) 05Open→03Resolved With the addition of the drmrs to the dns config in https://gerrit.wikimedia.org/r/c/operations/dns/+/771342 we're basically done 
w... [15:47:41] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: drmrs: primary software task - https://phabricator.wikimedia.org/T282788 (10BBlack) [15:47:50] (03PS4) 10Huji: Increase AbuseFilter's emergency disable threshold for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763982 (https://phabricator.wikimedia.org/T302227) [15:48:37] jouncebot: nowandnext [15:48:37] No deployments scheduled for the next 0 hour(s) and 11 minute(s) [15:48:38] In 0 hour(s) and 11 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220317T1600) [15:49:16] urbanecm: feel free to dip into the puppet window if you need to :) [15:49:42] thank you -- I've an urgent change to make, just discussing with other Growth folks about what exactly disabling means (to not do more harm than we fix). [15:49:48] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:03] (03CR) 10Urbanecm: [C: 03+2] ptwiki: Disable Growth's image recommendation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771629 (https://phabricator.wikimedia.org/T302828) (owner: 10Urbanecm) [15:50:51] (03Merged) 10jenkins-bot: ptwiki: Disable Growth's image recommendation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771629 (https://phabricator.wikimedia.org/T302828) (owner: 10Urbanecm) [15:52:26] (KubernetesCalicoDown) firing: (3) kubernetes1020.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:52:38] RECOVERY - SSH on kubernetes2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:52:50] (03CR) 10Herron: watchrat: require 3+ sites to agree on error status before alerting [alerts] - 10https://gerrit.wikimedia.org/r/771009 (https://phabricator.wikimedia.org/T299147) (owner: 
10Herron) [15:52:56] (03CR) 10Herron: [C: 03+2] watchrat: require 3+ sites to agree on error status before alerting [alerts] - 10https://gerrit.wikimedia.org/r/771009 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [15:54:32] (03PS1) 10Giuseppe Lavagetto: conftool-data: add a rule for cache-upload [puppet] - 10https://gerrit.wikimedia.org/r/771630 [15:55:12] (03Merged) 10jenkins-bot: watchrat: require 3+ sites to agree on error status before alerting [alerts] - 10https://gerrit.wikimedia.org/r/771009 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [15:55:48] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:56:22] (03Abandoned) 10Giuseppe Lavagetto: conftool-data: add a rule for cache-upload [puppet] - 10https://gerrit.wikimedia.org/r/771630 (owner: 10Giuseppe Lavagetto) [15:57:13] (03CR) 10Jelto: [C: 03+1] "looks good. Thank you for the heads-up!" [puppet] - 10https://gerrit.wikimedia.org/r/771362 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:57:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298557)', diff saved to https://phabricator.wikimedia.org/P22790 and previous config saved to /var/cache/conftool/dbconfig/20220317-155713-marostegui.json [15:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:18] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [15:57:26] (KubernetesCalicoDown) firing: (3) kubernetes1020.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:57:29] (03CR) 10Andrew Bogott: [C: 03+1] "thank you taavi!" 
[debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/771628 (https://phabricator.wikimedia.org/T302178) (owner: 10Majavah) [15:57:33] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 60980ce85c080fadaf0b2cb561be53f861ca94e0: ptwiki: Disable Growth image recommendation (T302828) (duration: 00m 53s) [15:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:37] T302828: Scale: deploy "add an image" to pt, fa, fr, tr - https://phabricator.wikimedia.org/T302828 [15:57:40] (03CR) 10David Caro: "Did you try this?" [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/771628 (https://phabricator.wikimedia.org/T302178) (owner: 10Majavah) [15:57:47] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34400/console" [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [15:58:33] (03CR) 10Cwhite: [C: 03+2] opensearch: use separate rundir per instance [puppet] - 10https://gerrit.wikimedia.org/r/767607 (https://phabricator.wikimedia.org/T276198) (owner: 10Cwhite) [16:00:04] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220317T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:00:29] no puppet window today, urbanecm still has the floor [16:00:37] !log restarting apache on logstash* [16:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:41] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED [16:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:48] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:58] * urbanecm gives up the floor, we (Growth) are satisfied the issue was fixed now [16:01:13] well, no puppet window anyway :D [16:01:53] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] varnish/frontend: consume etcd data for dynamic banning of requests. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [16:01:53] unless anyone comes bursting through the door at the last minute, out of breath from running all the way here, dramatically clutching an unmerged puppet patch [16:02:01] (03CR) 10David Caro: "I think the tests failed because of:" [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/771628 (https://phabricator.wikimedia.org/T302178) (owner: 10Majavah) [16:02:20] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (10BBlack) [16:02:24] (03PS2) 10BBlack: GeoDNS Cyprus to drmrs [dns] - 10https://gerrit.wikimedia.org/r/771354 (https://phabricator.wikimedia.org/T304089) (owner: 10Ayounsi) [16:02:26] (03PS1) 10BBlack: geodns: remove geo-maps-esams-offline hack [dns] - 10https://gerrit.wikimedia.org/r/771631 (https://phabricator.wikimedia.org/T304089) [16:02:26] (KubernetesCalicoDown) resolved: (3) kubernetes1020.eqiad.wmnet:9091 is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [16:02:28] (03PS1) 10BBlack: geodns: add drmrs fallback for esams to whole map [dns] - 10https://gerrit.wikimedia.org/r/771632 (https://phabricator.wikimedia.org/T304089) [16:03:03] !log [WDQS] Pooled `wdqs2003` (caught up on lag) [16:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:09] !log [WDQS] `ryankemper@wdqs2001:~$ sudo systemctl restart wdqs-blazegraph.service` [16:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:11] (03PS1) 10Jelto: gitlab_runner: remove duplicate ferm rule for AAAA [puppet] - 10https://gerrit.wikimedia.org/r/771633 (https://phabricator.wikimedia.org/T295481) [16:04:35] !log [WDQS] Depooled `wdqs2001` (~4.85 hours of lag to catch up) [16:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:47] (03CR) 10Jelto: gitlab_runner: restrict docker traffic with additional ferm rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769968 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:05:08] !log oblivian@puppetmaster1001 conftool action : edit; selector: name=random_q [16:05:08] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:11] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34401/console" [puppet] - 10https://gerrit.wikimedia.org/r/771633 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:06:50] (03CR) 10Ahmon Dancy: static.php: Fold "current" handling into "nohash" and extend TTL to 1y (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771357 
(https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [16:07:26] (03PS2) 10Jelto: gitlab_runner: remove duplicate ferm rule for AAAA [puppet] - 10https://gerrit.wikimedia.org/r/771633 (https://phabricator.wikimedia.org/T295481) [16:08:56] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34402/console" [puppet] - 10https://gerrit.wikimedia.org/r/771633 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:10:41] !log pfw3-codfw move traffic to cr2 uplink [16:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:44] (03CR) 10Volans: "question inline" [dns] - 10https://gerrit.wikimedia.org/r/771632 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [16:11:49] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED [16:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P22792 and previous config saved to /var/cache/conftool/dbconfig/20220317-161218-marostegui.json [16:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:00] (03CR) 10BBlack: geodns: add drmrs fallback for esams to whole map (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/771632 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [16:15:00] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED [16:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:28] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban: Increase max.incremental.fetch.session.cache.slots on Kafka jumbo eqiad - https://phabricator.wikimedia.org/T303324 (10Ottomata) Okay, no prob! 
I'm inclined to just keep slots set at 2000; it is probably nice to have a little more room in jumbo anyway?... [16:15:32] (03Restored) 10Bernard Wang: Fix updateUserLinksDropdownItems not being called [skins/Vector] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771395 (https://phabricator.wikimedia.org/T304002) (owner: 10Jdlrobson) [16:16:03] 10SRE, 10Traffic, 10PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (10Aklapper) Half a year later, is there more to do here? Anything I could (maybe) help with? [16:16:34] (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/771389 (owner: 10Milimetric) [16:17:57] (03PS2) 10Krinkle: static.php: Fold "current" handling into "nohash" and extend TTL to 1y [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771357 (https://phabricator.wikimedia.org/T302465) [16:24:06] 10SRE, 10Traffic, 10PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (10BBlack) Lots left to do here, we've just been pummeled by several layers of ever-increasing high-priority things that take precedence over each other. What we're blocked on here is making time to do the tri... [16:26:41] 10SRE, 10ops-eqiad, 10DC-Ops: Q3: install 2 new HDD into centrallog1001 - https://phabricator.wikimedia.org/T302437 (10herron) [16:26:45] 10SRE, 10ops-eqiad, 10DC-Ops: Q3: install 2 new HDD into centrallog1001 - https://phabricator.wikimedia.org/T302437 (10herron) 05Open→03Resolved Disks have been added and the volume group on the host has been grown. Thanks @Jclark-ctr! 
[16:27:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P22793 and previous config saved to /var/cache/conftool/dbconfig/20220317-162723-marostegui.json [16:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:07] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED [16:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:26] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:31:22] !log sudo service networking restart on puppetmaster1003 [16:31:22] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:58] PROBLEM - Host puppetmaster1003 is DOWN: PING CRITICAL - Packet loss = 100% [16:33:34] RECOVERY - Host puppetmaster1003 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [16:34:06] !log [WDQS] Pooled `wdqs2001` (caught up on lag) [16:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:12] (03PS22) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [16:35:19] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache cloudcephmon1003.eqiad.wmnet on all recursors [16:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:22] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
cloudcephmon1003.eqiad.wmnet on all recursors [16:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:28] !log dcaro@cumin1001 START - Cookbook sre.dns.wipe-cache cloudcephmon1003.eqiad.wmnet on all recursors [16:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:31] !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephmon1003.eqiad.wmnet on all recursors [16:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:12] !log restarting LDAP replicas for openssl update [16:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Fixes to run on bullseye [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/771628 (https://phabricator.wikimedia.org/T302178) (owner: 10Majavah) [16:37:50] (03CR) 10jerkins-bot: [V: 04-1] Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [16:37:55] (03CR) 10Razzi: Add cookbooks for running maintain-views (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [16:39:46] (03PS23) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [16:40:00] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:42:23] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye [16:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:28] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye [16:42:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298557)', diff saved to https://phabricator.wikimedia.org/P22794 and previous config saved to /var/cache/conftool/dbconfig/20220317-164228-marostegui.json [16:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:33] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [16:43:15] (03CR) 10jerkins-bot: [V: 04-1] Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [16:46:50] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:46:58] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:47:20] (03PS24) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [16:48:27] (03CR) 10JMeybohm: [C: 03+1] kubernetes: Upgrade default envoy version to 1.18.3 [puppet] - 10https://gerrit.wikimedia.org/r/771053 (https://phabricator.wikimedia.org/T300324) (owner: 10RLazarus) [16:48:55] !log disable BGP to Lumen in codfw for fiber move [16:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:19] (03PS19) 10Giuseppe Lavagetto: varnish: enable 
dynamic bans on one host per cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) [16:49:32] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:50:14] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [16:51:59] (03PS1) 10Jbond: idp: Open up orchestrator to cumin host, take III [puppet] - 10https://gerrit.wikimedia.org/r/771642 (https://phabricator.wikimedia.org/T281249) [16:52:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34403/console" [puppet] - 10https://gerrit.wikimedia.org/r/771642 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond) [16:53:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) dumpsdata1006 is now ready for install for partman testing, but its failing dhcp. I see it hit dhcp on install1003 and send back info, but the host t... [17:00:14] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10jbond) @Ladsgroup i think we have two separate issues here. the first... 
[17:00:30] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:33] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:01] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:22] PROBLEM - Confd vcl based reload on cp5006 is CRITICAL: reload-vcl failed to run since 0h, 8 minutes. https://wikitech.wikimedia.org/wiki/Varnish [17:02:50] PROBLEM - Confd vcl based reload on cp5011 is CRITICAL: reload-vcl failed to run since 0h, 8 minutes. https://wikitech.wikimedia.org/wiki/Varnish [17:03:28] !log restart atftp on install1003 [17:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:27] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1018.eqiad.wmnet with OS bullseye [17:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:52] (03CR) 10BBlack: [C: 03+2] GeoDNS Cyprus to drmrs [dns] - 10https://gerrit.wikimedia.org/r/771354 (https://phabricator.wikimedia.org/T304089) (owner: 10Ayounsi) [17:05:19] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1018.eqiad.wmnet with OS bullseye [17:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:24] RECOVERY - TFTP service on install1003 is OK: PROCS OK: 1 process with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [17:06:05] (03PS2) 10Cwhite: logstash: re-enable service restart on config changes [puppet] - 10https://gerrit.wikimedia.org/r/767836 (https://phabricator.wikimedia.org/T254533) [17:06:12] RECOVERY - Check 
systemd state on install1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:23] !log geodns - Cyprus routed to new drmrs edge DC (first live users!) - will phase in over the standard 10 minute DNS TTL [17:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:57] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1019.eqiad.wmnet with OS bullseye [17:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:14] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1019.eqiad.wmnet with OS bullseye [17:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:15] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1021.eqiad.wmnet with OS bullseye [17:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:34] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1021.eqiad.wmnet with OS bullseye [17:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:47] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1022.eqiad.wmnet with OS bullseye [17:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:00] RECOVERY - Host cloudcephmon1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [17:09:00] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1022.eqiad.wmnet with OS bullseye [17:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:36] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1020.eqiad.wmnet with OS bullseye [17:09:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:49] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1020.eqiad.wmnet with OS bullseye [17:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:52] (03CR) 10Cwhite: [C: 03+2] logstash: re-enable service restart on config changes [puppet] - 10https://gerrit.wikimedia.org/r/767836 (https://phabricator.wikimedia.org/T254533) (owner: 10Cwhite) [17:10:16] (03CR) 10Ottomata: [C: 03+2] Revert "Temporarily disable traffic data purge" [puppet] - 10https://gerrit.wikimedia.org/r/771389 (owner: 10Milimetric) [17:10:32] (03CR) 10Jcrespo: "You can ignore then my mysql_root_clients suggestion- that could still be used by mysql grants/firewalls, but use something else for web a" [puppet] - 10https://gerrit.wikimedia.org/r/771642 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond) [17:10:34] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dumpsdata1006.eqiad.wmnet with OS bullseye [17:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:45] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1006 (**FAIL**) - Removed from Puppet and PuppetD... 
[17:11:15] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:23] (03CR) 10Bking: [C: 03+2] team-search-platform: relax RdfStreamingUpdaterFlinkProcessingLatencyIsHigh [alerts] - 10https://gerrit.wikimedia.org/r/770982 (owner: 10DCausse) [17:12:17] jouncebot nowandnext [17:12:17] No deployments scheduled for the next 0 hour(s) and 47 minute(s) [17:12:17] In 0 hour(s) and 47 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220317T1800) [17:12:28] I'm retesting image building. [17:12:32] Should be quick [17:15:05] !log dancy@deploy1002 Synchronized README: testing mediawiki image build (duration: 02m 11s) [17:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:03] (03CR) 10Jbond: "just added some comments from the irc chat for prosperity" [puppet] - 10https://gerrit.wikimedia.org/r/763611 (https://phabricator.wikimedia.org/T265138) (owner: 10JHathaway) [17:16:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:58] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1018.eqiad.wmnet with reason: host reimage [17:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:29] (03CR) 10Bking: [C: 03+2] [wdqs] cleanup the updater setup logic [puppet] - 10https://gerrit.wikimedia.org/r/770951 (https://phabricator.wikimedia.org/T301108) (owner: 10DCausse) [17:18:03] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye [17:18:05] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:07] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:18:09] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye [17:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:56] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1019.eqiad.wmnet with reason: host reimage [17:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:03] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1021.eqiad.wmnet with reason: host reimage [17:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:20] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1018.eqiad.wmnet with reason: host reimage [17:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:25] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/771633 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [17:20:27] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1022.eqiad.wmnet with reason: host reimage [17:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:48] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:13] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:17] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply 
[17:21:18] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1020.eqiad.wmnet with reason: host reimage [17:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:25] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:28] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:20] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1020.eqiad.wmnet with reason: host reimage [17:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:01] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1019.eqiad.wmnet with reason: host reimage [17:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:30] (03CR) 10Jbond: [C: 03+2] P:environment: Add support for environment.d to zsh and bash [puppet] - 10https://gerrit.wikimedia.org/r/771568 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [17:24:03] (03CR) 10Jbond: [C: 03+2] systemd: Add new define to manage user service environments [puppet] - 10https://gerrit.wikimedia.org/r/771410 (owner: 10Jbond) [17:24:23] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:27] (03CR) 10Jbond: [C: 03+2] systemd: Add new define to manage user service environments (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/771410 (owner: 10Jbond) [17:24:44] (03PS7) 10Jbond: systemd: Add new define to manage user service environments [puppet] - 10https://gerrit.wikimedia.org/r/771410 [17:24:56] (03CR) 10Jbond: [V: 03+2 C: 03+2] systemd: Add new define to manage user service environments [puppet] - 10https://gerrit.wikimedia.org/r/771410 (owner: 10Jbond) [17:24:58] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:02] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:25:03] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1021.eqiad.wmnet with reason: host reimage [17:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:10] (03CR) 10Bking: [C: 03+2] cirrus: Reenable saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/771076 (https://phabricator.wikimedia.org/T302733) (owner: 10Ebernhardson) [17:25:39] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1022.eqiad.wmnet with reason: host reimage [17:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:59] inflatador: happy for me to merge your change [17:26:14] jbond thank you sir, was just about to ask ;) [17:27:15] inflatador: done [17:27:32] cool, it's the weekly "merge puppet patches for devs" mtg [17:27:35] (03CR) 10Bking: [C: 03+2] prometheus: restart elastic exporter on code change [puppet] - 10https://gerrit.wikimedia.org/r/769783 (owner: 10Ebernhardson) [17:27:47] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:08] !log dancy@deploy1002 
helmfile [eqiad] START helmfile.d/services/mwdebug: sync [17:28:10] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync [17:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:27] inflatador: fyi you dont need require if you have subscirbe [17:28:30] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host dumpsdata1006.eqiad.wmnet with OS bullseye [17:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:35] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1006 (**FAIL**) - Removed from Puppet and PuppetD... [17:28:38] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:28:45] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye [17:28:47] uh oh [17:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:49] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye [17:28:58] jbond ah OK, might be another patch shortly then ;) [17:29:13] its not an issue just a heads up [17:29:46] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:30:04] ^ this was certspotter, should be fixed [17:30:30] !log akosiaris@cumin1001 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host kubernetes1018.eqiad.wmnet with OS bullseye [17:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:48] PROBLEM - Host kubernetes1020 is DOWN: PING CRITICAL - Packet loss = 100% [17:31:00] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 120 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:31:58] RECOVERY - Host kubernetes1020 is UP: PING OK - Packet loss = 0%, RTA = 2.55 ms [17:32:09] (03PS3) 10Jbond: P:environment: enable export_systemd_env in cloud [puppet] - 10https://gerrit.wikimedia.org/r/771576 [17:32:30] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 47 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:32:47] (03PS4) 10Jbond: P:environment: Add no_proxy values to the default environment [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) [17:33:04] (03PS4) 10Jbond: P:java: update profile::java to use systemd::environment [puppet] - 10https://gerrit.wikimedia.org/r/771415 [17:33:26] (KubernetesCalicoDown) firing: kubernetes1022.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [17:33:43] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1020.eqiad.wmnet with OS bullseye [17:33:44] PROBLEM - Host kubernetes1021 is DOWN: PING CRITICAL - Packet loss = 100% [17:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:24] RECOVERY - Host kubernetes1021 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms 
[17:34:46] !log dcaro@cumin1001 START - Cookbook sre.dns.wipe-cache cloudcephmon1003.eqiad.wmnet on all recursors [17:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:49] !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephmon1003.eqiad.wmnet on all recursors [17:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:24] (03PS3) 10Aaron Schulz: rdbms: fix owner id and RELEASE_ALL_LOCKS query in session flushing methods [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770938 (https://phabricator.wikimedia.org/T292239) [17:35:33] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1019.eqiad.wmnet with OS bullseye [17:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: move cloudcephmon1003.eqiad.wmnet from rack B2 to rack C8 - https://phabricator.wikimedia.org/T303058 (10Cmjohnson) [17:36:41] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1021.eqiad.wmnet with OS bullseye [17:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:31] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1022.eqiad.wmnet with OS bullseye [17:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: move cloudcephmon1003.eqiad.wmnet from rack B2 to rack C8 - https://phabricator.wikimedia.org/T303058 (10Cmjohnson) 05Open→03Resolved Made this more complicated than it needed to be, I didn't realize that IP would not... 
[17:38:26] (KubernetesCalicoDown) resolved: (2) kubernetes1021.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [17:41:27] !log uploaded prometheus-openstack-exporter 0.0.8-4~wmf1 to bullseye-wikimedia (T302178) [17:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:31] T302178: prometheus-openstack-exporter No module named 'urlparse' - https://phabricator.wikimedia.org/T302178 [17:41:43] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:05] (03PS1) 10Elukey: WIP - initial debianization [debs/istio] - 10https://gerrit.wikimedia.org/r/771670 [17:45:46] (03CR) 10Volans: [C: 03+1] "LGTM for start testing it." [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [17:45:56] (03PS1) 10Jbond: P:environment: fix systemd-environment variable injection for bash [puppet] - 10https://gerrit.wikimedia.org/r/771671 [17:46:30] (03CR) 10Razzi: [C: 03+2] Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [17:46:32] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:30] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dumpsdata1006.eqiad.wmnet with OS bullseye [17:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:33] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1006 
(**FAIL**) - Removed from Puppet and PuppetD... [17:48:14] (03CR) 10Jeena Huneidi: "Hi, I see there are some new changes to this patch. Just wondering if we should make a new cherry pick of the original patch since it had " [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770938 (https://phabricator.wikimedia.org/T292239) (owner: 10Aaron Schulz) [17:48:35] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/771671 (owner: 10Jbond) [17:48:57] (03CR) 10Jbond: [C: 03+2] P:environment: fix systemd-environment variable injection for bash [puppet] - 10https://gerrit.wikimedia.org/r/771671 (owner: 10Jbond) [17:50:03] (03Merged) 10jenkins-bot: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [17:50:37] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:03] (03PS1) 10BBlack: geodns: maxmind now has CY in EU rather than AS [dns] - 10https://gerrit.wikimedia.org/r/771672 (https://phabricator.wikimedia.org/T304089) [17:52:33] (03CR) 10BBlack: [C: 03+2] geodns: maxmind now has CY in EU rather than AS [dns] - 10https://gerrit.wikimedia.org/r/771672 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [17:53:15] (03PS1) 10Jbond: P:environment: actully set the content [puppet] - 10https://gerrit.wikimedia.org/r/771673 [17:53:38] (03CR) 10Ayounsi: [C: 03+1] geodns: maxmind now has CY in EU rather than AS [dns] - 10https://gerrit.wikimedia.org/r/771672 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [17:54:39] (03CR) 10jerkins-bot: [V: 04-1] rdbms: fix owner id and RELEASE_ALL_LOCKS query in session flushing methods [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770938 (https://phabricator.wikimedia.org/T292239) (owner: 10Aaron Schulz) [17:55:00] (03CR) 10Jbond: [C: 03+2] 
P:environment: actully set the content [puppet] - 10https://gerrit.wikimedia.org/r/771673 (owner: 10Jbond) [17:55:31] (03PS1) 10Dwisehaupt: Add public dns entry for civi1002 [dns] - 10https://gerrit.wikimedia.org/r/771675 (https://phabricator.wikimedia.org/T296409) [17:56:10] (03PS1) 10Elukey: install_server: use the new flat-noswap recipe for k8s masters [puppet] - 10https://gerrit.wikimedia.org/r/771676 (https://phabricator.wikimedia.org/T299634) [17:56:18] (03PS2) 10Dwisehaupt: Add public dns entry for civi1002 [dns] - 10https://gerrit.wikimedia.org/r/771675 (https://phabricator.wikimedia.org/T296409) [17:56:42] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:57:56] (03PS1) 10Jbond: fix profile [puppet] - 10https://gerrit.wikimedia.org/r/771677 [17:58:08] (03CR) 10Jbond: [V: 03+2 C: 03+2] fix profile [puppet] - 10https://gerrit.wikimedia.org/r/771677 (owner: 10Jbond) [17:58:35] (03CR) 10Jgreen: [C: 03+2] Add public dns entry for civi1002 [dns] - 10https://gerrit.wikimedia.org/r/771675 (https://phabricator.wikimedia.org/T296409) (owner: 10Dwisehaupt) [18:00:05] jeena and dancy: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220317T1800). 
[18:00:28] Train is still blocked, but I'll do a backport for https://phabricator.wikimedia.org/T304002 [18:00:44] (03CR) 10Zabe: geodns: remove geo-maps-esams-offline hack (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/771631 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [18:01:06] (03CR) 10JMeybohm: [C: 03+1] install_server: use the new flat-noswap recipe for k8s masters [puppet] - 10https://gerrit.wikimedia.org/r/771676 (https://phabricator.wikimedia.org/T299634) (owner: 10Elukey) [18:01:08] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye [18:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:13] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye [18:04:38] (03PS2) 10BBlack: geodns: remove geo-maps-esams-offline hack [dns] - 10https://gerrit.wikimedia.org/r/771631 (https://phabricator.wikimedia.org/T304089) [18:04:40] (03PS2) 10BBlack: geodns: add drmrs fallback for esams to whole map [dns] - 10https://gerrit.wikimedia.org/r/771632 (https://phabricator.wikimedia.org/T304089) [18:07:17] (03CR) 10Elukey: [C: 03+2] install_server: use the new flat-noswap recipe for k8s masters [puppet] - 10https://gerrit.wikimedia.org/r/771676 (https://phabricator.wikimedia.org/T299634) (owner: 10Elukey) [18:10:43] (03PS21) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) [18:11:06] (03PS22) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) [18:12:04] (03CR) 10Jbond: [C: 03+2] sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox 
puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [18:12:48] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1006.eqiad.wmnet with OS bullseye [18:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:53] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1006 (**FAIL**) - Removed from Puppet and PuppetD... [18:14:53] (03CR) 10Jeena Huneidi: [C: 03+2] "backport" [skins/Vector] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771395 (https://phabricator.wikimedia.org/T304002) (owner: 10Jdlrobson) [18:18:25] !log cordon kubernetes10{18..22} T293728 [18:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:29] T293728: setup/install kubernetes10[18-22] - https://phabricator.wikimedia.org/T293728 [18:18:33] hey, i logged in to mwlog1002 and a bunch of declares showed up. Why does that happen? [18:19:00] looks like this (and a lot of other declares) https://www.irccloud.com/pastebin/CfkuLlvt/ [18:20:33] urbanecm: see -sre [18:20:40] thanks [18:22:01] (03PS1) 10RobH: testing new partman recipe for h750 [puppet] - 10https://gerrit.wikimedia.org/r/771679 (https://phabricator.wikimedia.org/T302937) [18:22:40] (03CR) 10RobH: [C: 03+2] testing new partman recipe for h750 [puppet] - 10https://gerrit.wikimedia.org/r/771679 (https://phabricator.wikimedia.org/T302937) (owner: 10RobH) [18:23:47] urbanecm: i see same on puppetmaster1001 [18:23:58] i think someone must have set a bad fleetwide variable meaning to do so perhaps to a few hosts? 
[18:24:09] maybe [18:24:13] nm, i see above about other channel, i check there [18:24:36] fix has been deployed by others [18:24:43] should be out to all hosts by 30 past [18:24:49] (y) [18:25:10] 10SRE, 10observability, 10Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) @colewhite on `logging-logstash-01.logging.eqiad1.wikimedia.cloud` there is /etc/ssl/localcerts/wmf-java-cacerts, a jks that should contain the two Root CA certs t... [18:27:01] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye [18:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:07] 10SRE, 10DC-Ops, 10Patch-For-Review: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye [18:29:12] (03Merged) 10jenkins-bot: Fix updateUserLinksDropdownItems not being called [skins/Vector] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771395 (https://phabricator.wikimedia.org/T304002) (owner: 10Jdlrobson) [18:30:04] jeena: I'm reverting the whole chain [18:30:45] Amir1: got it. Is it okay for me to continue with a backport for https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/771395? 
cc Jdlrobson [18:30:52] sure [18:30:59] I have to be in a meeting rn [18:30:59] (03CR) 10Umherirrender: "This does not help, try to use I0828257d6dd0bbc5b1633afde5ff162e96169675 instead" [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771398 (https://phabricator.wikimedia.org/T292239) (owner: 10Hashar) [18:31:03] ok thanks [18:35:08] Jdlrobson: The changes should be on mwdebug whenever you are ready to test [18:36:49] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:37:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:38:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:48] jeena: testing sorry [18:40:01] np just wanted to make sure you knew [18:40:41] jeena: which debug server? [18:40:55] 1001 [18:41:11] cool seeing it fixed there [18:41:23] but not 1002 [18:41:44] That's expected right? [18:44:43] If all is well I will go ahead and sync [18:51:07] ok...syncing now [18:51:45] yep sorry for not being clear [18:51:49] thanks jeena [18:51:55] :D thanks for testing! 
[18:53:06] !log jhuneidi@deploy1002 Synchronized php-1.38.0-wmf.26/skins/Vector/includes/Hooks.php: Backport: [[gerrit:771395|Fix updateUserLinksDropdownItems not being called (T304002)]] (duration: 00m 50s) [18:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:10] T304002: Some mediawiki pages showing duplicate login + unstyled login status in user menu - https://phabricator.wikimedia.org/T304002 [18:55:42] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dumpsdata1006.eqiad.wmnet with OS bullseye [18:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:49] 10SRE, 10DC-Ops, 10Patch-For-Review: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1006 (**FAIL**) - Removed f... [19:01:00] (03CR) 10Cathal Mooney: Add ACL filter to Spine switch interface connecting CR routers Eqiad (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/771461 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [19:04:39] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) >>! In T297913#7785685, @MoritzMuehlenhoff wrote: >>>! In T297913#7782921, @RobH wrote: >> Echo of my testing so far: >> >> setting the drive info via show and setting it to on or offline works, b... 
[19:05:09] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:08:05] (03PS1) 10JMeybohm: Introduce cert-manager alerts [alerts] - 10https://gerrit.wikimedia.org/r/771687 (https://phabricator.wikimedia.org/T304092) [19:12:03] jeena: now, backport time :D https://gerrit.wikimedia.org/r/q/owner:Ladsgroup%2540gmail.com [19:12:18] five patches [19:12:25] :O [19:12:36] lmk how I can assist [19:12:54] you want on wmf.26? [19:13:26] yeah [19:13:40] okie dokie, let's start [19:14:22] (03PS1) 10Ladsgroup: Revert "rdbms: provide $owner argument in LoadBalancer::flushPrimarySessions()" [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771658 [19:15:01] (03CR) 10Ladsgroup: [C: 03+2] Revert "rdbms: provide $owner argument in LoadBalancer::flushPrimarySessions()" [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771658 (owner: 10Ladsgroup) [19:20:34] (03PS1) 10Ladsgroup: Revert "rdbms: Followups to automatic connection recovery patch" [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771660 [19:20:46] (03PS2) 10Ladsgroup: Revert "rdbms: Followups to automatic connection recovery patch" [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771660 [19:22:03] (03CR) 10Ladsgroup: [C: 03+2] Revert "rdbms: Followups to automatic connection recovery patch" [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771660 (owner: 10Ladsgroup) [19:29:51] Amir1, any objections to backport the reverts (when they apply) to REL1_38? [19:30:21] zabe: go ahead, do you need +2? [19:30:41] i can do it [19:31:10] awesome, thanks for doing it! 
[19:36:01] (03Merged) 10jenkins-bot: Revert "rdbms: provide $owner argument in LoadBalancer::flushPrimarySessions()" [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771658 (owner: 10Ladsgroup) [19:38:36] (03CR) 10jerkins-bot: [V: 04-1] Revert "rdbms: Followups to automatic connection recovery patch" [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771660 (owner: 10Ladsgroup) [19:40:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:41:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:38] (03CR) 10Ladsgroup: [C: 03+2] "retrying" [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771660 (owner: 10Ladsgroup) [19:44:09] if it fails again, I'll force merge it [19:44:29] okay [19:44:49] (03PS1) 10Lucas Werkmeister: Remove changetags right from users on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771691 (https://phabricator.wikimedia.org/T303682) [19:47:25] (03PS2) 10Lucas Werkmeister: Remove changetags right from users on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771691 (https://phabricator.wikimedia.org/T303682) [19:53:08] (03CR) 10Aaron Schulz: rdbms: fix owner id and RELEASE_ALL_LOCKS query in session flushing methods (031 comment) [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770938 (https://phabricator.wikimedia.org/T292239) (owner: 10Aaron 
Schulz) [19:54:26] (03CR) 10Aaron Schulz: "CI is still broken without https://gerrit.wikimedia.org/r/c/mediawiki/core/+/771496/" [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770938 (https://phabricator.wikimedia.org/T292239) (owner: 10Aaron Schulz) [19:55:06] (03PS1) 10Ladsgroup: SuiteEventsTrait: don't call setUp() for an empty suite [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771662 (https://phabricator.wikimedia.org/T292239) [19:55:13] (03CR) 10Ladsgroup: [C: 03+2] SuiteEventsTrait: don't call setUp() for an empty suite [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771662 (https://phabricator.wikimedia.org/T292239) (owner: 10Ladsgroup) [19:56:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T300775)', diff saved to https://phabricator.wikimedia.org/P22795 and previous config saved to /var/cache/conftool/dbconfig/20220317-195613-marostegui.json [19:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:18] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [19:59:22] (03Merged) 10jenkins-bot: Revert "rdbms: Followups to automatic connection recovery patch" [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771660 (owner: 10Ladsgroup) [20:00:04] brennen: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220317T2000). [20:00:04] cjming and xSavitar: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:15] o/ [20:00:33] o/ [20:01:23] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:03:13] holding backport window 'til train blockers are clear. [20:04:15] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:07:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:20] Amir1: would sneaking in a config patch disrupt your backports? [20:07:32] thcipriani: none at all [20:07:39] (thank you for the backports by the way <3) [20:07:42] (03PS4) 10C. Scott Ananian: Revert "Enable Parsoid API everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763779 (https://phabricator.wikimedia.org/T302081) [20:07:42] here o/ sorry i'm late [20:07:56] Amir1: I'm going to push out a quick one then [20:07:59] hey cjming [20:08:14] Amir1, cjming I snuck in a config patch onto the backports list too. 
[20:08:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:08:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:12] (03CR) 10RLazarus: [C: 03+2] kubernetes: Upgrade default envoy version to 1.18.3 [puppet] - 10https://gerrit.wikimedia.org/r/771053 (https://phabricator.wikimedia.org/T300324) (owner: 10RLazarus) [20:09:20] (03PS2) 10RLazarus: kubernetes: Upgrade default envoy version to 1.18.3 [puppet] - 10https://gerrit.wikimedia.org/r/771053 (https://phabricator.wikimedia.org/T300324) [20:09:36] Is 1.38-wmf.26 still blocked? [20:09:41] 10SRE, 10ops-eqiad: rack spare switches in c1-eqiad - https://phabricator.wikimedia.org/T185337 (10Cmjohnson) 05Open→03Resolved [20:09:42] cscott: jup [20:10:26] Any chance it gets deployed tomorrow, or are we going to push wmf.26 into the trainsperiment week? 
[20:10:41] I can push stuff tomorrow morning if needed [20:10:55] by morning I mean 8:00 UTC / 12 hours from now [20:11:05] (03Merged) 10jenkins-bot: SuiteEventsTrait: don't call setUp() for an empty suite [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771662 (https://phabricator.wikimedia.org/T292239) (owner: 10Ladsgroup) [20:11:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P22796 and previous config saved to /var/cache/conftool/dbconfig/20220317-201118-marostegui.json [20:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:12] (because of T298046 and a dependency between core and Parsoid, Parsoid can't run its regression tests until wmf.26 rolls all the way out) [20:12:13] T298046: Provide a way to run Parsoid against "latest git HEAD of mediawiki-vendor" during round trip testing on scandium - https://phabricator.wikimedia.org/T298046 [20:13:07] I'm planning to roll forward after we get the reverts Amir is working on done [20:13:15] (03PS1) 10SBassett: Set StopForumSpam to enforce on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771696 (https://phabricator.wikimedia.org/T304111) [20:14:20] jeena: if something is blocked for group 2 later today and that requires some code by Amir, I can move things tomorrow [20:14:37] thanks hashar [20:14:57] and I might get assistance from the mediawiki savvy people who are in my timezone [20:15:38] cscott: you just have a config patch, is that right? [20:16:30] thcipriani: yep [20:16:44] I can get that out quickly while we're waiting for other backports (I hope) [20:16:49] (03CR) 10Thcipriani: [C: 03+2] Revert "Enable Parsoid API everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763779 (https://phabricator.wikimedia.org/T302081) (owner: 10C. 
Scott Ananian) [20:17:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10Jclark-ctr) [20:17:35] ok, so we haven't given up on wmf.26 for this week yet, yay! [20:17:40] (talked to other folks in the window, and mentioned I wanted to delay for the train backports in progress) [20:17:53] hopefully i can kick off a run of parsoid's regression tests against our latest version tomorrow then. [20:19:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:18] (03Merged) 10jenkins-bot: Revert "Enable Parsoid API everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763779 (https://phabricator.wikimedia.org/T302081) (owner: 10C. Scott Ananian) [20:19:47] Amir1: Remaining is a cherry pick to wmf.26 of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/771657, right? [20:20:05] cscott: any way to test your change on mwdebug? If so it's live on mwdebug1002 now.
[20:20:10] jeena: two this https://gerrit.wikimedia.org/r/c/mediawiki/core/+/771657/2 and the top [20:20:19] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/771655/2 [20:20:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:20:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:24] ah ok, thanks [20:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:53] (03PS1) 10Ladsgroup: Revert "rdbms: make automatic connection recovery apply to more cases" [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771663 [20:22:03] (03CR) 10Ladsgroup: [C: 03+2] Revert "rdbms: make automatic connection recovery apply to more cases" [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771663 (owner: 10Ladsgroup) [20:22:51] thcipriani: yes, i think I can test on mwdebug, let me see [20:24:37] (basically, https://en.wikipedia.org/api/rest_v1/page/html/Dog and https://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/html/Dog should give the same content; this patch is turning off the second one, and I should be able to test that with mwdebug) [20:26:05] cscott: cool, I get a 404 from mwdebug1002 for that second url -- does that mean it's good to sync everywhere? 
[20:26:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P22797 and previous config saved to /var/cache/conftool/dbconfig/20220317-202623-marostegui.json [20:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:40] thcipriani: yep. [20:27:49] cool, going live :) [20:28:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:28:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:42] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:20] !log thcipriani@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:763779|Revert "Enable Parsoid API everywhere" (T302081)]] (duration: 00m 50s) [20:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:26] ^ cscott should be live [20:30:41] now I'm holding off on the rest of the backport window until post the rdbms reverts [20:32:41] thcipriani: looks good to me [20:32:48] great :) [20:33:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:25] FYI: this was an internal REST API that's live on the parsoid cluster and 
accessed via internal network from restbase, but was never meant for direct public use (it's completely uncached, for example) [20:33:25] thcipriani: Happy to be useful sometimes, let me know if I can help with anything [20:34:00] Tim turned on public access to it a few months ago in order to do some performance benchmarking, but I wanted to make sure it got turned off before anyone actually discovered it and started using it. [20:34:59] Amir1: you're useful all the time :) [20:35:26] but if for some reason this did escape and something important ended up with the wrong URL endpoint, this config change is perfectly safe to revert [20:35:38] "before anyone actually discovered it" cf: https://www.hyrumslaw.com/ [20:35:45] *exactly* [20:35:49] good to know [20:35:56] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10Cmjohnson) [20:37:27] thcipriani: Hyrum is the main author of one of my favorite books in software engineering (https://www.goodreads.com/book/show/48816586-software-engineering-at-google) [20:37:38] it's a bit long (600-ish pages) but it's really good [20:38:17] (03Merged) 10jenkins-bot: Revert "rdbms: make automatic connection recovery apply to more cases" [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771663 (owner: 10Ladsgroup) [20:38:34] oh! I've seen this book and I know the eponymous law, but I've never put that together!
[20:39:05] jeena: okay, wish me luck, I'm going to deploy all together, it's going to be "fun" [20:39:06] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host ml-cache1001.mgmt.eqiad.wmnet with reboot policy FORCED [20:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:20] Amir1: yay 😅 [20:39:32] good luck 🍀 [20:39:42] I'll watch the logs [20:39:57] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host ml-cache1002.mgmt.eqiad.wmnet with reboot policy FORCED [20:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:58] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:07] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host ml-cache1003.mgmt.eqiad.wmnet with reboot policy FORCED [20:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T300775)', diff saved to https://phabricator.wikimedia.org/P22798 and previous config saved to /var/cache/conftool/dbconfig/20220317-204128-marostegui.json [20:41:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [20:41:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [20:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:33] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [20:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:31] 10SRE, 10ops-eqiad, 10DC-Ops, 
10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10Cmjohnson) [20:43:35] jeena: during the deploy, there will be a lot of errors because of random order of arrival of files [20:43:55] okay [20:44:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:45:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:41] !log ladsgroup@deploy1002 Started scap: Revert "rdbms: Followups to automatic connection recovery patch" [20:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:50] going with sync-world [20:45:54] there is no easy way :/ [20:45:59] heheh [20:46:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:42] haha, there wasn't much [20:49:59] oh did it finish already? [20:50:14] nope [20:50:16] on the fly [20:50:20] ah [20:50:24] but canaries didn't go weeeeeeeee [20:50:31] 32% [20:53:28] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Ladsgroup) Awesome, thanks. I will deploy this tomorrow. [20:54:11] (03CR) 10Ladsgroup: "I will deploy this on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/770890 (owner: 10Ladsgroup) [20:54:52] jeena: apache synced [20:54:58] woohoo [20:55:08] Thanks Amir1! 
[20:55:24] hopefully errors should go down [20:56:02] There are a few new errors, maybe resulting from the deployment that I'll watch [20:57:31] !log ladsgroup@deploy1002 Finished scap: Revert "rdbms: Followups to automatic connection recovery patch" (duration: 11m 50s) [20:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:14] jeena: what is the dashboard? mediawiki-NEW-errors? [20:59:16] (03PS7) 10Bking: elasticsearch: remove custom restart handling [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) [21:00:36] I'm trying to load that but having trouble. I see " i/l/r/d/DatabaseMysqli:42 PHP Fatal Error: Class Wikimedia\Rdbms\DatabaseMysqli" and "i/l/r/d/DBConnRef:30 PHP Fatal Error: Class Wikimedia\Rdbms\DBConnRef contains 1 abstract method" on logspam-watch [21:01:05] actually they both have the same "contains 1 abstract method" error message [21:01:33] but the errors haven't increased since I noticed them [21:01:45] that seems to be related to the sync of the reverts [21:01:53] yeah that's what I was wondering too [21:02:06] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: remove custom restart handling [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [21:03:16] looks like the last of those errors was at 55 past the hour [21:03:27] a little surge of them [21:03:35] but ended after the sync [21:03:45] the revert removes the flushSession method from IDatabase and Database/DBConnRef and thus due to the random arrival of files there was the situation of the method being defined in IDatabase but not DBConnRef, resulting in the fatal [21:04:03] ok, sounds like a weird rsync artifact [21:04:15] "fun" :D [21:04:28] It seems like we are good to proceed then [21:05:01] +1 [21:05:18] I close the subtickets [21:05:32] 👍 [21:05:49] thcipriani: do you want to do the backports before I roll forward? 
[21:06:24] jeena: that's what we were just discussing, I think I'll let you roll forward to group1 and then run backports if that's ok with you? [21:06:49] less time pressure that way maybe? More time to let group1 bake? what do you think of that plan? [21:06:51] okay [21:06:54] sounds good [21:06:57] cool [21:08:29] (03PS1) 10Jeena Huneidi: group1 wikis to 1.38.0-wmf.26 refs T300202 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771702 [21:08:31] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.38.0-wmf.26 refs T300202 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771702 (owner: 10Jeena Huneidi) [21:09:12] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.26 refs T300202 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771702 (owner: 10Jeena Huneidi) [21:10:25] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.26 refs T300202 [21:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:30] T300202: 1.38.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T300202 [21:11:16] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.26 refs T300202 (duration: 00m 50s) [21:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:54] (03CR) 10Clare Ming: [C: 03+2] Update invalid skin preference update script [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771394 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [21:14:39] (03Merged) 10jenkins-bot: Update invalid skin preference update script [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771394 (https://phabricator.wikimedia.org/T299104) (owner: 10Jdlrobson) [21:14:46] (03PS1) 10D3r1ck01: Add & improve message for the chapter/thorg application contact form [extensions/WikimediaMessages] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771665 [21:16:38] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:27] (03Abandoned) 10Aaron Schulz: rdbms: fix owner id and RELEASE_ALL_LOCKS query in session flushing methods [core] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/770938 (https://phabricator.wikimedia.org/T292239) (owner: 10Aaron Schulz) [21:17:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:17:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:41] !log cjming@deploy1002 Synchronized php-1.38.0-wmf.26/extensions/WikimediaMaintenance/T299104.php: Backport: [[gerrit:771394|Update invalid skin preference update script (T299104)]] (duration: 00m 51s) [21:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:45] T299104: Prepare script to update invalid user preferences after skins have been separated - https://phabricator.wikimedia.org/T299104 [21:22:35] rolling out the envoy upgrade to a number of k8s services -- no conflict with the B&C window, it'll just be noisy in here :) any objections? [21:23:09] cjming: ^ [21:23:24] that should be fine [21:23:33] fine by me [21:23:37] thanks! 
going [21:23:54] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/apertium: apply [21:23:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:14] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/apertium: apply [21:24:15] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [21:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:30] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [21:24:31] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [21:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:52] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [21:24:53] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [21:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:20] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [21:25:21] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [21:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:25:26] !log mwdebug-deploy@deploy1002 helmfile 
[codfw] START helmfile.d/services/mwdebug: apply [21:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:48] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [21:25:49] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [21:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:16] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [21:26:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:26:17] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [21:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:43] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [21:26:44] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [21:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:15] cscott: could something like this be caused by what I backported for you?
https://logstash.wikimedia.org/app/dashboards#/view/AXFV7JE83bOlOASGccsT?_g=(time:(from:'2022-03-16T21:24:55.000Z',to:'2022-03-17T21:28:27.316Z')) [21:30:52] (error message: Wikimedia\Assert\InvariantException: Invariant failed: Expecting : in parser function definiton for some rest.php urls) [21:34:07] (03CR) 10Thcipriani: [C: 03+2] Add & improve message for the chapter/thorg application contact form [extensions/WikimediaMessages] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771665 (owner: 10D3r1ck01) [21:35:53] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [21:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:31] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [21:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:33] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [21:41:35] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply [21:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:43] (03PS1) 10Brennen Bearnes: Revert "Revert "Enable Parsoid API everywhere"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771707 [21:41:51] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [21:41:52] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [21:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:04] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "Revert "Enable Parsoid API everywhere"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771707 (owner: 10Brennen Bearnes) [21:42:13] !log rzl@deploy1002 helmfile [staging] 
DONE helmfile.d/services/mobileapps: apply [21:42:14] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [21:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:34] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [21:42:35] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [21:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:48] (03Merged) 10jenkins-bot: Revert "Revert "Enable Parsoid API everywhere"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771707 (owner: 10Brennen Bearnes) [21:43:35] (03CR) 10Volans: "Given that fixing the noqa ignore at the top of the file has unveiled a lot of other style/formatting issues, up to you if you prefer to s" [cookbooks] - 10https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [21:44:09] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [21:44:10] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [21:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:31] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [21:44:31] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [21:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:52] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [21:44:53] !log rzl@deploy1002 helmfile 
[staging] START helmfile.d/services/zotero: apply [21:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:15] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [21:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:45] 10SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (10Ladsgroup) This would greatly help us in finding issues caused by the rollout of video.js as our new video player (see {T100106} and subtickets) which TheDJ has basically built it. [21:47:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:47:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:23] !log brennen@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:771707|Revert "Revert "Enable Parsoid API everywhere""]] (duration: 00m 51s) [21:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:03] (03Merged) 10jenkins-bot: Add & improve message for the chapter/thorg application contact form [extensions/WikimediaMessages] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771665 (owner: 10D3r1ck01) [21:52:03] jeena: is everything good with the train? 
[21:53:08] We're still doing backports and also investigating whether this error should affect the train https://phabricator.wikimedia.org/T304118 [21:53:35] thcipriani just said it shouldn't hold the train [21:55:06] cool, I go afk then, my number is in contact list. If suddenly things go really bad, call [21:55:22] Thanks for all your help Amir1 [21:55:34] sorry didn't realize you were waiting for roll to all wikis [21:55:50] (03PS1) 10Brennen Bearnes: Revert "Revert "Revert "Enable Parsoid API everywhere""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771712 [21:56:12] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "Revert "Revert "Enable Parsoid API everywhere""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771712 (owner: 10Brennen Bearnes) [21:57:03] (03Merged) 10jenkins-bot: Revert "Revert "Revert "Enable Parsoid API everywhere""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771712 (owner: 10Brennen Bearnes) [22:00:19] !log brennen@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:771712|Revert "Revert "Revert "Enable Parsoid API everywhere"""]] (duration: 00m 51s) [22:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:05:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:49] !log derick@deploy1002 Started scap: Backport: [[gerrit:771665|Add & improve message for the chapter/thorg application contact form]] [22:05:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:26] !log derick@deploy1002 Finished scap: Backport: [[gerrit:771665|Add & improve message for the chapter/thorg application contact form]] (duration: 11m 37s)
[22:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:00] (CR) Thcipriani: [C: +2] Add new field to capture application URL link on Meta [mediawiki-config] - https://gerrit.wikimedia.org/r/771606 (owner: D3r1ck01)
[22:19:54] (Merged) jenkins-bot: Add new field to capture application URL link on Meta [mediawiki-config] - https://gerrit.wikimedia.org/r/771606 (owner: D3r1ck01)
[22:21:07] SRE, envoy, serviceops: Refactor envoy max_requests_per_connection from Cluster to HttpProtocolOptions - https://phabricator.wikimedia.org/T304124 (RLazarus)
[22:25:18] (PS1) Cwhite: beta-logs: use new kafka truststore [puppet] - https://gerrit.wikimedia.org/r/771737 (https://phabricator.wikimedia.org/T300130)
[22:26:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:33] SRE, envoy, serviceops: Refactor envoy max_requests_per_connection from Cluster to HttpProtocolOptions - https://phabricator.wikimedia.org/T304124 (RLazarus) Open→Stalled p:Triage→Low
[22:26:37] SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus)
[22:28:02] SRE, observability, Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (colewhite) >>! In T300130#7786731, @elukey wrote: > What do you think? The new truststore works.
Let's have Logstash use it.
[22:28:15] (PS8) Bking: elasticsearch: remove custom restart handling [cookbooks] - https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955)
[22:28:53] !log derick@deploy1002 Synchronized wmf-config/MetaContactPages.php: Config: [[gerrit:771606|Add new field to capture application URL link on Meta]] (duration: 00m 50s)
[22:28:53] (CR) Cwhite: [C: +2] beta-logs: use new kafka truststore [puppet] - https://gerrit.wikimedia.org/r/771737 (https://phabricator.wikimedia.org/T300130) (owner: Cwhite)
[22:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:43] (CR) jerkins-bot: [V: -1] elasticsearch: remove custom restart handling [cookbooks] - https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: Bking)
[22:31:53] Backport window is over. Deploying wmf.26 to all wikis
[22:32:19] (PS1) Jeena Huneidi: all wikis to 1.38.0-wmf.26 refs T300202 [mediawiki-config] - https://gerrit.wikimedia.org/r/771740
[22:32:21] (CR) Jeena Huneidi: [C: +2] all wikis to 1.38.0-wmf.26 refs T300202 [mediawiki-config] - https://gerrit.wikimedia.org/r/771740 (owner: Jeena Huneidi)
[22:33:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:33:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:29] (PS9) Ryan Kemper: elasticsearch: remove custom restart handling [cookbooks] - https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: Bking)
[22:34:59] (Merged) jenkins-bot: all wikis to 1.38.0-wmf.26 refs T300202 [mediawiki-config] - https://gerrit.wikimedia.org/r/771740 (owner: Jeena Huneidi)
[22:35:14] SRE, Traffic, envoy,
serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus)
[22:36:10] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.26 refs T300202
[22:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:36:14] T300202: 1.38.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T300202
[22:37:18] (PS10) Ryan Kemper: elasticsearch: remove custom restart handling [cookbooks] - https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: Bking)
[22:39:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:53] (CR) jerkins-bot: [V: -1] elasticsearch: remove custom restart handling [cookbooks] - https://gerrit.wikimedia.org/r/771072 (https://phabricator.wikimedia.org/T301955) (owner: Bking)
[22:44:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:51:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:54:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:30] (Abandoned) Bartosz Dziewoński: tests: Fix @group Broken on MediaWikiIntegrationTestCaseSchemaTest [core] (wmf/1.38.0-wmf.26) - https://gerrit.wikimedia.org/r/771398 (https://phabricator.wikimedia.org/T292239) (owner:
Hashar)
[23:12:21] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:44:39] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook